Research projects
There are some great resources on good data organization, such as the OSF’s guide. Here, I’ll document the aspects of my protocols that work well.
Main ideas
✅ DO: Designate data as either living or frozen. Code should never read from living files and should never write to frozen files.
✅ DO: Indicate manually curated / edited data as such. When ready, make a frozen copy and add a timestamp. Let scripts read the frozen copy.
❌ DO NOT: Have scripts read living data.
❌ DO NOT: Just organize into input/ and output/ directories.
✅ DO: Store basic machine-readable metadata alongside the data. (Don’t store information that can change, such as comments, project names, or people’s names.)
✅ DO: Reorganize your files if your current structure isn’t working.
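The living/frozen rule can be enforced in code rather than by convention alone. The following is a minimal sketch, assuming frozen data lives under data/frozen/ relative to the working directory; the names FROZEN_DIR, is_frozen, check_readable, and check_writable are hypothetical helpers, not part of any library.

```python
from pathlib import Path

# Assumed location of frozen data, relative to the project root.
FROZEN_DIR = Path("data") / "frozen"


def is_frozen(path: Path) -> bool:
    """Return True if `path` lies inside the frozen directory (Python 3.9+)."""
    return path.resolve().is_relative_to(FROZEN_DIR.resolve())


def check_readable(path: Path) -> Path:
    """Let scripts read only frozen data; anything else counts as living."""
    if not is_frozen(path):
        raise PermissionError(f"Refusing to read living data: {path}")
    return path


def check_writable(path: Path) -> Path:
    """Let scripts write anywhere except the frozen directory."""
    if is_frozen(path):
        raise PermissionError(f"Refusing to write to frozen data: {path}")
    return path
```

A script would wrap each path before opening it, for example `open(check_readable(Path("data/frozen/name-mapping-2022-01-14.tsv")))`.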
Example
├── src/
│   └── pkg/
│       ├── __init__.py
│       ├── generate_raw_name_mapping.py
│       └── analyze.py
├── data/
│   ├── temp-output/
│   │   ├── raw-name-mapping.tsv
│   │   └── figures/
│   ├── living/
│   │   ├── name-mapping.tsv
│   │   └── figures/
│   │       └── samples.pdf
│   └── frozen/
│       ├── name-mapping-2022-01-14.tsv
│       ├── microscopy/
│       │   └── 20220615T164555.confocal3/
│       │       ├── 20220615T164555.confocal3.tif
│       │       └── 20220615T164555.confocal3.json
│       └── reference/
│           ├── weird-sample-analysis-2022-01-24.ipynb
│           └── weird-samples-2022-01-24.pdf
└── README.md
Here, temp-output/ and living/ both contain living data. temp-output/ can be overwritten at any time, while living/ contains manually curated files. You should probably avoid checking temp-output/ into a repository.
How the layout is used
src/pkg/generate_raw_name_mapping.py outputs data/temp-output/raw-name-mapping.tsv. Maybe it maps microscope filenames to sample names, but that mapping can’t be fully automated. So, we copy it to data/living/name-mapping.tsv and edit it. When we’re done, we copy it to data/frozen/name-mapping-2022-01-14.tsv.
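The freezing step can be scripted so that the date stamp is applied consistently. This is a sketch only: the function name freeze is hypothetical, and marking the copy read-only is an extra precaution assumed here, not something the workflow above requires.

```python
import shutil
import stat
from datetime import date
from pathlib import Path
from typing import Optional


def freeze(living_file: Path, frozen_dir: Path, on: Optional[date] = None) -> Path:
    """Copy a living file into `frozen_dir` with a date suffix; return the copy."""
    stamp = (on or date.today()).isoformat()  # ISO date, e.g. "2022-01-14"
    target = frozen_dir / f"{living_file.stem}-{stamp}{living_file.suffix}"
    if target.exists():
        # Never overwrite an existing frozen file.
        raise FileExistsError(f"Frozen copy already exists: {target}")
    frozen_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(living_file, target)
    # Assumption: clear the write bits so accidental edits fail loudly.
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    target.chmod(target.stat().st_mode & ~write_bits)
    return target
```

Calling `freeze(Path("data/living/name-mapping.tsv"), Path("data/frozen"), on=date(2022, 1, 14))` would create data/frozen/name-mapping-2022-01-14.tsv.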
We then run src/pkg/analyze.py, which outputs figures to data/temp-output/figures/. They need a little tweaking, so we copy each to data/living/figures/ and align labels, etc. Or maybe we combine several figure panels into full figures. We want to keep this exact analysis, so we copy the finished figure to data/frozen/reference/. And we write a script to generate that figure (in this case, a Jupyter notebook). The script writes to data/temp-output/ (not data/frozen/).
JSON metadata
What’s in 20220615T164555.confocal3.json?
{
  "id": "20220615T164555.confocal3",
  "instrument": "confocal3",
  "instrument-type": "microscope",
  "datetime": "2022-06-15T16:45:55",
  "duration-ms": 185.225075,
  "roi": {
    "type": "rectangle",
    "values": {
      "top": 4101,
      "bottom": 4358,
      "left": 528,
      "right": 744
    }
  }
}
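Because the metadata is machine-readable, an analysis script can load it alongside the image. The sketch below assumes the sidecar shares the data file’s name with a .json suffix, as in the example layout, and that top/left are smaller than bottom/right; load_metadata and roi_size are hypothetical helper names.

```python
import json
from datetime import datetime
from pathlib import Path


def load_metadata(data_file: Path) -> dict:
    """Load the JSON sidecar stored next to `data_file` (same name, .json suffix)."""
    sidecar = data_file.with_suffix(".json")
    with sidecar.open(encoding="utf-8") as handle:
        return json.load(handle)


def roi_size(metadata: dict) -> tuple[int, int]:
    """Return (width, height) of a rectangular ROI, in the metadata's own units."""
    roi = metadata["roi"]
    if roi["type"] != "rectangle":
        raise ValueError(f"Unsupported ROI type: {roi['type']}")
    box = roi["values"]
    return box["right"] - box["left"], box["bottom"] - box["top"]


def acquired_at(metadata: dict) -> datetime:
    """Parse the ISO 8601 "datetime" field into a datetime object."""
    return datetime.fromisoformat(metadata["datetime"])
```

For the file above, `roi_size` gives a width of 216 and a height of 257.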