# Analysis workflow
## Data
- Raw data and sample annotations for each experiment should be stored in AWS S3. Do this even for experiments where the data is already on the Fred Hutch cluster, such as deep sequencing data.
- Use the link and login info here to access the AWS S3 bucket.
- Data for any single experiment should go into the folder `fh-pi-subramaniam-a-eco/data/REPO-NAME/USER-NAME/EXPT-TYPE/YYYYMMDD_ISSUE-NUMBER_EXPT-NUMBER_EXPT-DESCRIPTION/`.
- Every experiment folder must have an associated `sample_annotations.csv` file in the same folder. See here for how to create this file. You should create `sample_annotations.csv` even for assays that yield a single image, such as Western blots. The only exceptions are agarose gel images, Sanger sequencing data, and Nanodrop measurements (only those measurements that are unlikely to be used in a publication or analyzed quantitatively).
- You can log in to AWS S3 from any computer connected to the internet and upload your data files using the web interface. Alternatively, you can use the AWS CLI command `aws s3 cp` to copy from your local computer to the AWS S3 bucket. See the example below.
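For example, to upload a local experiment folder to its matching location in the bucket, a command along these lines should work (the local folder name is whatever you used on your computer):

```bash
# upload a local experiment folder to the S3 bucket
aws s3 cp --recursive ./YYYYMMDD_ISSUE-NUMBER_EXPT-NUMBER_EXPT-DESCRIPTION/ \
  s3://fh-pi-subramaniam-a-eco/data/REPO-NAME/USER-NAME/EXPT-TYPE/YYYYMMDD_ISSUE-NUMBER_EXPT-NUMBER_EXPT-DESCRIPTION/
```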
## Analyses
- All analysis should be performed in the project GitHub repository cloned to `/fh/scratch/delete90/subramaniam_a/USER-NAME/git/REPO-NAME`.
- Create a new branch for each analysis and merge it into `main` or `master` when the analysis is complete.
- Analysis scripts for any single experiment should go into the folder `analysis/USER-NAME/EXPT-TYPE/YYYYMMDD_ISSUE-NUMBER_EXPT-NUMBER_EXPT-DESCRIPTION/`. Use the same folder organization and naming convention that you used to store the data on S3.
- Each analysis folder should contain a `README.md` file at the top level. All other files should be stored in one of the `data`, `annotations`, `scripts`, `figures`, or `tables` subfolders.
- Raw data and sample annotations should be copied from AWS S3 into `analysis/USER-NAME/EXPT-TYPE/YYYYMMDD_ISSUE-NUMBER_EXPT-NUMBER_EXPT-DESCRIPTION/data/`. For multistep analyses such as deep sequencing, where you generate a lot of processed data, store your raw data in a subfolder of `data` such as `fastq`.
- Copy files from AWS S3 to the `data` folder using the following AWS CLI command. Store this command as `aws_s3_cp.sh` in the `data` folder and commit it to GitHub so that you can easily run it again the next time you want to perform the analysis.
```bash
aws s3 cp --recursive s3://fh-pi-subramaniam-a-eco/data/REPO-NAME/USER-NAME/EXPT-TYPE/YYYYMMDD_ISSUE-NUMBER_EXPT-NUMBER_EXPT-DESCRIPTION/ .
```
- Commit `README.md`, scripts, and small images and tables (<1 MB) to GitHub.
- All tables should be in CSV or TSV format, not XLS or ODS format.
- All figures should be in PDF format so that they can be edited in Inkscape. The only exceptions are scatter plots with a lot of points, where a PNG is suitable to save space.
- The analysis should be fully reproducible by running a single script, such as a Jupyter notebook or a Snakemake pipeline. Note that all intermediate files in the `scratch` folder will be automatically deleted after 90 days.
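Put together, a single analysis folder will look something like this (the `fastq/` subfolder applies only to multi-step analyses; other subfolders are created as needed):

```
YYYYMMDD_ISSUE-NUMBER_EXPT-NUMBER_EXPT-DESCRIPTION/
├── README.md
├── data/
│   ├── aws_s3_cp.sh
│   ├── sample_annotations.csv
│   └── fastq/
├── annotations/
├── scripts/
├── figures/
└── tables/
```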
## Software
All analysis should be performed in `singularity` containers or `conda` environments that can be reproducibly created from a `Dockerfile` or an `environment.yml` file, respectively.
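As a minimal sketch, an `environment.yml` might look like this (the name and package list are illustrative, not the lab's canonical environment):

```yaml
# environment.yml — illustrative name and packages
name: project_repo
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10
  - snakemake-minimal
  - pandas
  - jupyterlab
```

Create the environment with `conda env create -f environment.yml` (or the equivalent `mamba` command).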
We perform simple analyses on the cluster from within VSCode using the Remote SSH extension. For more complex analyses that need to run over multiple days, we run commands from a `tmux` session on the cluster to preserve the session if the connection is lost.
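For example, a typical `tmux` workflow looks like this:

```bash
tmux new -s analysis        # start a named session on the cluster
# ...launch your long-running commands inside the session...
# detach with Ctrl-b d; the commands keep running on the node
tmux attach -t analysis     # reattach later, even from a new SSH connection
```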
### Simple analyses
For quick analyses on the cluster with a single script (such as for a flow experiment), use one of the pre-created `conda` environments that mirror our lab's `project_repo` folder. You can list all available environments on the cluster using the following command. You should also be able to see the corresponding Jupyter kernel in VSCode.
```bash
conda info --envs
```
If you want to create such an environment from a `Dockerfile`, convert the `mamba create` and `mamba install` commands for that environment into a shell script and run it, as in the sketch below.
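A minimal sketch of such a script, assuming the `Dockerfile` builds its environment with `mamba` (the environment name and packages below are illustrative):

```bash
#!/bin/bash
# recreate_env.sh — mamba commands copied out of the Dockerfile (illustrative packages)
set -euo pipefail

mamba create -y -n project_repo python=3.10
mamba install -y -n project_repo -c conda-forge -c bioconda snakemake-minimal pandas jupyterlab
```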
### Multi-step analyses
- Either use one of our lab's predefined `singularity` containers pulled from our lab's package repository, or create your own `singularity` container from a `Dockerfile`.
- To pull a `singularity` container from our lab's package repository, use a `singularity pull` command like the first sketch after this list. Use a specific version number such as `1.3.0` instead of `latest` so that you can identify the container version easily.
- To create your own `singularity` container, follow TBD steps to create a `Docker` container from a `Dockerfile` and push it to the lab package repository. Then pull the container from the package repository as above.
- Incorporate your container into a `snakemake` workflow by adding the `container: LOCATION_OF_SIF_FILE` directive inside a `Snakefile` and using the `--use-singularity` flag when running `snakemake`; a minimal sketch follows this list. For examples of how to use `snakemake` with `singularity` containers, see this example and this example. Note that you can use a single `singularity` container and refer to specific `conda` environments within that container for individual rules in the `Snakefile`. You can also specify unique containers for each rule in the `Snakefile`.
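A hedged sketch of the pull command, assuming the lab's package repository is an OCI registry; the `ghcr.io/LAB-ORG/analysis` path is a placeholder, not the actual repository location:

```bash
# pull a specific container version and save it as a local .sif file
# (the registry path below is a placeholder; substitute the lab's package repository)
singularity pull analysis_1.3.0.sif docker://ghcr.io/LAB-ORG/analysis:1.3.0
```

And a minimal `Snakefile` sketch showing the `container:` directive; the rule, file names, and container path are illustrative:

```python
# Snakefile — illustrative rule; replace the container path and commands with your own
container: "analysis_1.3.0.sif"   # default container applied to every rule

rule count_reads:
    input: "data/fastq/sample.fastq.gz"
    output: "tables/read_counts.csv"
    shell: "python scripts/count_reads.py {input} {output}"
```

Run it with `snakemake --use-singularity --cores 1` so the rule's shell command executes inside the container.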
## Guidelines for specific assays
### Agarose gel images
- We usually do not need high-resolution images for analysis or publication, so save the image in `.jpg` format from the Gel-Doc computer.
- To save time, all agarose gels can go in a single folder at `fh-pi-subramaniam-a-eco/data/USER-NAME/agarose_gels`, with each image named `YYYYMMDD_expNN_short_description_image.jpg`.
- You can annotate the `.jpg` image using Inkscape. You can also invert the colors and crop the image if necessary. You can then export it as a PNG for uploading.
- The annotated gel image should be uploaded to the project GitHub repo or your own GitHub repo.