
Analysis workflow

Data

  • Raw data and sample annotations for each experiment should be stored in AWS S3. Do this even for experiments where the data is already on the Fred Hutch cluster, such as deep sequencing data.
  • Use the link and login info here to access the AWS S3 bucket.
  • Data for any single experiment should go into the folder fh-pi-subramaniam-a-eco/data/REPO-NAME/USER-NAME/EXPT-TYPE/YYYYMMDD_ISSUE-NUMBER_EXPT-NUMBER_EXPT-DESCRIPTION/.
  • Every experiment folder must have an associated sample_annotations.csv file in the same folder. See here for how to create this file. You should create sample_annotations.csv even for assays that yield a single image, such as Western blots. The only exceptions are agarose gel images, Sanger sequencing data, and Nanodrop measurements (that is, measurements that are unlikely to be used in a publication or analyzed quantitatively).
  • You can log in to AWS S3 from any computer connected to the internet and upload your data files using the web interface. Alternatively, you can use the aws s3 cp command from the AWS CLI to copy from your local computer to the AWS S3 bucket after you create the destination folder. See the example below.
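For example, to upload a local experiment folder and its sample_annotations.csv to the bucket (the folder name below is a hypothetical example of the naming convention above):

```bash
# Upload a local experiment folder to the S3 bucket.
# The folder name is a hypothetical example of the naming convention above.
aws s3 cp --recursive 20230115_42_exp12_reporter_flow/ \
    s3://fh-pi-subramaniam-a-eco/data/REPO-NAME/USER-NAME/EXPT-TYPE/20230115_42_exp12_reporter_flow/
```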

Analyses

  • All analysis should be performed in the project GitHub repository cloned to /fh/scratch/delete90/subramaniam_a/USER-NAME/git/REPO-NAME.
  • Create a new branch for each analysis and merge it with main or master when the analysis is complete (see the sketch after this list).
  • Analysis scripts for any single experiment should go into the folder analysis/USER-NAME/EXPT-TYPE/YYYYMMDD_ISSUE-NUMBER_EXPT-NUMBER_EXPT-DESCRIPTION/. Use the same folder organization and naming convention as what you used to store the data on S3.
  • Each analysis folder should contain a README.md file at the top level. All other files should be stored in one of the data, annotations, scripts, figures, or tables subfolders.
  • Raw data and sample annotations should be copied from AWS S3 into analysis/USER-NAME/EXPT-TYPE/YYYYMMDD_ISSUE-NUMBER_EXPT-NUMBER_EXPT-DESCRIPTION/data/. For multi-step analyses such as deep sequencing, where you generate a lot of processed data, store your raw data in a subfolder of data such as fastq.
  • Copy files from AWS S3 to the data folder using the following aws cli command. Store this command as aws_s3_cp.sh in the data folder and commit it to GitHub so that you can easily run it again the next time you want to perform the analysis.

    aws s3 cp --recursive s3://fh-pi-subramaniam-a-eco/data/REPO-NAME/USER-NAME/EXPT-TYPE/YYYYMMDD_ISSUE-NUMBER_EXPT-NUMBER_EXPT-DESCRIPTION/ .
  • Commit README.md, scripts, and small images and tables (<1 MB) to GitHub.
  • All tables should be in CSV or TSV format and not in XLS or ODS format.
  • All figures should be in PDF format so that they can be edited in Inkscape. The only exceptions are scatter plots with a lot of points, where a PNG is suitable to save space.
  • The analysis should be fully reproducible by running a single script such as a Jupyter Notebook or a Snakemake pipeline. Note that all intermediate files in the scratch folder will be automatically deleted after 90 days.
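As a sketch of the branching workflow mentioned above (the branch name is a hypothetical example):

```bash
# Start a new branch for this analysis; the branch name is a hypothetical example.
cd /fh/scratch/delete90/subramaniam_a/USER-NAME/git/REPO-NAME
git checkout main
git pull
git checkout -b 42_exp12_reporter_flow

# ...add and commit your analysis scripts, figures, and tables...

# Merge back into main when the analysis is complete.
git checkout main
git merge 42_exp12_reporter_flow
git push
```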

Software

All analysis should be performed in singularity containers or conda environments that can be reproducibly created from a Dockerfile or an environment.yml file, respectively.

We perform simple analyses on the cluster from within VSCode using the Remote SSH extension. For more complex analyses that need to run over multiple days, we run commands from a tmux session on the cluster to preserve the session if the connection is lost.
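For example, a typical tmux workflow for a long-running analysis looks like this (the session name is arbitrary):

```bash
# Start a named tmux session on the cluster.
tmux new -s analysis

# ...run your long-running commands inside the session, then detach with Ctrl-b d.
# The commands keep running even if your SSH connection drops.

# Reattach to the same session later.
tmux attach -t analysis
```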

Simple analyses

For quick analyses on the cluster with a single script (such as for a flow experiment), use one of the pre-created conda environments that mirror our lab's project_repo folder. You can list all available environments on the cluster using the following command. You should also be able to see the corresponding Jupyter kernel in VSCode.

```bash 
conda info --envs
```
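To use one of these environments outside VSCode, activate it by name (the environment and script names below are hypothetical examples):

```bash
# Activate a pre-created environment; the name is a hypothetical example.
conda activate flow_analysis

# Alternatively, run a script in that environment without activating it.
conda run -n flow_analysis python scripts/plot_flow_data.py
```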

If you want to create such an environment from a Dockerfile, convert the mamba create and mamba install commands for that environment into a shell script and run it.
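A minimal sketch of such a script, assuming the Dockerfile installs its packages with mamba (the environment name and package list are hypothetical examples):

```bash
#!/bin/bash
# Recreate a conda environment by copying the mamba commands from the Dockerfile.
# The environment name and packages below are hypothetical examples.
set -euo pipefail

mamba create -y -n flow_analysis python=3.10
mamba install -y -n flow_analysis -c conda-forge -c bioconda \
    pandas matplotlib jupyterlab snakemake
```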

Multi-step analyses

  • Either pull one of our lab's predefined singularity containers from the lab package repository or create your own singularity container from a Dockerfile.
  • To pull a singularity container from our lab's package repository, use the following command. Use a specific version number such as 1.3.0 instead of latest so that you can identify the container version easily.

    module load Singularity # Fred Hutch specific.
    singularity pull docker://ghcr.io/rasilab/<container_name>:VERSION_NUMBER
    
  • To create your own singularity containers, follow TBD steps to create a Docker container from a Dockerfile and push it to the lab package repository. Then pull the container from the package repository as above (see the sketch after this list).

  • Incorporate your container into a snakemake workflow by adding the container: LOCATION_OF_SIF_FILE directive inside a Snakefile and using the --use-singularity flag when running snakemake (a minimal sketch follows this list). For examples of how to use snakemake with singularity containers, see this example and this example. Note that you can use a single singularity container and refer to specific conda environments within that container for individual rules in the Snakefile. You can also specify a unique container for each rule in the Snakefile.
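Until the build-and-push steps above are documented, a generic sketch of that step might look like the following, assuming you have a Dockerfile in the current folder and push access to the lab's ghcr.io organization (the image name, version, and token variable are hypothetical examples):

```bash
# Build the image from the Dockerfile and tag it for the lab package repository.
# The image name and version are hypothetical examples.
docker build -t ghcr.io/rasilab/my_analysis:1.0.0 .

# Log in to ghcr.io with a GitHub personal access token stored in GITHUB_TOKEN, then push.
echo $GITHUB_TOKEN | docker login ghcr.io -u USER-NAME --password-stdin
docker push ghcr.io/rasilab/my_analysis:1.0.0
```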
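A minimal sketch of the container directive and snakemake invocation (the container path, rule, and file names are hypothetical examples):

```bash
# Write a minimal Snakefile that runs every rule inside a singularity container.
# The container path, rule, and file names are hypothetical examples.
cat > Snakefile <<'EOF'
container: "my_analysis_1.0.0.sif"

rule count_fastq_lines:
    input: "data/fastq/sample1.fastq.gz"
    output: "tables/sample1_line_count.txt"
    shell: "zcat {input} | wc -l > {output}"
EOF

# Run the workflow; --use-singularity tells snakemake to honor the container directive.
module load Singularity  # Fred Hutch specific.
snakemake --use-singularity --cores 1
```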

Guidelines for specific assays

Agarose gel images

  • We usually do not need high-resolution images for analysis or publication, so save the image in .jpg format from the Gel-Doc computer.
  • To save time, all agarose gel images can go into a single folder at fh-pi-subramaniam-a-eco/data/USER-NAME/agarose_gels with the name YYYYMMDD_expNN_short_description_image.jpg.
  • You can annotate the jpg image using Inkscape. You can also invert the color and crop the image if necessary. You can then export it as PNG for uploading.
  • The annotated gel image should be uploaded to the project or your GitHub repo.
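For example, to upload a gel image from your local computer to that folder (the file name is a hypothetical example of the naming convention above):

```bash
# Upload a single gel image to the shared agarose_gels folder on S3.
# The file name is a hypothetical example of the naming convention above.
aws s3 cp 20230115_exp12_colony_pcr_image.jpg \
    s3://fh-pi-subramaniam-a-eco/data/USER-NAME/agarose_gels/
```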