Exercises: From bcl to count matrix
Goals:
- Understand the structure of raw sequencing files, fastq files, and the output of the cellranger workflow.
- Execute the cellranger pipeline (mkfastq + count) to see how things work!
- Learn more about public data access and recovery.
0. Introduction to shell terminal
The shell (sh) is a program that interprets commands typed in a terminal. It is available in both Mac and Linux environments.
The basic sh commands are useful to:
- Navigate within directories
- Manage file organization
- Launch command-line software (e.g. cellranger)
Here are some of the most important commands:
- Check your working directory
pwd
- Check history
history
- Put history into a history.txt file
history > history.txt
- Make a new folder called data
mkdir data
- Go to the new data directory
cd data
- Move the history.txt file into the data directory
mv ../history.txt ./
- Check the manual page of the curl command
man curl
- Check specific help for the cellranger command and subcommands
cellranger --help
cellranger count --help
- Redirect the cellranger count help output into a file called cellranger-help.txt
cellranger count --help > cellranger-help.txt
- Download a file from the Internet with curl (-o names the file to write; without it, curl sends the archive to the terminal)
curl https://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-1.2.0.tar.gz -o cellranger-tiny-bcl-1.2.0.tar.gz
- List all files in a folder
ls -l ~/
ls --color -Flh ~/
1. Prepare a place in your computer where you will follow the workshop
Create a directory for the workshop
From now on, everything you do should take place in this folder! Be sure you have enough storage space in the filesystem you are using, as you will need lots of it!
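A minimal sketch of this step, assuming a hypothetical folder name `scRNAseq_workshop` in your home directory (the course does not prescribe a name; any location with enough free space will do):

```shell
# Hypothetical workshop location; set WORKSHOP_DIR beforehand to use another folder.
WORKSHOP_DIR="${WORKSHOP_DIR:-$HOME/scRNAseq_workshop}"

# Create the folder (no error if it already exists) and move into it.
mkdir -p "${WORKSHOP_DIR}"
cd "${WORKSHOP_DIR}"

# Report the free space left on the filesystem that holds the folder.
df -h .
```

The `Avail` column printed by `df -h` is the number to watch before downloading references and raw data.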
Clone the GitHub repository into the workshop directory
This downloads the repository for this course to your home folder on the AWS machine.
To get it on your local computer (to save the lectures and exercises), you can also go to the GitHub repo page, click on the green Code button, then Download ZIP. Beware, the download may take a significant amount of time depending on your internet connection (several hundred MB).
2. Process raw files into fastq files
NOTE: This is a step typically performed internally by the sequencing platform, which delivers .fastq files rather than .bcl files.
First, familiarize yourself with the cellranger mkfastq documentation: go to the cellranger mkfastq webpage and read the Overview.
Getting input toy dataset
Let’s download a toy dataset to process into fastq files. A tiny bcl dataset is provided by 10X Genomics at the following address: https://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-1.2.0.tar.gz.
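The download could be sketched as follows. The target folder `data/bcl2fastq` is an assumption, chosen to match the paths used later in these exercises, and `download_toy_bcl` is a hypothetical helper name. The function is only defined here, so nothing is fetched until you call it.

```shell
# URL given above; the archive name is derived from it.
BCL_URL="https://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-1.2.0.tar.gz"
BCL_ARCHIVE="${BCL_URL##*/}"     # strips everything up to the last "/"
OUT_DIR="data/bcl2fastq"         # assumed location

# Hypothetical helper: download the archive and unpack it next to itself.
download_toy_bcl() {
    mkdir -p "${OUT_DIR}" || return 1
    # -f: fail on HTTP errors instead of saving an error page; -L: follow redirects
    curl -fL "${BCL_URL}" -o "${OUT_DIR}/${BCL_ARCHIVE}" || return 1
    tar -xzf "${OUT_DIR}/${BCL_ARCHIVE}" -C "${OUT_DIR}"
}

# Call the helper when you are ready (requires network access):
# download_toy_bcl
```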
Running cellranger mkfastq
Watch out for memory usage! For the mkfastq command with a human genome, at least 32 GB of RAM is required!
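A hedged sketch of the mkfastq call. All three paths are assumptions: the extracted run folder name, a simple CSV sample sheet stored next to it (its file name is not given in this document), and the run ID `tiny-bcl`, chosen so that the output lands in `data/bcl2fastq/tiny-bcl/` as used later. Check `cellranger mkfastq --help` for the options of your installed version.

```shell
# Assumed locations; adjust them to where your files actually live.
WORK_DIR="data/bcl2fastq"
BCL_RUN="cellranger-tiny-bcl-1.2.0"                  # extracted bcl run folder (assumed name)
SAMPLE_SHEET="cellranger-tiny-bcl-simple-1.2.0.csv"  # simple CSV sample sheet (assumed name)
RUN_ID="tiny-bcl"                                    # name of the output folder

if command -v cellranger >/dev/null 2>&1 \
   && [ -d "${WORK_DIR}/${BCL_RUN}" ] && [ -f "${WORK_DIR}/${SAMPLE_SHEET}" ]; then
    # Run inside WORK_DIR so the output is written to WORK_DIR/RUN_ID/
    ( cd "${WORK_DIR}" && \
      cellranger mkfastq --id="${RUN_ID}" --run="${BCL_RUN}" --csv="${SAMPLE_SHEET}" )
else
    echo "cellranger, the bcl folder or the sample sheet is missing: nothing was run."
fi
```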
3. Generate gene count matrices with cellranger count
Familiarize yourself with the cellranger count documentation available here: cellranger count algorithm overview. Notably, read the section on Alignment (Read Trimming, Genome Alignment, MAPQ adjustment, Transcriptome Alignment, UMI Counting).
Download genome index for the toy dataset
A pre-processed, cellranger-formatted mm10 genome reference index is available here.
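One possible way to fetch the reference, using the `refdata-gex-mm10-2020-A` archive URL that appears in the STARsolo section of these exercises. The parent folder `data/bcl2fastq` and the helper name `download_reference` are assumptions. The archive is large, so the helper is only defined here and skips the download if the reference folder already exists.

```shell
REF_URL="https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-mm10-2020-A.tar.gz"
REF_ARCHIVE="${REF_URL##*/}"            # file name part of the URL
REF_NAME="${REF_ARCHIVE%.tar.gz}"       # folder name once unpacked
REF_PARENT="data/bcl2fastq"             # assumed location

# Hypothetical helper: download and unpack the reference once.
download_reference() {
    mkdir -p "${REF_PARENT}" || return 1
    if [ -d "${REF_PARENT}/${REF_NAME}" ]; then
        echo "Reference already present in ${REF_PARENT}/${REF_NAME}"
        return 0
    fi
    curl -fL "${REF_URL}" -o "${REF_PARENT}/${REF_ARCHIVE}" || return 1
    tar -xzf "${REF_PARENT}/${REF_ARCHIVE}" -C "${REF_PARENT}"
}

# Call the helper when you are ready (requires network access and several GB of disk):
# download_reference
```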
Running cellranger count
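A sketch of the count call under stated assumptions: the reference and fastq paths follow the locations used elsewhere in these exercises, the run ID `counts` matches the output folder inspected below, and the sample name `test_sample` is assumed (use the name listed in your own sample sheet). Recent cellranger releases may require additional arguments such as `--create-bam`; check `cellranger count --help`.

```shell
TRANSCRIPTOME="data/bcl2fastq/refdata-gex-mm10-2020-A"   # cellranger-formatted reference
FASTQ_DIR="data/bcl2fastq/tiny-bcl/outs/fastq_path"      # mkfastq output
SAMPLE="test_sample"                                     # assumed sample name
COUNT_ID="counts"                                        # output folder name

if command -v cellranger >/dev/null 2>&1 \
   && [ -d "${TRANSCRIPTOME}" ] && [ -d "${FASTQ_DIR}" ]; then
    cellranger count \
        --id="${COUNT_ID}" \
        --transcriptome="${TRANSCRIPTOME}" \
        --fastqs="${FASTQ_DIR}" \
        --sample="${SAMPLE}"
else
    echo "cellranger, the reference or the fastq folder is missing: nothing was run."
fi
```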
While the count command is running, read about the format of the feature-barcode matrices.
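To make the format concrete, here is a small sketch on a made-up matrix (3 features, 2 barcodes, all values invented for illustration). It writes the three files of a feature-barcode matrix folder uncompressed into a scratch directory, then sums the counts per barcode with awk. Real cellranger output is gzipped, so the files would first go through `gunzip -c`.

```shell
# Scratch folder holding a synthetic matrix (not real data).
MEX_DIR="$(mktemp -d)"

# One line per feature (row) and one line per barcode (column).
printf 'ID_1\tGeneA\tGene Expression\nID_2\tGeneB\tGene Expression\nID_3\tGeneC\tGene Expression\n' \
    > "${MEX_DIR}/features.tsv"
printf 'AAACCTGAGAAACCAT-1\nAAACCTGAGAAGGCCT-1\n' > "${MEX_DIR}/barcodes.tsv"

# Matrix Market file: header, then "n_rows n_cols n_entries", then "row col count" triplets.
printf '%%%%MatrixMarket matrix coordinate integer general\n3 2 3\n1 1 5\n3 1 2\n2 2 4\n' \
    > "${MEX_DIR}/matrix.mtx"

# Sum counts per barcode: column index (field 2) points at a line of barcodes.tsv.
UMI_PER_BARCODE="$(awk '
    NR == FNR   { barcode[FNR] = $1; n = FNR; next }   # first file: remember barcodes
    /^%/        { next }                               # skip header and comment lines
    !seen_dims  { seen_dims = 1; next }                # skip the dimensions line
                { total[$2] += $3 }                    # add the count to its column
    END         { for (j = 1; j <= n; j++) printf "%s\t%d\n", barcode[j], total[j] }
' "${MEX_DIR}/barcodes.tsv" "${MEX_DIR}/matrix.mtx")"

echo "${UMI_PER_BARCODE}"
```

Only non-zero entries are stored, which is why the matrix file stays small even for thousands of cells.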
Checking count output files
Once the count command is finished running, the pipeline outputs can be viewed as follows:
ls --color -ltFh counts/
ls --color -ltFh counts/outs/
### Or ...
tree -L 4 counts/
3. [Alternative] Generate gene count matrices with STARsolo
# Install STAR
conda install -c bioconda star
# Download and unpack the mm10 reference
curl https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-mm10-2020-A.tar.gz -o refdata-gex-mm10-2020-A.tar.gz
tar -xzvf refdata-gex-mm10-2020-A.tar.gz && mv refdata-gex-mm10-2020-A/ data/bcl2fastq/
# Build STAR index in its own directory (the directory name is arbitrary)
STAR_GENOME_DIR=data/bcl2fastq/star-index-mm10/
mkdir -p "${STAR_GENOME_DIR}"
STAR --runMode genomeGenerate --runThreadN 16 --genomeDir "${STAR_GENOME_DIR}" --genomeFastaFiles data/bcl2fastq/refdata-gex-mm10-2020-A/fasta/genome.fa --sjdbGTFfile data/bcl2fastq/refdata-gex-mm10-2020-A/genes/genes.gtf
# Get barcode whitelist
curl https://raw.githubusercontent.com/10XGenomics/cellranger/master/lib/python/cellranger/barcodes/737K-august-2016.txt -o data/bcl2fastq/737K-august-2016.txt
BC_WHITELIST_FILE=data/bcl2fastq/737K-august-2016.txt
# Run STAR (cDNA read R2 first, then barcode/UMI read R1; gunzip -c lets STAR read the gzipped fastq files)
STAR \
--genomeDir "${STAR_GENOME_DIR}" \
--soloType CB_UMI_Simple \
--soloCBwhitelist "${BC_WHITELIST_FILE}" \
--readFilesCommand gunzip -c \
--readFilesIn data/bcl2fastq/tiny-bcl/outs/fastq_path/Undetermined_S0_L001_R2_001.fastq.gz data/bcl2fastq/tiny-bcl/outs/fastq_path/Undetermined_S0_L001_R1_001.fastq.gz
4. Obtain single-cell RNA-seq datasets
“This is a course about single-cell RNA-seq analysis, right, so where is my data?”
Ok, “your” data is (most likely) yet to be sequenced! Or maybe you’re interested in digging into existing databases! I mean, who isn’t interested in this mind-blowing achievement from 10X Genomics??
Human Cell Atlas is probably a good place to start digging, if you are interested in mammal-related studies. For instance, let’s say I am interested in epididymis differentiation. Boom: here is an entry from the HCA focusing on epididymis: link to HCA data portal.
Raw fastq reads from GEO
Here is the link to the actual paper studying epididymis:
An atlas of human proximal epididymis reveals cell-specific functions and distinct roles for CFTR.
Here is the link to the GEO page: link.
There are several ways to find the raw fastq files of a study, e.g. the ffq command-line tool or the web-based sra-explorer page (here). You will generally need the corresponding GEO ID or SRA project ID (e.g. SRPxxxxxx…).
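As an illustration, a small wrapper around ffq could look like the sketch below. It assumes ffq is installed (for instance with `pip install ffq`) and that its `--ftp` option, which prints FTP links as JSON, is available in your version. `fetch_fastq_links` is a hypothetical helper name, and the accession is left for you to fill in.

```shell
# Hypothetical helper: save the FTP links reported by ffq for one accession.
fetch_fastq_links() {
    accession="$1"
    if [ -z "${accession}" ]; then
        echo "usage: fetch_fastq_links <GEO or SRA accession>" >&2
        return 2
    fi
    if ! command -v ffq >/dev/null 2>&1; then
        echo "ffq is not installed" >&2
        return 127
    fi
    # ffq prints JSON on standard output; keep it in a file named after the accession.
    ffq --ftp "${accession}" > "${accession}_ftp.json"
}

# Example call (replace the placeholder with a real accession):
# fetch_fastq_links SRPxxxxxx
```

The resulting JSON lists one entry per file, each with a URL that can be passed to curl.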
[BONUS] Pre-processed count matrices
Many times, researchers will provide a filtered count matrix when they publish scRNAseq experiments (along with mandatory raw fastq data, of course). It’s way lighter than fastq reads, and you can go ahead with downstream analyses a lot quicker. So how do you get these matrices?
- Human Cell Atlas Consortium provides many processed datasets. For instance, in our case, the Leir et al. study is available at the following link: https://data.humancellatlas.org/explore/projects/842605c7-375a-47c5-9e2c-a71c2c00fcad.
- GEO also hosts processed files.