Exercises: From bcl to count matrix

Goals:

Understand the structure of raw sequencing files, fastq files, and output of cellranger workflow.
Execute the cellranger pipeline (mkfastq + count) to see how things work!
Learn more about public data access and recovery.

0. Introduction to `shell` terminal

shell (sh) is a software used to interpret commands typed in a terminal. It exists in both Mac and Linux environments.

The basic sh commands are useful to:

Navigate within directories
Manage files organization
Launch command-line-based softwares (e.g. cellranger)

Here are some of the most important commands:

Check your working directory

pwd

Check history

history

put history into a history.txt file

history > history.txt

make a new folder called data

mkdir data

Go to the new data directory

cd data

move history.txt file into data directory

mv ../history.txt ./

check manual page of curl command

man curl

check specific help for cellranger command and subcommands

cellranger --help
cellranger count --help

redirect cellranger count help output into a file called cellranger-help.txt

cellranger count --help > cellranger-help.txt

Download a file from Internet with curl

curl https://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-1.2.0.tar.gz

List all files in a folder

ls -l ~/
ls --color -Flh ~/

1. Prepare a place in your computer where you will follow the workshop

Create a directory for the workshop

Question

Open a terminal and navigate to your preferred location for the workshop.

Answer

# Create a directory for the workshop 
cd ${HOME}
mkdir scRNAseq_Jan24
cd ${HOME}/scRNAseq_Jan24/

From now on, everything you do should take place in this folder! Be sure you have enough storage space in the filesystem you are using, as you will need lots of it!

Clone github directory in the workshop directory

Question

Download the git repository for this course from GitHub

Answer

cd ${HOME}/scRNAseq_Jan24/
git clone https://github.com/js2264/scRNAseq-workshop.git

This downloads the repository for this course to your home folder on the AWS machine.
To get it on your local computer (to save the lectures and exercises), you can also go to the GitHub repo page, click on the green Code button, then Download ZIP. Beware, the download may take a significant time based on your internet connection (several hundreds MB).

2. Process raw files into fastq files

NOTE: This is a step typically performed internally by sequencing platform, which delivers .fastq files rather than .bcl files.

First, familiarize yourself with cellranger mkfastq documentation: go to cellranger mkfastq webpage and read the Overview.

Question

What is the command you are going to use? What are the required and optional arguments for this command?

Answer

An alternative to the web-based documentation is to use the command-line help:

cellranger mkfastq --help

Getting input toy dataset

Let’s download a toy dataset to process into fastq files. A bcl tiny file is available and provided by 10X Genomics at the following adress: https://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-1.2.0.tar.gz.

Question

Download the indicated bcl files and unzip it in a subdirectory called data/bcl2fastq/.

Answer

cd ${HOME}/scRNAseq_Jan24/
mkdir -p data/bcl2fastq/
curl https://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-1.2.0.tar.gz -o data/bcl2fastq/cellranger-tiny-bcl-1.2.0.tar.gz
tar -xzvf data/bcl2fastq/cellranger-tiny-bcl-1.2.0.tar.gz && mv cellranger-tiny-bcl-1.2.0/ data/bcl2fastq/

Question

Explore the contents of the sequencing directory. What does each file correspond to? Can you locate the actual “sequencing” files?

Answer

ls --color -ltFh data/bcl2fastq/cellranger-tiny-bcl-1.2.0

Question

Alternatively, you can use the tree command (if available in your system!) to list the content of the cellranger-tiny-bcl-1.2.0 directory:

Answer

tree -L 4 data/bcl2fastq/cellranger-tiny-bcl-1.2.0/

Running `cellranger mkfastq`

Question

Do we have all the required files to run the cellranger mkfastq workflow? What about a samplesheet?

Answer

Normally, when sequencing a library, a samplesheet is provided to the Illumina sequencing machine. In our case, we don’t have direct access to this sample sheet. Regardless, we can create one manually. Here are the info for the different samples which were sequenced in this toy dataset

echo "Lane,Sample,Index
1,test_sample1,SI-GA-E3
1,test_sample2,SI-GA-F3
1,test_sample3,SI-GA-G3
1,test_sample4,SI-GA-H3
" > data/bcl2fastq/cellranger-tiny-bcl-samplesheet.csv

Question

What does each column corresponds to? How is this going to be used when generating fastq files?

Question

Now that we have a samplesheet ready, let’s launch the cellranger mkfastq workflow.

Answer

cd ${HOME}/scRNAseq_Jan24/data/bcl2fastq/
cellranger mkfastq \
    --id=tiny-bcl \
    --run=cellranger-tiny-bcl-1.2.0/ \
    --csv=cellranger-tiny-bcl-samplesheet.csv
cd ${HOME}/scRNAseq_Jan24/

Watch out the memory usage! For mkfastq command with human genome, at least 32 Gb of RAM are required!

Question

What are the different files generated by this workflow?

Answer

Once the conversion is achieved, the output folders can be viewed by running the ls command:

ls --color -ltFv data/bcl2fastq/tiny-bcl/
ls --color -ltFv data/bcl2fastq/tiny-bcl/outs/fastq_path/H35KCBCXY/test_sample1/
### Or ...
tree -L 3 data/bcl2fastq/tiny-bcl/

Question

How many fastq files have been generated? What does each one correspond to?
Look at the index read (I1), read 1 (R1), and read (R2) files using the command zcat <FASTQ_FILE_NAME>.gz | head. What does each file contain?
Open the html file tiny-bcl/outs/fastq_path/Reports/html/index.html. Take some time to explore the demultiplexed outputs.

3. Generate gene count matrices with cellranger count

Familiarize yourself with the cellranger count documentation available here: cellranger count algorithm overview. Notably, read the section on Alignment (Read Trimming, Genome Alignment, MAPQ adjustment, Transcriptome Alignment, UMI Counting).

Question

Which files are required for this step? Do we have all we need? Where is the index genome located?

Download genome index for the toy dataset

mm10 pre-processed cellranger-formatted genome reference index is available here.

Question

Download it in a subdirectory named ${HOME}/scRNAseq_Jan24/

Answer

curl https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-mm10-2020-A.tar.gz -o refdata-gex-mm10-2020-A.tar.gz
tar -xzvf refdata-gex-mm10-2020-A.tar.gz && mv refdata-gex-mm10-2020-A/ data/bcl2fastq/

ls --color -ltFh data/bcl2fastq/refdata-gex-mm10-2020-A/*

Running `cellranger count`

Question

In the terminal, run the count command.

Answer

cd ${HOME}/scRNAseq_Jan24/data/bcl2fastq/
cellranger count \
    --id=counts \
    --transcriptome=refdata-gex-mm10-2020-A \
    --fastqs=tiny-bcl/outs/fastq_path/ \
    --sample=test_sample1

While the count command is running, read about the format of the feature-barcode matrices.

Checking `count` output files

Once the count command is finished running, the pipeline outputs can be viewed as follows:

ls --color -ltFh counts/
ls --color -ltFh counts/outs/
### Or ...
tree -L 4 counts/

Question

Can you locate the feature-barcode matrices? What is the difference between the raw_feature_bc_matrix and filtered_feature_bc_matrix data types? In term of storage size?
Open the html file counts/outs/web_summary.html. Take some time to explore the gene expression matrix outputs.
How many clusters seem to be found? What are the main markers associated with each cluster?
Can you speculate what the main difference(s) is between the clusters?
Do the different metrics suggest that this sample contains good-quality data?

3 [Alternative] Generate gene count matrices with STARsolo

# Install STAR
conda install -c bioconda star 

# Build STAR index
curl https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-mm10-2020-A.tar.gz -o refdata-gex-mm10-2020-A.tar.gz
tar -xzvf refdata-gex-mm10-2020-A.tar.gz && mv refdata-gex-mm10-2020-A/ data/bcl2fastq/
STAR --runMode genomeGenerate --runThreadN 16 --genomeDir data/bcl2fastq/ --genomeFastaFiles data/bcl2fastq/refdata-gex-mm10-2020-A/fasta/genome.fa  --sjdbGTFfile data/bcl2fastq/refdata-gex-mm10-2020-A/genes/genes.gtf
STAR_GENOME_DIR=data/bcl2fastq/refdata-gex-mm10-2020-A/star/

# Get barcode whitelist
curl https://raw.githubusercontent.com/10XGenomics/cellranger/master/lib/python/cellranger/barcodes/737K-august-2016.txt -o data/bcl2fastq/737K-august-2016.txt
BC_WHITELIST_FILE=data/bcl2fastq/737K-august-2016.txt

# Run STAR
STAR \
    --genomeDir "${STAR_GENOME_DIR}" \
    --soloType CB_UMI_Simple \
    --soloCBwhitelist "${BC_WHITELIST_FILE}" \
    --readFilesIn data/bcl2fastq/tiny-bcl/outs/fastq_path/Undetermined_S0_L001_R2_001.fastq.gz data/bcl2fastq/tiny-bcl/outs/fastq_path/Undetermined_S0_L001_R1_001.fastq.gz

4. Obtain single-cell RNA-seq datasets

“This is a course about single-cell RNA-seq analysis, right, so where is my data?”

Ok, “your” data is (most likely) yet to be sequenced! Or maybe you’re interested in digging already existing databases! I mean, who isn’t interested in this mind-blowing achievement from 10X Genomics??

Human Cell Atlas is probably a good place to start digging, if you are interested in mammal-related studies. For instance, let’s say I am interested in epididymis differentiation. Boom: here is an entry from the HCA focusing on epididymis: link to HCA data portal.

Raw fastq reads from GEO

Here is the link to the actual paper studying epididymis:
An atlas of human proximal epididymis reveals cell-specific functions and distinct roles for CFTR.

Question

Find and check out the corresponding GEO entries for this study. What type of sequencing data is available?

Here is the link to the GEO page: link.

Question

Can you find links to download the raw data from this paper?

There are several ways to find this information, e.g. ffq command line tool, or using the web-based sra-explorer page (here). You generally will need the GEO corresponding ID or SRA project ID (e.g. SRPxxxxxx…).

Question

Try to install and use ffq tool from the Patcher lab.

Answer

conda install -c bioconda ffq
ffq --help
ffq -t GSE GSE148963

Question

Can you find the links to raw data associated with the GSE148963 GEO ID?

Answer

You should use a grep command: grep returns the lines which match a given pattern (e.g. a link…)!

ffq -t GSE GSE148963 | grep 'ftp://'

And with a bit of sed magick…

ffq -t GSE GSE148963 | grep 'ftp://' | sed 's,.*ftp:,ftp:,' | sed 's,".*,,' > GSE148963_fastqlist.txt
# wget -i GSE148963_fastqlist.txt ## Do not run, it would take too long...

[BONUS] Pre-processed count matrices

Many times, researchers will provide a filtered count matrix when they publish scRNAseq experiments (along with mandatory raw fastq data, of course). It’s way lighter than fastq reads, and you can go ahead with downstream analyses a lot quicker. So how do you get these matrices?

Human Cell Atlas Consortium provides many processed datasets. For instance, in our case, the Leir et al study is available at the following link: https://data.humancellatlas.org/explore/projects/842605c7-375a-47c5-9e2c-a71c2c00fcad.
GEO also hosts processed files.

Question

Find GEO-hosted processed files for the Leir et al study.

You can download some of the processed files available in GEO from the following webpage. Scrolling down to the bottom of the page, there is a box labelled “Supplementary data”. By clicking on “(custom)”, a list of extra supplementary files will appear.

Download and check the content of the count matrix, the genes and the barcodes files.
What type of information does each file contain? How is it formatted? is it easily imported in R?
How many cells were sequenced? How many genes were counted?
Is it easy to interpret the count matrix? Why is it in such format?
Comment on the file sizes between processed count matrix files and raw reads.

0. Introduction to shell terminal

1. Prepare a place in your computer where you will follow the workshop

Create a directory for the workshop

Clone github directory in the workshop directory

2. Process raw files into fastq files

Getting input toy dataset

Running cellranger mkfastq

3. Generate gene count matrices with cellranger count

Download genome index for the toy dataset

Running cellranger count

Checking count output files

3 [Alternative] Generate gene count matrices with STARsolo

4. Obtain single-cell RNA-seq datasets

Raw fastq reads from GEO

[BONUS] Pre-processed count matrices

0. Introduction to `shell` terminal

Running `cellranger mkfastq`

Running `cellranger count`

Checking `count` output files