mkdir -p data/E18_Heart/fastq
curl https://cf.10xgenomics.com/samples/cell-exp/3.0.0/heart_1k_v3/heart_1k_v3_fastqs.tar -O data/E18_Heart/fastq/heart_1k_v3_fastqs.tar
tar -xvf data/E18_Heart/fastq/heart_1k_v3_fastqs.tar
mv heart_1k_v3_fastqs/ data/E18_Heart/fastq/
ls --color -ltFh data/E18_Heart/fastq/heart_1k_v3_fastqs
zcat data/E18_Heart/fastq/heart_1k_v3_fastqs/heart_1k_v3_S1_L001_R1_001.fastq.gz | head
zcat data/E18_Heart/fastq/heart_1k_v3_fastqs/heart_1k_v3_S1_L001_R2_001.fastq.gz | head
zcat data/E18_Heart/fastq/heart_1k_v3_fastqs/heart_1k_v3_S1_L001_I1_001.fastq.gz | headDemonstration: From fastq to count matrix
Goals:
- Introduce the
cellrangertoolkit
1. Download sequencing reads in fastq format
Here we will process a single-cell RNA-seq dataset provided by 10X Genomics, as an example.
Here is the link to the dataset: link
This is a single-cell RNA-seq sample from mouse embryonic (E18) heart cells. Let’s first download the raw fastqs.
2. Prepare genome for alignment
Download GRCm38 genome reference and gene annotations from iGenomes to ensure that genome reference and gene annotations are uniformly processed.
# Download files
mkdir data/E18_Heart/GRCm38/ && cd data/E18_Heart/GRCm38
curl http://igenomes.illumina.com.s3-website-us-east-1.amazonaws.com/Mus_musculus/Ensembl/GRCm38/Mus_musculus_Ensembl_GRCm38.tar.gz
tar -xzvf Mus_musculus_Ensembl_GRCm38.tar.gz
# Clean up gtf file to remove unscaffolded contigs
grep -vP "^CHR|^GL|^JH" Mus_musculus/Ensembl/GRCm38/Annotation/Genes/genes.gtf > Mus_musculus/Ensembl/GRCm38/Annotation/Genes/genes_filtered.gtf
cut -f1 Mus_musculus/Ensembl/GRCm38/Annotation/Genes/genes_filtered.gtf | uniq -c
# Build cellranger index
cellranger mkref \
--genome=GRCm38 \
--fasta=Mus_musculus/Ensembl/GRCm38/Sequence/WholeGenomeFasta/genome.fa \
--genes=Mus_musculus/Ensembl/GRCm38/Annotation/Genes/genes_filtered.gtf \
--nthreads=18 \
--memgb=40
cd ../../../
ls --color -lthF data/E18_Heart/GRCm38/GRCm38If you fail to download/build Cellranger index, you can always get another version from here:
wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-mm10-2020-A.tar.gz
#md5sum: 886eeddde8731ffb58552d0bb81f533d
tar -xzvf refdata-gex-mm10-2020-A.tar.gz
ls --color -lthF data/E18_Heart/GRCm38/GRCm383. Map reads onto 10X-formatted genome with cellranger
mkdir -p data/E18_Heart/cellranger && cd data/E18_Heart/cellranger
cellranger count \
--id=E18_Heart \
--transcriptome=../../../data/E18_Heart/GRCm38/GRCm38 \
--fastqs=../../../data/E18_Heart/fastq/heart_1k_v3_fastqs \
--expect-cells=1000 \
--localcores=18 \
--localmem=40
cd ../../Mapping takes a significant amount of time. On my machine, it takes nearly ~ 1h to finish.
This should appear at the end of the mapping/counting process:
## Outputs:
## - Run summary HTML: E18_Heart/web_summary.html
## - Run summary CSV: E18_Heart/metrics_summary.csv
## - BAM: E18_Heart/possorted_genome_bam.bam
## - BAM index: E18_Heart/possorted_genome_bam.bam.bai
## - Filtered feature-barcode matrices MEX: E18_Heart/filtered_feature_bc_matrix
## - Filtered feature-barcode matrices HDF5: E18_Heart/filtered_feature_bc_matrix.h5
## - Unfiltered feature-barcode matrices MEX: E18_Heart/raw_feature_bc_matrix
## - Unfiltered feature-barcode matrices HDF5: E18_Heart/raw_feature_bc_matrix.h5
## - Secondary analysis output CSV: E18_Heart/analysis
## - Per-molecule read information: E18_Heart/molecule_info.h5
## - Loupe Browser file: E18_Heart/cloupe.cloupe
## Waiting 6 seconds for UI to do final refresh.
## Pipestance completed successfully!4. Check output files
A description of the different output files is available here.
ls --color -lthF data/E18_Heart/cellranger/E18_Heart/
ls --color -lthF data/E18_Heart/cellranger/E18_Heart/outsWe can check the html summary report in a web browser to get more insights about the results of the scRNAseq experiment.
data/E18_Heart/cellranger/E18_Heart/outs/web_summary.htmlIf samtools is installed, one can also check the bam file obtained using cellranger count workflow.
samtools view data/E18_Heart/cellranger/E18_Heart/outs/possorted_genome_bam.bam | head -n 10
samtools flagstat data/E18_Heart/cellranger/E18_Heart/outs/possorted_genome_bam.bamThe analysis folder contains relevant(ish) information obtained from after a rough post-alignment processing of the dataset by cellranger.
tree -L 2 data/E18_Heart/cellranger/E18_Heart/outs/analysis/
head data/E18_Heart/cellranger/E18_Heart/outs/analysis/clustering/graphclust/clusters.csv
cut -f 2 -d, data/E18_Heart/cellranger/E18_Heart/outs/analysis/clustering/graphclust/clusters.csv | sort | uniq -c