How many cells were sequenced in this dataset? How many genes are profiled?
What is the distribution of genes being detected per cell, and of number of unique transcripts being detected per cell? And for each gene, what is the distribution of number of cells it is detected in? Use QC functions from the scuttle package to automatically compute these metrics.
What is the sparsity of the data (in other words, how dense is the count matrix)?
What are the different annotations available for the cells?
What are the tissues used for cell profiling? Which type of cells come from which tissue?
Answer
# Check cell type annotationscolData(zeisel)table(zeisel$level1class)table(zeisel$level2class)table(zeisel$level2class, zeisel$level1class)# Check tissue of origintable(zeisel$tissue)table(zeisel$level1class, zeisel$tissue)
UMI number / cell
A useful diagnostic for scRNAseq data is the barcode rank plot, which shows the (log-)total UMI count for each barcode on the y-axis and the (log-)rank on the x-axis. This is effectively a transposed empirical cumulative density plot with log-transformed axes. It is useful as it allows users to examine the distribution of total counts across barcodes, focusing on those with the largest counts.
This diagnostic plot can be generated using DropletUtils package, notably the barcodeRanks() function.
Question
Try to use barcodeRanks() function to compute and plot UMI # / cell across all the cells in either 10X Genomics-generated data or Zeisel’s data.
Comment the difference. Again, what are the important differences in term of single-cell approaches used here?
An important step when getting your hands on droplet-based single-cell RNA-seq data is to be confident you are working with actual cell data. This means knowing how to deal with/remove (1) emtpy droplets and (2) droplets containing doublets.
Removing empty droplets
The ambient RNA “soup” sometimes makes it difficult to differentiate empty droplets from droplets containing cells with low amounts of RNA.
emptyDrops() function from the DropletUtils package provides a methodology to (1) estimate the ambient RNA contamination and then (2) compute a probability that each droplet contains a cell.
Since emptyDrops() assumes that most of the droplets in a matrix are empty, one needs to start the analysis from the raw unfiltered count matrix provided by 10X Genomics. Download the E18 mouse heart scRNAseq raw unfiltered count matrix from 10X Genomics here.
Use emptyDrops() to differentiate empty droplets from cell-containing droplets.
Be aware, the empty droplet detection step is quite lenghty! After all, you are scanning several millions of droplets, most of them empty! To fasten the process, you can specify a number of cpus to use in parallel, with the BiocParallel::MulticoreParam() function.
Answer
# emptyDrops performs Monte Carlo simulations to compute p-values, so we need to set the seed to obtain reproducible results.library(DropletUtils)set.seed(100)# Do not run if not on an HPC cluster: # heart_droplets <- emptyDrops(counts(heart_raw), BPPARAM = BiocParallel::MulticoreParam(workers = 40))heart_dropletstable(heart_droplets$FDR<=0.001)heart_filtered<-heart_raw[, which(heart_droplets$FDR<=0.001)]
Even with multiple cpus, the computation can take up to several hours. So if you wish to skip this step for now, you can use the cellranger-automatically filtered matrix for now. It is not exactly equivalent, but does a fairly good job at finding non-empty droplets and I would recommend sticking to it for the beginning.
heart_filtered<-heart
Flagging cell doublets
Another artifact emerging from non-perfect experimental steps is the sequencing of two cells contained within a single droplet. This can occur when many cells are sequenced on a single 10X Genomics cassette (doublet increase of 1% per 1,000 cells sequenced). This can also occur when cells are not in a perfect single cell suspension.
A way to identify cell doublets is to artifically mix thousands of pairs of cells (columns) of a SingleCellExperiment object, then compare the resulting cells to each cell in the original dataset. Original cells which resemble a lot the artificial doublets are likely doublet themselves.
For now, we can keep these doublets. We will see in the future whether we remove them or not.
3. (Optional) Exclude non-relevant genes from analysis
The 10X Genomics-provided count matrix contains 31053 annotated genes. However, there are likely less than 20,000 of them which are genomic, protein-coding, expressed genes. We can filter genes based on the location, biotype and overall detection in the dataset.
Question
Recover gene annotations as gtf from ensembl uing the AnnotationHub
Filter to only get protein-coding, ENSEMBL+HAVANA-annotated genomic genes
A crude sequencing depth normalization followed by log-transformation. This is usually referred to as “log normalizing”.
A more advanced (and probably more accurate) approach is the variance stabilizing transformation. This aims at removing the relationship between levels at which a gene is expressed and the variance of its expression.
Log-normalization
Just like in bulk high-throughput sequencing experiments, scRNAseq counts have to be normalized to the sequencing depth for each cell. We can define the library size (a.k.a. size factor )as the total sum of counts across all genes for each cell. However, this relies on the assumption that within the entire dataset, most genes are non-differentially expressed and expressed roughly within the same range. Depending on the set up of the scRNAseq experiment, this can be entirely false. To avoid relying on this hypothesis, we can (1) quickly pre-cluster cells, then (2) normalize cells using their library size factor separately in each cluster, then (3) rescaling size factors so that they are comparable across clusters.
Question
Read documentation for scran functions quickCluster() and computeSumFactors(). Compute size factors for each cell in the manually filtered E18 mouse heart scRNAseq dataset
Quickly read the extensive vignette about scRNAseq normalization using variance stabilizing transformation here.
First, check the relationship between (1) mean gene expression and gene expression variance and (2) mean gene expression and gene detection rate, in the manually filtered E18 mouse heart scRNAseq count matrix.
Using variance-stabilized residuals, correct the raw counts in the heart scRNAseq count matrix. You will need sctransform::correct() function to do this
Store the corrected counts in an assay named corrected_counts
Log-transform the corrected_counts using log1p() function and store the transformed counts in an assay named logcounts_vst