Exercises: scRNAseq analysis with R/Bioconductor (1/3)

Goals:


1. Import single-cell RNA-seq data in R

Import data from cellranger workflow

Importing 10X Genomics scRNAseq data in R can be done using DropletUtils package.

Question
  • Read the documentation for DropletUtils utilities useful for 10X Genomics data import here.
  • Download 1K mouse E18 heart scRNAseq filtered count matrix from 10X Genomics (in HDF5 format available here).
  • Import it into R using read10xCounts() function.
dir.create('data/E18_Heart/cellranger/E18_Heart/')
download.file(
    url = 'https://cf.10xgenomics.com/samples/cell-exp/3.0.0/heart_1k_v3/heart_1k_v3_filtered_feature_bc_matrix.h5', 
    destfile = 'data/E18_Heart/cellranger/E18_Heart/heart_1k_v3_filtered_feature_bc_matrix.h5',
    mode = 'wb'
)
heart <- DropletUtils::read10xCounts('data/E18_Heart/cellranger/E18_Heart/heart_1k_v3_filtered_feature_bc_matrix.h5')
Question
  • How many cells were sequenced in this dataset? How many genes are profiled?
  • What is the distribution of genes being detected per cell, and of number of unique transcripts being detected per cell? And for each gene, what is the distribution of number of cells it is detected in? Use QC functions from the scuttle package to automatically compute these metrics.
  • What is the sparsity of the data (in other words, how dense is the count matrix)?
# Experiment size 
dim(heart)

# Genes / transcripts detected per cell
heart <- scuttle::addPerCellQCMetrics(heart)
heart <- scuttle::addPerFeatureQCMetrics(heart)
quantile(heart$sum, seq(0, 1, 0.1))
quantile(heart$detected, seq(0, 1, 0.1))
quantile(rowData(heart)$detected, seq(0, 1, 0.1))

# Count matrix density
sum(counts(heart) > 0) / {dim(heart)[1] * dim(heart)[2]}

Use “pre-compiled” datasets

The scRNAseq package (from Bioconductor) allows one to import public datasets directly in R.

Question
zeisel <- scRNAseq::ZeiselBrainData()
Question
  • How many cells were sequenced in this dataset? How many genes are profiled?
  • What is the distribution of genes being detected per cell, and of number of unique transcripts being detected per cell?
  • What is the sparsity of the data (in other words, how dense is the count matrix)?
  • Compare with 10X Genomics-provided data. Comment on the differences. What was the single-cell sequencing technique used in Zeisel et al.?
# Experiment size 
dim(zeisel)

# Genes / transcripts detected per cell
zeisel <- scuttle::addPerCellQCMetrics(zeisel)
zeisel <- scuttle::addPerFeatureQCMetrics(zeisel)
quantile(zeisel$sum, seq(0, 1, 0.1))
quantile(zeisel$detected, seq(0, 1, 0.1))
quantile(rowData(zeisel)$detected, seq(0, 1, 0.1))

# Count matrix density
sum(counts(zeisel) > 0) / {dim(zeisel)[1] * dim(zeisel)[2]}
Question
  • What are the different annotations available for the cells?
  • What are the tissues used for cell profiling? Which type of cells come from which tissue?
# Check cell type annotations
colData(zeisel)
table(zeisel$level1class)
table(zeisel$level2class)
table(zeisel$level2class, zeisel$level1class)

# Check tissue of origin
table(zeisel$tissue)
table(zeisel$level1class, zeisel$tissue)

UMI number / cell

A useful diagnostic for scRNAseq data is the barcode rank plot, which shows the (log-)total UMI count for each barcode on the y-axis and the (log-)rank on the x-axis. This is effectively a transposed empirical cumulative density plot with log-transformed axes. It is useful as it allows users to examine the distribution of total counts across barcodes, focusing on those with the largest counts.

This diagnostic plot can be generated using DropletUtils package, notably the barcodeRanks() function.

Question
  • Try to use barcodeRanks() function to compute and plot UMI # / cell across all the cells in either 10X Genomics-generated data or Zeisel’s data.
  • Comment the difference. Again, what are the important differences in term of single-cell approaches used here?
library(tidyverse)
heart_barcoderanks <- DropletUtils::barcodeRanks(heart)
zeisel_barcoderanks <- DropletUtils::barcodeRanks(zeisel)
p <- list(
    heart = as_tibble(heart_barcoderanks), 
    zeisel = as_tibble(zeisel_barcoderanks)
) %>% 
    bind_rows(.id = 'dataset') %>% 
    ggplot(aes(x = rank, y = total, group = dataset, col = dataset)) + 
    geom_line() + 
    scale_y_log10(
        breaks = scales::trans_breaks("log10", function(x) 10^x),
        labels = scales::trans_format("log10", scales::math_format(10^.x))
    ) +
    scale_x_log10(
        breaks = scales::trans_breaks("log10", function(x) 10^x),
        labels = scales::trans_format("log10", scales::math_format(10^.x))
    ) +
    theme_bw() + 
    labs(x = 'Cells ranked by total UMI', y = 'Total UMI count / cell')

2. Filter emtpy and doublet droplets

An important step when getting your hands on droplet-based single-cell RNA-seq data is to be confident you are working with actual cell data. This means knowing how to deal with/remove (1) emtpy droplets and (2) droplets containing doublets.

Removing empty droplets

The ambient RNA “soup” sometimes makes it difficult to differentiate empty droplets from droplets containing cells with low amounts of RNA.

emptyDrops() function from the DropletUtils package provides a methodology to (1) estimate the ambient RNA contamination and then (2) compute a probability that each droplet contains a cell.

Since emptyDrops() assumes that most of the droplets in a matrix are empty, one needs to start the analysis from the raw unfiltered count matrix provided by 10X Genomics. Download the E18 mouse heart scRNAseq raw unfiltered count matrix from 10X Genomics here.

download.file(
    url = 'https://cf.10xgenomics.com/samples/cell-exp/3.0.0/heart_1k_v3/heart_1k_v3_raw_feature_bc_matrix.h5', 
    destfile = 'data/E18_Heart/cellranger/E18_Heart/heart_1k_v3_raw_feature_bc_matrix.h5', 
    mode = 'wb'
)
heart_raw <- DropletUtils::read10xCounts('data/E18_Heart/cellranger/E18_Heart/heart_1k_v3_raw_feature_bc_matrix.h5')
heart_raw
dim(heart_raw)
Question

Use emptyDrops() to differentiate empty droplets from cell-containing droplets.

Be aware, the empty droplet detection step is quite lenghty! After all, you are scanning several millions of droplets, most of them empty! To fasten the process, you can specify a number of cpus to use in parallel, with the BiocParallel::MulticoreParam() function.

# emptyDrops performs Monte Carlo simulations to compute p-values, so we need to set the seed to obtain reproducible results.
library(DropletUtils)
set.seed(100)
# Do not run if not on an HPC cluster: 
# heart_droplets <- emptyDrops(counts(heart_raw), BPPARAM = BiocParallel::MulticoreParam(workers = 40))
heart_droplets
table(heart_droplets$FDR <= 0.001)
heart_filtered <- heart_raw[, which(heart_droplets$FDR <= 0.001)]

Even with multiple cpus, the computation can take up to several hours. So if you wish to skip this step for now, you can use the cellranger-automatically filtered matrix for now. It is not exactly equivalent, but does a fairly good job at finding non-empty droplets and I would recommend sticking to it for the beginning.

heart_filtered <- heart

Flagging cell doublets

Another artifact emerging from non-perfect experimental steps is the sequencing of two cells contained within a single droplet. This can occur when many cells are sequenced on a single 10X Genomics cassette (doublet increase of 1% per 1,000 cells sequenced). This can also occur when cells are not in a perfect single cell suspension.

A way to identify cell doublets is to artifically mix thousands of pairs of cells (columns) of a SingleCellExperiment object, then compare the resulting cells to each cell in the original dataset. Original cells which resemble a lot the artificial doublets are likely doublet themselves.

Question
  • Read scDblFinder documentation here.
  • Use scDblFinder() function to flag probable cell doublets in manually filtered heart dataset.
#BiocManager::install('scDblFinder')
library(scDblFinder)
heart_filtered <- scDblFinder(heart_filtered)
colData(heart_filtered)
table(heart_filtered$scDblFinder.class)

For now, we can keep these doublets. We will see in the future whether we remove them or not.

3. (Optional) Exclude non-relevant genes from analysis

The 10X Genomics-provided count matrix contains 31053 annotated genes. However, there are likely less than 20,000 of them which are genomic, protein-coding, expressed genes. We can filter genes based on the location, biotype and overall detection in the dataset.

Question
  • Recover gene annotations as gtf from ensembl uing the AnnotationHub
  • Filter to only get protein-coding, ENSEMBL+HAVANA-annotated genomic genes
library(plyranges)
ah <- AnnotationHub::AnnotationHub()
AnnotationHub::query(ah, c('gene annotation', 'ensembl', '102', 'mus_musculus', 'GRCm38'))
gtf <- AnnotationHub::query(ah, c('Mus_musculus.GRCm38.102.chr.gtf'))[[1]]
genes <- gtf %>% 
    filter(type == 'gene') %>% 
    filter(gene_biotype == 'protein_coding') %>% 
    filter(gene_source == 'ensembl_havana')
Question

Filter genes from the SingleCellExperiment dataset to only protein-coding, ENSEMBL+HAVANA-annotated genomic genes

names(genes) <- genes$gene_id
table(rownames(heart_filtered) %in% names(genes))
heart_filtered <- heart_filtered[rownames(heart_filtered) %in% names(genes), ]
gr <- genes[rownames(heart_filtered)]
mcols(gr) <- cbind(mcols(gr), rowData(heart_filtered))[, c('gene_id', 'gene_name', 'mean', 'detected')]
rowRanges(heart_filtered) <- gr
Question

Filter remaining genes to only keep those detected in at least 10 cells

quantile(rowSums(counts(heart_filtered) > 0), seq(0, 1, 0.1))
table( rowSums(counts(heart_filtered) > 0) >= 10 )
heart_filtered <- heart_filtered[rowSums(counts(heart_filtered) > 0) >= 10, ]

4. Normalize data

Normalization can be done two ways:

  • A crude sequencing depth normalization followed by log-transformation. This is usually referred to as “log normalizing”.
  • A more advanced (and probably more accurate) approach is the variance stabilizing transformation. This aims at removing the relationship between levels at which a gene is expressed and the variance of its expression.

Log-normalization

Just like in bulk high-throughput sequencing experiments, scRNAseq counts have to be normalized to the sequencing depth for each cell. We can define the library size (a.k.a. size factor )as the total sum of counts across all genes for each cell. However, this relies on the assumption that within the entire dataset, most genes are non-differentially expressed and expressed roughly within the same range. Depending on the set up of the scRNAseq experiment, this can be entirely false. To avoid relying on this hypothesis, we can (1) quickly pre-cluster cells, then (2) normalize cells using their library size factor separately in each cluster, then (3) rescaling size factors so that they are comparable across clusters.

Question

Read documentation for scran functions quickCluster() and computeSumFactors(). Compute size factors for each cell in the manually filtered E18 mouse heart scRNAseq dataset

clusters <- scran::quickCluster(heart_filtered)
table(clusters)
heart_filtered <- scran::computeSumFactors(heart_filtered, cluster = clusters)
colData(heart_filtered)
head(heart_filtered$sizeFactor)
quantile(heart_filtered$sizeFactor, seq(0, 1, 0.1))
Question

Compare the size factor to the total count of UMI / cell. Comment.

heart_filtered <- scuttle::addPerCellQCMetrics(heart_filtered)
heart_filtered <- scuttle::addPerFeatureQCMetrics(heart_filtered)
p <- tibble(
    cell = heart_filtered$Barcode,
    totUMIs = heart_filtered$total, 
    sizeFactor = heart_filtered$sizeFactor
) %>% 
    ggplot(aes(x = totUMIs, y = sizeFactor)) + 
    geom_point() + 
    scale_y_log10(
        breaks = scales::trans_breaks("log10", function(x) 10^x),
        labels = scales::trans_format("log10", scales::math_format(10^.x))
    ) +
    scale_x_log10(
        breaks = scales::trans_breaks("log10", function(x) 10^x),
        labels = scales::trans_format("log10", scales::math_format(10^.x))
    ) +
    theme_bw() + 
    labs(x = 'Total UMI', y = 'Size factors')
Question

Using the computed size factors, perform log-normalization of the data. Read scuttle::logNormCounts() documentation if needed.

heart_filtered <- scuttle::logNormCounts(heart_filtered)
assays(sce)
logcounts(heart_filtered)[1:10, 1:10]
p <- tibble(
    count = c(counts(heart_filtered)), 
    logcount = c(logcounts(heart_filtered))
) %>% 
    filter(count > 0) %>% 
    ggplot(aes(x = count, y = logcount)) + 
    ggrastr::geom_point_rast() + 
    scale_y_log10(
        breaks = scales::trans_breaks("log10", function(x) 10^x),
        labels = scales::trans_format("log10", scales::math_format(10^.x))
    ) +
    scale_x_log10(
        breaks = scales::trans_breaks("log10", function(x) 10^x),
        labels = scales::trans_format("log10", scales::math_format(10^.x))
    ) +
    theme_bw() + 
    labs(x = 'Raw counts', y = 'log-normalized counts')

[BONUS] VST normalization

Question
  • Quickly read the extensive vignette about scRNAseq normalization using variance stabilizing transformation here.
  • First, check the relationship between (1) mean gene expression and gene expression variance and (2) mean gene expression and gene detection rate, in the manually filtered E18 mouse heart scRNAseq count matrix.
cnts <- as(SingleCellExperiment::counts(heart_filtered), 'dgCMatrix')
colnames(cnts) <- heart_filtered$Barcode
rownames(cnts) <- rownames(heart_filtered)
#
df <- tibble(
    gene = rownames(cnts), 
    detection_rate = rowMeans(cnts > 0),
    mean = rowMeans(cnts), 
    variance = apply(cnts, 1, var)
)
p1 <- ggplot(df, aes(x = mean, y = variance)) + 
    geom_point(alpha = 0.3) + 
    geom_density_2d(size = 0.3) + 
    geom_abline(intercept = 0, slope = 1, color = "red") + 
    scale_y_log10(
        breaks = scales::trans_breaks("log10", function(x) 10^x),
        labels = scales::trans_format("log10", scales::math_format(10^.x))
    ) +
    scale_x_log10(
        breaks = scales::trans_breaks("log10", function(x) 10^x),
        labels = scales::trans_format("log10", scales::math_format(10^.x))
    ) +
    labs(x = 'Gene expression mean', y = 'Gene expression variance') +
    theme_bw() 
p2 <- ggplot(df, aes(x = mean, y = detection_rate)) + 
    geom_point(alpha = 0.3) + 
    geom_density_2d(size = 0.3) + 
    scale_x_log10(
        breaks = scales::trans_breaks("log10", function(x) 10^x),
        labels = scales::trans_format("log10", scales::math_format(10^.x))
    ) +
    labs(x = 'Gene expression mean', y = 'Gene detection rate') +
    theme_bw() 
p <- cowplot::plot_grid(p1, p2, nrow = 1)
Question

Apply sctransform::vst() function on raw counts from manually filtered E18 mouse heart scRNAseq count matrix.

heart_vst <- sctransform::vst(cnts, return_cell_attr = TRUE)
Question
  • Using variance-stabilized residuals, correct the raw counts in the heart scRNAseq count matrix. You will need sctransform::correct() function to do this
  • Store the corrected counts in an assay named corrected_counts
  • Log-transform the corrected_counts using log1p() function and store the transformed counts in an assay named logcounts_vst
corrected_cnts <- sctransform::correct(heart_vst)
heart_filtered <- heart_filtered[rownames(corrected_cnts),]
assay(heart_filtered, 'corrected_counts', withDimnames = FALSE) <- corrected_cnts
assay(heart_filtered, 'logcounts_vst', withDimnames = FALSE) <- log1p(corrected_cnts)
Question
  • Once this is done, check again the relationship between mean gene expression and gene expression variance.
  • Check how the count variance now varies with increasing mean gene counts. Comment.
  • Check how the detection rate now varies with increasing mean gene counts. Comment.
df <- rbind(
    tibble(
        gene = rownames(cnts), 
        detection_rate = rowMeans(cnts > 0),
        mean = rowMeans(cnts), 
        variance = apply(cnts, 1, var), 
        normalization = 'raw'
    ),
    tibble(
        gene = rownames(corrected_cnts), 
        detection_rate = rowMeans(corrected_cnts > 0),
        mean = rowMeans(corrected_cnts), 
        variance = apply(corrected_cnts, 1, var), 
        normalization = 'vst_corrected'
    )
)
p1 <- ggplot(df, aes(x = mean, y = variance)) + 
    geom_point(alpha = 0.3) + 
    geom_density_2d(size = 0.3) + 
    geom_abline(intercept = 0, slope = 1, color = "red") + 
    scale_y_log10(
        breaks = scales::trans_breaks("log10", function(x) 10^x),
        labels = scales::trans_format("log10", scales::math_format(10^.x))
    ) +
    scale_x_log10(
        breaks = scales::trans_breaks("log10", function(x) 10^x),
        labels = scales::trans_format("log10", scales::math_format(10^.x))
    ) +
    labs(x = 'Gene expression mean', y = 'Gene expression variance') +
    theme_bw() + 
    facet_grid(~normalization) 
p2 <- ggplot(df, aes(x = mean, y = detection_rate)) + 
    geom_point(alpha = 0.3) + 
    geom_density_2d(size = 0.3) + 
    scale_x_log10(
        breaks = scales::trans_breaks("log10", function(x) 10^x),
        labels = scales::trans_format("log10", scales::math_format(10^.x))
    ) +
    labs(x = 'Gene expression mean', y = 'Gene detection rate') +
    theme_bw() + 
    facet_grid(~normalization)
p <- cowplot::plot_grid(p1, p2, nrow = 2)
Back to top