Exercises: scRNAseq analysis with R/Bioconductor (2/3)

Goals:


1. Pre-processing PBMC dataset

We will prepare scRNAseq data from a PBMC run, provided by 10X and hosted by Bioconductor as a package.

Preparing dataset

Question
  • Which package from Bioconductor gives streamlined access to PBMC scRNAseq dataset from 10X Genomics?
  • Import the 4K PBMCs dataset provided by 10X Genomics directly in R.
  • What does the object contain (type of data, number of cells, batches, organism, …)?
pbmc <- TENxPBMCData::TENxPBMCData('pbmc4k')
rownames(pbmc) <- scuttle::uniquifyFeatureNames(rowData(pbmc)$ENSEMBL_ID, rowData(pbmc)$Symbol_TENx)
pbmc
rowData(pbmc)
colData(pbmc)
table(pbmc$Library)

Remove doublets and filter non-relevant genes

Question

Use scDblFinder to flag and remove cell doublets

pbmc <- scDblFinder::scDblFinder(pbmc)
table(pbmc$scDblFinder.class)
pbmc <- pbmc[, pbmc$scDblFinder.class == 'singlet']

You will then need to import gene annotations (from the right organism!) in R, to then filter out irrelevant genes.

Question
  • Get gene loci from Ensembl using AnnotationHub
  • Filter to only get protein-coding, ENSEMBL+HAVANA-annotated genomic genes
  • Further remove genes that are not expressed in at least 10 cells
# Annotate genes in pbmc dataset
library(plyranges)
ah <- AnnotationHub::AnnotationHub()
AnnotationHub::query(ah, c('gene annotation', 'ensembl', '102', 'homo_sapiens', 'GRCh38'))
gtf <- AnnotationHub::query(ah, c('Homo_sapiens.GRCh38.102.chr.gtf'))[[1]]
genes <- gtf %>% 
    filter(type == 'gene') %>% 
    filter(gene_biotype == 'protein_coding') %>% 
    filter(gene_source == 'ensembl_havana')

pbmc <- pbmc[genes$gene_id[genes$gene_id %in% rownames(pbmc)], ]
rowRanges(pbmc) <- genes[match(rownames(pbmc), genes$gene_id)]
rowData(pbmc) <- rowData(pbmc)[, c('gene_name', 'gene_id')]
rownames(pbmc) <- scuttle::uniquifyFeatureNames(rowData(pbmc)$gene_id, rowData(pbmc)$gene_name)

# Genes / transcripts detected per cell
pbmc <- scuttle::addPerCellQCMetrics(pbmc)
pbmc <- scuttle::addPerFeatureQCMetrics(pbmc)

# Remove genes not expressed in at least 10 cells
pbmc <- pbmc[rowSums(counts(pbmc) > 0) >= 10, ]

Normalize counts using sctransform

cnts <- as(SingleCellExperiment::counts(pbmc), 'dgCMatrix')
colnames(cnts) <- pbmc$Barcode
rownames(cnts) <- rownames(pbmc)
pbmc_vst <- sctransform::vst(cnts, return_cell_attr = TRUE)
corrected_cnts <- sctransform::correct(pbmc_vst)
assay(pbmc, 'corrected_counts', withDimnames = FALSE) <- corrected_cnts
assay(pbmc, 'logcounts', withDimnames = FALSE) <- log1p(corrected_cnts)

2. Dimensionality reduction

Selection of hyper-variable genes (HVGs)

Dimensionality reduction compare cells based on their gene expression profiles. The choice of genes to include in this comparison may have a major impact on the performance of downstream methods. Ideally, one wants to only select genes that contain useful information about the biology of the system while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure.

The simplest approach to feature selection is to simply compute the variance of the log-normalized expression values, to select the most variable genes. Modelling of the mean-variance relationship can be achieved by the modelGeneVar() function from the scran package.

Question
  • Read more about scran::modelGeneVar() online
  • Model gene variance ~ gene average expression. What is the range of biological variance and technical variance?
# Fit the gene variance as a function of the gene mean expression
pbmc_variance <- scran::modelGeneVar(pbmc)
pbmc_variance
quantile(pbmc_variance$bio, seq(0, 1, 0.1))
quantile(pbmc_variance$tech, seq(0, 1, 0.1))

# Visualizing the mean-variance fit
require(tidyverse)
df <- tibble(
    mean = metadata(pbmc_variance)$mean, 
    var = metadata(pbmc_variance)$var, 
    trend = metadata(pbmc_variance)$trend(mean), 
)
p <- ggplot(df) + 
    geom_point(aes(x = mean, y = var), alpha = 0.4) + 
    geom_line(aes(x = mean, y = trend), col = 'darkred') +
    theme_minimal() + 
    labs(x = 'Gene mean exp. (norm.)', y = 'Gene exp. variance')
Question
  • Extract the 20% genes with the highest biological variance.
  • Plot gene variance ~ gene average expression, coloring genes which are flagged as HVGs.
HVGs <- scran::getTopHVGs(pbmc_variance, prop = 0.1)
rowData(pbmc)$isHVG <- rownames(pbmc) %in% HVGs
head(rowData(pbmc))
table(rowData(pbmc)$isHVG)

# Visualizing the mean-variance fit, coloring HVGs
df <- tibble(
    mean = metadata(pbmc_variance)$mean, 
    var = metadata(pbmc_variance)$var, 
    trend = metadata(pbmc_variance)$trend(mean), 
    HVG = rowData(pbmc)$isHVG
)
p <- ggplot(df) + 
    geom_point(aes(x = mean, y = var, col = HVG), alpha = 0.4) + 
    geom_line(aes(x = mean, y = trend), col = 'darkred') +
    theme_minimal() + 
    labs(x = 'Gene mean exp. (norm.)', y = 'Gene exp. variance')

Embedding in a lower dimensional linear space

We now have normalized counts filtered for the top 20% genes varying with the greatest biological significance.
Still, that represents a ~ 1,000 genes x ~4000 cells dataset. This is still too big to reliably use in standard clustering approaches. We can further compress the dataset. The most widely used approach is PCA: it computes a small number of “components” (typically 5-50) optimally summarizing the variability of the whole dataset, while retaining linearity of the underlying numerical data and being computationallt quite efficient.

Question
  • Read scater::denoisePCA() documentation. What is the benefit of this function compared to runPCA()?
  • Leverage scater package to compute PCA embedding of the filtered data, by taking into account the technical variability.
pbmc <- scran::denoisePCA(
    pbmc, 
    technical = pbmc_variance, 
    subset.row = HVGs, 
    min.rank = 15
)
dim(as.data.frame(reducedDim(pbmc)))
head(as.data.frame(reducedDim(pbmc)))
p <- cowplot::plot_grid(
    scater::plotReducedDim(pbmc, 'PCA', colour_by = 'detected'),
    scater::plotReducedDim(pbmc, 'PCA', colour_by = 'sum')
)
Question

Check levels of gene expression for few genes (e.g. CD8A, MS4A1, …) using PCA embedding for visualization. Comment

p <- cowplot::plot_grid(
    scater::plotReducedDim(pbmc, 'PCA', colour_by = 'CD8A'),
    scater::plotReducedDim(pbmc, 'PCA', colour_by = 'MS4A1'),
    scater::plotReducedDim(pbmc, 'PCA', colour_by = 'PPBP'),
    scater::plotReducedDim(pbmc, 'PCA', colour_by = 'FCER1A')
)

3. Clustering

Clustering is an unsupervised learning procedure used in scRNA-seq data analysis to empirically define groups of cells with similar expression profiles. Its primary purpose is to summarize the data in a digestible format for human interpretation.

After annotation based on marker genes, the clusters can be treated as proxies for more abstract biological concepts such as cell types or states. Clustering is thus a critical step for extracting biological insights from scRNA-seq data.

Clustering algorithms

Three main approaches can be used:

  1. Hierarchical clustering
  2. k-means clustering
  3. Graph-based clustering

Today, we will focus on graph-based clustering, as it is becoming the standard for scRNAseq: it is a flexible and scalable technique for clustering even the largest scRNA-seq datasets. We first build a graph where each node is a cell that is connected by edges to its nearest neighbors in the high-dimensional space. Edges are weighted based on the similarity between the cells involved, with higher weight given to cells that are more closely related.

Question

Compute graph-based clustering of the PBMC dataset.

graph <- scran::buildSNNGraph(pbmc, use.dimred = 'PCA')
pbmc_clust <- igraph::cluster_louvain(graph)$membership
table(pbmc_clust)
pbmc$clusters_graph <- factor(pbmc_clust)
Question
  • What are the main parameters to choose? How do they impact the clustering?
  • Try a non-default value for k argument. What is the impact on the clustering?
# Re-compute a graph changing the `k` parameter, and identify resulting clusters
graph2 <- scran::buildSNNGraph(pbmc, k = 50, use.dimred = 'PCA')
pbmc_clust2 <- igraph::cluster_louvain(graph2)$membership
pbmc$clusters_graph_2 <- factor(pbmc_clust2)

# Compare original and new clusters
table(pbmc_clust, pbmc_clust2)

# Visually compare original and new clusters
p <- cowplot::plot_grid(
    scater::plotReducedDim(pbmc, 'PCA', colour_by = 'clusters_graph', text_by = 'clusters_graph') + ggtitle('SNN-graph clustering (louvain), k = 10'),
    scater::plotReducedDim(pbmc, 'PCA', colour_by = 'clusters_graph_2', text_by = 'clusters_graph_2') + ggtitle('SNN-graph clustering (louvain), k = 50')
)

Dimensional reduction for clustering visualization

PCA is a powerful linear approach to compress large datasets into smaller dimensional spaces. However, it struggles at emphasizing the existence of clusters in complex datasets, when visualized in 2D.

scater provides a handy way to perform more complex data embeddings:

- tSNE
- UMAP
- Diffusion Map
- Multi-Dimensional Scaling (MDS)
- Non-negative Matrix Factorization (NMF)
Question
  • Explore the different dimensional reduction algorithms, trying different hyperparameters combinations.
  • When you run these commands, pay attention to how long each command takes to run!
  • While this run, check the Help page for each function (e.g. ?scater::runTSNE)
reducedDims(pbmc)
pbmc <- scater::runTSNE(pbmc)
pbmc <- scater::runUMAP(pbmc)
pbmc <- scater::runDiffusionMap(pbmc, dimred = 'PCA')
reducedDims(pbmc)
reducedDim(pbmc, 'DiffusionMap')[1:10, ]
Question
p<- cowplot::plot_grid(
    scater::plotReducedDim(pbmc, 'PCA', colour_by = 'clusters_graph') + ggtitle('denoised PCA'),
    scater::plotReducedDim(pbmc, 'TSNE', colour_by = 'clusters_graph') + ggtitle('tSNE'),
    scater::plotReducedDim(pbmc, 'UMAP', colour_by = 'clusters_graph') + ggtitle('UMAP'),
    scater::plotReducedDim(pbmc, 'DiffusionMap', colour_by = 'clusters_graph') + ggtitle('Diffusion') 
)
Back to top