9 Lab 5: Dimension reduction, clustering and annotation

9.1 Dimensional reduction for clustering

9.1.1 Preparing dataset

We will prepare scRNAseq data from a PBMC run, provided by 10X and hosted by Bioconductor as a package.

Which package from Bioconductor gives streamlined access to PBMC scRNAseq dataset from 10X Genomics?
What does the object contain (type of data, number of cells, batches, organism, …)? Can you get the same data from somewhere else?

library(tidyverse)
library(SingleCellExperiment)
sce <- TENxPBMCData::TENxPBMCData('pbmc4k')
rownames(sce) <- scuttle::uniquifyFeatureNames(rowData(sce)$ENSEMBL_ID, rowData(sce)$Symbol_TENx)
sce
rowData(sce)
colData(sce)
table(sce$Library)

9.1.2 Normalize counts using `scran`

Just like in bulk high-throughput sequencing experiments, scRNAseq counts have to be normalized to the sequencing depth for each cell. We can define the library size as the total sum of counts across all genes for each cell, the expected value of which is assumed to scale with any cell-specific biases. However, this relies on the assumption that within the entire dataset, most genes are non-differentially expressed and expressed roughly within the same range. Depending on the set up of the scRNAseq experiment, this can be entirely false. To avoid relying on this hypothesis, we can (1) quickly pre-cluster cells, then (2) normalize cells using their library size factor separately in each cluster, then (3) rescaling size factors so that they are comparable across clusters.

All of this can be done very simply using the combo quickCluster() + computeSumFactors() + logNormCounts() from scran/scuttle packages.

clusters <- scran::quickCluster(sce)
table(clusters)
sce <- scran::computeSumFactors(sce, cluster = clusters)
colData(sce)
sce <- scuttle::logNormCounts(sce)
assays(sce)

9.1.3 Feature selection

We often use scRNAseq data in exploratory analyses to characterize heterogeneity across cells. Procedures like clustering and dimensionality reduction compare cells based on their gene expression profiles. The choice of genes to include in this comparison may have a major impact on the performance of downstream methods. Ideally, one wants to only select genes that contain useful information about the biology of the system while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure.

The simplest approach to feature selection is to compute the variance of the log-normalized expression values, to select the most variable genes. Modelling of the mean-variance relationship can be achieved by the modelGeneVar() function from the scran package.

sce_filtered_variance <- scran::modelGeneVar(sce)
HVGs <- scran::getTopHVGs(sce_filtered_variance, prop = 0.1)
rowData(sce)$isHVG <- rownames(sce) %in% HVGs
head(rowData(sce))
table(rowData(sce)$isHVG)

## --- Visualizing the mean-variance fit
df <- tibble(
    mean = metadata(sce_filtered_variance)$mean, 
    var = metadata(sce_filtered_variance)$var, 
    trend = metadata(sce_filtered_variance)$trend(mean), 
    HVG = rowData(sce)$isHVG
)
ggplot(df) + 
    geom_point(aes(x = mean, y = var, col = HVG), alpha = 0.4) + 
    geom_line(aes(x = mean, y = trend), col = 'darkred') +
    theme_minimal() + 
    labs(x = 'Gene mean exp. (norm.)', y = 'Gene exp. variance')

9.1.4 PCA on filtered dataset

We now have normalized counts filtered for the top 500 genes varying with the greatest biological significance.
Still, that represents a 500 x nCells (~8,000) dataset (each row being a feature). This is still too big to reliably use in standard clustering approaches. We can further compress the dataset. The most widely used approach is PCA: it computes a small number of “components” (typically 5-50) optimally summarizing the variability of the whole dataset, while retaining linearity of the underlying numerical data and being computationallt quite efficient.

Leverage scater package to compute a PCA embedding of the filtered data by taking into account the technical variability.

sce <- scran::denoisePCA(
    sce, 
    technical = sce_filtered_variance, 
    subset.row = HVGs, 
    min.rank = 15
)
p <- scater::plotReducedDim(sce, 'PCA', colour_by = 'sizeFactor') + ggtitle('denoised PCA')
p

9.2 Clustering

Clustering is an unsupervised learning procedure that is used in scRNA-seq data analysis to empirically define groups of cells with similar expression profiles. Its primary purpose is to summarize the data in a digestible format for human interpretation.

After annotation based on marker genes, the clusters can be treated as proxies for more abstract biological concepts such as cell types or states. Clustering is thus a critical step for extracting biological insights from scRNA-seq data.

9.2.1 Clustering algorithms

Three main approaches can be used:

Hierarchical clustering
k-means clustering
Graph-based clustering

Today, we will focus on graph-based clustering, as it is becoming the standard for scRNAseq: it is a flexible and scalable technique for clustering even the largest scRNA-seq datasets. We first build a graph where each node is a cell that is connected by edges to its nearest neighbors in the high-dimensional space. Edges are weighted based on the similarity between the cells involved, with higher weight given to cells that are more closely related.

Compute graph-based clustering of the PBMC dataset.

graph <- scran::buildSNNGraph(
    sce, 
    k = 5, 
    use.dimred = 'PCA'
)
sce_clust <- igraph::cluster_louvain(graph)$membership
table(sce_clust)
sce$clusters_graph <- factor(sce_clust)

What are the main parameters to choose? How do they impact the clustering?

graph2 <- scran::buildSNNGraph(
    sce, 
    k = 50, 
    use.dimred = 'PCA'
)
sce_clust2 <- igraph::cluster_louvain(graph2)$membership
table(sce_clust, sce_clust2)

9.2.2 Dimensional reduction for clustering visualization

PCA is a powerful linear approach to compress large datasets into smaller dimensional spaces. However, it struggles at emphasizing the existence of clusters in complex datasets, when visualized in 2D.

scater provides a handy way to perform more complex data embeddings:

- tSNE
- UMAP
- Diffusion Map
- Multi-Dimensional Scaling (MDS)
- Non-negative Matrix Factorization (NMF)

Explore the different dimensional reduction algorithms, trying different hyperparameters combinations.

When you run these commands, pay attention to how long each command takes to run! While this run, check the Help page for each function (e.g. ?runTSNE)

reducedDims(sce)
sce <- scater::runTSNE(sce, subset_row = HVGs)
sce <- scater::runUMAP(sce, subset_row = HVGs)
reducedDims(sce)
reducedDim(sce, 'UMAP')[1:10, ]

Use the scater::plotReducedDim() function to plot cells in each embedding. Comment.

library(patchwork)
p<- scater::plotReducedDim(sce, 'PCA', colour_by = 'clusters_graph') + ggtitle('denoised PCA') +
    scater::plotReducedDim(sce, 'TSNE', colour_by = 'clusters_graph') + ggtitle('tSNE') +
    scater::plotReducedDim(sce, 'UMAP', colour_by = 'clusters_graph') + ggtitle('UMAP')

9.2.3 For the pros of clustering… Compare different clustering approaches

Leveraging the bluster package, different clustering approaches can be performed using a uniformed syntax, to compare their output.

Using clusterSweep(), compare the effect of different k neighbor values when performing graph-based clustering.

clusters <- bluster::clusterSweep(
    reducedDim(sce, 'PCA'), 
    BLUSPARAM = bluster::SNNGraphParam(),
    k = c(5L, 15L, 25L, 50L), 
    cluster.fun = c("louvain")
)
colnames(clusters$clusters)
head(clusters$clusters)
clusters$parameters
library(ggraph)
p <- cowplot::plot_grid(
    clustree::clustree(
        clusters$clusters %>% setNames(1:ncol(.)) %>% as.data.frame(),
        prefix = 'X',
        edge_arrow=FALSE
    ), 
    cowplot::plot_grid(
        scater::plotReducedDim(sce, 'TSNE', colour_by = I(clusters$clusters[, 'k.5_cluster.fun.louvain'])) + ggtitle('k = 5'),
        scater::plotReducedDim(sce, 'TSNE', colour_by = I(clusters$clusters[, 'k.15_cluster.fun.louvain'])) + ggtitle('k = 15'),
        scater::plotReducedDim(sce, 'TSNE', colour_by = I(clusters$clusters[, 'k.25_cluster.fun.louvain'])) + ggtitle('k = 25'),
        scater::plotReducedDim(sce, 'TSNE', colour_by = I(clusters$clusters[, 'k.50_cluster.fun.louvain'])) + ggtitle('k = 50')
    ), 
    nrow = 2, 
    rel_heights = c(0.3, 0.7)
)
table(clusters$clusters[, 'k.5_cluster.fun.louvain'])

9.3 Cell annotation

9.3.1 Find marker genes

To interpret clustering results, one needs to identify the genes that drive separation between clusters. These marker genes allow to assign biological meaning to each cluster based on their functional annotation. In the most obvious case, the marker genes for each cluster are a priori associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity.

A general strategy is to perform DE tests between pairs of clusters and then combine results into a single ranking of marker genes for each cluster.

markers <- scran::findMarkers(sce, groups = sce$clusters_graph)

Find markers strongly overexpressed in each cluster. Check ?scran::findMarkers to find the right options to use.

markers <- scran::findMarkers(
    sce, 
    groups = sce$clusters_graph, 
    direction = "up", 
    lfc = 1
)
head(markers[[1]])
markers <- lapply(markers, function(df) {
    rownames(df[df$Top <= 5,])
})

Plot average expression of the first marker of the first cluster in UMAP

p <- scater::plotReducedDim(sce, 'TSNE', colour_by = markers[[2]][[1]])

9.3.2 Automated cell annotation

Many cell type reference databases are available over the Internet. Today, we will use a reference constructed from Blueprint and ENCODE data (Martens and Stunnenberg 2013; The ENCODE Project Consortium 2012). This reference is available as a SummarizedExperiment containing log-normalized gene expression for manually annotated samples.

ref <- celldex::BlueprintEncodeData()
prediction_types <- SingleR::SingleR(
    test = sce, 
    ref = ref, 
    labels = ref$label.main
)
sce$annotation <- prediction_types$labels
table(sce$annotation)
table(sce$annotation, sce$clusters_graph)

Using scater and SingleR utilities, visually compare the annotation scores for cells in each cluster.
Did the automated annotation work robuslty? How does it compare to our clustering? Is automated annotation as sensitive as graph-based clustering?

p <- SingleR::plotScoreHeatmap(prediction_types)
p <- scater::plotReducedDim(sce, 'TSNE', colour_by = 'annotation') + ggtitle('Automated annotation')
p <- pheatmap::pheatmap(
    log2(table(Annotation = sce$annotation, Cluster = sce$clusters_graph)+10), 
    color = colorRampPalette(c("white", "darkred"))(101)
)

9.4 Bonus

Try to fill in the analysis template in bin/prepare_Ernst.R to execute the different processing/analysis steps we covered in the previous exercises and this one. If you prefer using Seurat, don’t hesitate to modify the base template!

9.5 Acknowledgements

This exercise was adapted from Chapts. 7-12 of Orchestrating Single-Cell Analysis with Bioconductor.

9.6 Session info

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.0 (2023-04-21)
 os       macOS Monterey 12.5.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_GB.UTF-8
 ctype    en_GB.UTF-8
 tz       Europe/Paris
 date     2023-06-08
 pandoc   2.19.2 @ /opt/homebrew/bin/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package                * version   date (UTC) lib source
 AnnotationDbi          * 1.61.0    2022-11-01 [1] Bioconductor
 AnnotationHub          * 3.7.3     2023-03-01 [1] Bioconductor
 Biobase                * 2.59.0    2022-11-01 [1] Bioconductor
 BiocFileCache          * 2.7.2     2023-02-17 [1] Bioconductor
 BiocGenerics           * 0.45.0    2022-11-01 [1] Bioconductor
 BiocManager            * 1.30.20   2023-02-24 [1] CRAN (R 4.3.0)
 BiocVersion              3.17.1    2022-12-20 [1] Bioconductor
 Biostrings               2.67.0    2022-11-01 [1] Bioconductor
 bit                      4.0.5     2022-11-15 [1] CRAN (R 4.3.0)
 bit64                    4.0.5     2020-08-30 [1] CRAN (R 4.3.0)
 bitops                   1.0-7     2021-04-24 [1] CRAN (R 4.3.0)
 blob                     1.2.3     2022-04-10 [1] CRAN (R 4.3.0)
 cachem                   1.0.7     2023-02-24 [1] CRAN (R 4.3.0)
 callr                    3.7.3     2022-11-02 [1] CRAN (R 4.3.0)
 cli                      3.6.0     2023-01-09 [1] CRAN (R 4.3.0)
 colorspace               2.1-0     2023-01-23 [1] CRAN (R 4.3.0)
 cowplot                * 1.1.1     2020-12-30 [1] CRAN (R 4.3.0)
 crayon                   1.5.2     2022-09-29 [1] CRAN (R 4.3.0)
 curl                     5.0.0     2023-01-12 [1] CRAN (R 4.3.0)
 DBI                      1.1.3     2022-06-18 [1] CRAN (R 4.3.0)
 dbplyr                 * 2.3.1     2023-02-24 [1] CRAN (R 4.3.0)
 devtools                 2.4.5     2022-10-11 [1] CRAN (R 4.3.0)
 digest                   0.6.31    2022-12-11 [1] CRAN (R 4.3.0)
 dplyr                  * 1.1.0     2023-01-29 [1] CRAN (R 4.3.0)
 ellipsis                 0.3.2     2021-04-29 [1] CRAN (R 4.3.0)
 evaluate                 0.20      2023-01-17 [1] CRAN (R 4.3.0)
 fansi                    1.0.4     2023-01-22 [1] CRAN (R 4.3.0)
 fastmap                  1.1.1     2023-02-24 [1] CRAN (R 4.3.0)
 filelock                 1.0.2     2018-10-05 [1] CRAN (R 4.3.0)
 forcats                * 1.0.0     2023-01-29 [1] CRAN (R 4.3.0)
 fs                       1.6.1     2023-02-06 [1] CRAN (R 4.3.0)
 generics                 0.1.3     2022-07-05 [1] CRAN (R 4.3.0)
 GenomeInfoDb             1.35.15   2023-02-03 [1] Bioconductor
 GenomeInfoDbData         1.2.9     2022-11-04 [1] Bioconductor
 ggplot2                * 3.4.1     2023-02-10 [1] CRAN (R 4.3.0)
 glue                     1.6.2     2022-02-24 [1] CRAN (R 4.3.0)
 gtable                   0.3.1     2022-09-01 [1] CRAN (R 4.3.0)
 hms                      1.1.2     2022-08-19 [1] CRAN (R 4.3.0)
 htmltools                0.5.4     2022-12-07 [1] CRAN (R 4.3.0)
 htmlwidgets              1.6.1     2023-01-07 [1] CRAN (R 4.3.0)
 httpuv                   1.6.9     2023-02-14 [1] CRAN (R 4.3.0)
 httr                     1.4.5     2023-02-24 [1] CRAN (R 4.3.0)
 interactiveDisplayBase   1.37.0    2022-11-01 [1] Bioconductor
 IRanges                * 2.33.0    2022-11-01 [1] Bioconductor
 jsonlite                 1.8.4     2022-12-06 [1] CRAN (R 4.3.0)
 KEGGREST                 1.39.0    2022-11-01 [1] Bioconductor
 knitr                    1.42      2023-01-25 [1] CRAN (R 4.3.0)
 later                    1.3.0     2021-08-18 [1] CRAN (R 4.3.0)
 lifecycle                1.0.3     2022-10-07 [1] CRAN (R 4.3.0)
 lubridate              * 1.9.2     2023-02-10 [1] CRAN (R 4.3.0)
 magrittr                 2.0.3     2022-03-30 [1] CRAN (R 4.3.0)
 memoise                  2.0.1     2021-11-26 [1] CRAN (R 4.3.0)
 mime                     0.12      2021-09-28 [1] CRAN (R 4.3.0)
 miniUI                   0.1.1.1   2018-05-18 [1] CRAN (R 4.3.0)
 munsell                  0.5.0     2018-06-12 [1] CRAN (R 4.3.0)
 pillar                   1.8.1     2022-08-19 [1] CRAN (R 4.3.0)
 pkgbuild                 1.4.0     2022-11-27 [1] CRAN (R 4.3.0)
 pkgconfig                2.0.3     2019-09-22 [1] CRAN (R 4.3.0)
 pkgload                  1.3.2     2022-11-16 [1] CRAN (R 4.3.0)
 png                      0.1-8     2022-11-29 [1] CRAN (R 4.3.0)
 prettyunits              1.1.1     2020-01-24 [1] CRAN (R 4.3.0)
 processx                 3.8.0     2022-10-26 [1] CRAN (R 4.3.0)
 profvis                  0.3.7     2020-11-02 [1] CRAN (R 4.3.0)
 promises                 1.2.0.1   2021-02-11 [1] CRAN (R 4.3.0)
 ps                       1.7.2     2022-10-26 [1] CRAN (R 4.3.0)
 purrr                  * 1.0.1     2023-01-10 [1] CRAN (R 4.3.0)
 R6                       2.5.1     2021-08-19 [1] CRAN (R 4.3.0)
 rappdirs                 0.3.3     2021-01-31 [1] CRAN (R 4.3.0)
 Rcpp                     1.0.10    2023-01-22 [1] CRAN (R 4.3.0)
 RCurl                    1.98-1.10 2023-01-27 [1] CRAN (R 4.3.0)
 readr                  * 2.1.4     2023-02-10 [1] CRAN (R 4.3.0)
 remotes                  2.4.2     2021-11-30 [1] CRAN (R 4.3.0)
 rlang                    1.0.6     2022-09-24 [1] CRAN (R 4.3.0)
 rmarkdown                2.20      2023-01-19 [1] CRAN (R 4.3.0)
 RSQLite                  2.3.0     2023-02-17 [1] CRAN (R 4.3.0)
 S4Vectors              * 0.37.4    2023-02-26 [1] Bioconductor
 scales                   1.2.1     2022-08-20 [1] CRAN (R 4.3.0)
 sessioninfo              1.2.2     2021-12-06 [1] CRAN (R 4.3.0)
 shiny                    1.7.4     2022-12-15 [1] CRAN (R 4.3.0)
 stringi                  1.7.12    2023-01-11 [1] CRAN (R 4.3.0)
 stringr                * 1.5.0     2022-12-02 [1] CRAN (R 4.3.0)
 tibble                 * 3.1.8     2022-07-22 [1] CRAN (R 4.3.0)
 tidyr                  * 1.3.0     2023-01-24 [1] CRAN (R 4.3.0)
 tidyselect               1.2.0     2022-10-10 [1] CRAN (R 4.3.0)
 tidyverse              * 2.0.0     2023-02-22 [1] CRAN (R 4.3.0)
 timechange               0.2.0     2023-01-11 [1] CRAN (R 4.3.0)
 tzdb                     0.3.0     2022-03-28 [1] CRAN (R 4.3.0)
 urlchecker               1.0.1     2021-11-30 [1] CRAN (R 4.3.0)
 usethis                  2.1.6     2022-05-25 [1] CRAN (R 4.3.0)
 utf8                     1.2.3     2023-01-31 [1] CRAN (R 4.3.0)
 vctrs                    0.5.2     2023-01-23 [1] CRAN (R 4.3.0)
 withr                    2.5.0     2022-03-03 [1] CRAN (R 4.3.0)
 xfun                     0.37      2023-01-31 [1] CRAN (R 4.3.0)
 xtable                   1.8-4     2019-04-21 [1] CRAN (R 4.3.0)
 XVector                  0.39.0    2022-12-20 [1] Bioconductor
 yaml                     2.3.7     2023-01-23 [1] CRAN (R 4.3.0)
 zlibbioc                 1.45.0    2022-12-20 [1] Bioconductor

 [1] /Users/jacques/Library/R/arm64/4.3/library
 [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────