9 Lab 5: Dimension reduction, clustering and annotation
9.1 Dimensional reduction for clustering
9.1.1 Preparing dataset
We will prepare scRNAseq data from a PBMC run, provided by 10X and hosted by Bioconductor as a package.
- Which package from
Bioconductorgives streamlined access to PBMC scRNAseq dataset from 10X Genomics? - What does the object contain (type of data, number of cells, batches, organism, …)? Can you get the same data from somewhere else?
9.1.2 Normalize counts using scran
Just like in bulk high-throughput sequencing experiments, scRNAseq counts have to be normalized to the sequencing depth for each cell. We can define the library size as the total sum of counts across all genes for each cell, the expected value of which is assumed to scale with any cell-specific biases. However, this relies on the assumption that within the entire dataset, most genes are non-differentially expressed and expressed roughly within the same range. Depending on the set up of the scRNAseq experiment, this can be entirely false. To avoid relying on this hypothesis, we can (1) quickly pre-cluster cells, then (2) normalize cells using their library size factor separately in each cluster, then (3) rescaling size factors so that they are comparable across clusters.
All of this can be done very simply using the combo quickCluster() + computeSumFactors() + logNormCounts() from scran/scuttle packages.
9.1.3 Feature selection
We often use scRNAseq data in exploratory analyses to characterize heterogeneity across cells. Procedures like clustering and dimensionality reduction compare cells based on their gene expression profiles. The choice of genes to include in this comparison may have a major impact on the performance of downstream methods. Ideally, one wants to only select genes that contain useful information about the biology of the system while removing genes that contain random noise. This aims to preserve interesting biological structure without the variance that obscures that structure.
The simplest approach to feature selection is to compute the variance of the log-normalized expression values, to select the most variable genes. Modelling of the mean-variance relationship can be achieved by the modelGeneVar() function from the scran package.
R
sce_filtered_variance <- scran::modelGeneVar(sce)
HVGs <- scran::getTopHVGs(sce_filtered_variance, prop = 0.1)
rowData(sce)$isHVG <- rownames(sce) %in% HVGs
head(rowData(sce))
table(rowData(sce)$isHVG)
## --- Visualizing the mean-variance fit
df <- tibble(
mean = metadata(sce_filtered_variance)$mean,
var = metadata(sce_filtered_variance)$var,
trend = metadata(sce_filtered_variance)$trend(mean),
HVG = rowData(sce)$isHVG
)
ggplot(df) +
geom_point(aes(x = mean, y = var, col = HVG), alpha = 0.4) +
geom_line(aes(x = mean, y = trend), col = 'darkred') +
theme_minimal() +
labs(x = 'Gene mean exp. (norm.)', y = 'Gene exp. variance')9.1.4 PCA on filtered dataset
We now have normalized counts filtered for the top 500 genes varying with the greatest biological significance.
Still, that represents a 500 x nCells (~8,000) dataset (each row being a feature). This is still too big to reliably use in standard clustering approaches. We can further compress the dataset. The most widely used approach is PCA: it computes a small number of “components” (typically 5-50) optimally summarizing the variability of the whole dataset, while retaining linearity of the underlying numerical data and being computationallt quite efficient.
- Leverage
scaterpackage to compute aPCAembedding of the filtered data by taking into account the technical variability.
R
sce <- scran::denoisePCA(
sce,
technical = sce_filtered_variance,
subset.row = HVGs,
min.rank = 15
)
p <- scater::plotReducedDim(sce, 'PCA', colour_by = 'sizeFactor') + ggtitle('denoised PCA')
p9.2 Clustering
Clustering is an unsupervised learning procedure that is used in scRNA-seq data analysis to empirically define groups of cells with similar expression profiles. Its primary purpose is to summarize the data in a digestible format for human interpretation.
After annotation based on marker genes, the clusters can be treated as proxies for more abstract biological concepts such as cell types or states. Clustering is thus a critical step for extracting biological insights from scRNA-seq data.
9.2.1 Clustering algorithms
Three main approaches can be used:
- Hierarchical clustering
- k-means clustering
- Graph-based clustering
Today, we will focus on graph-based clustering, as it is becoming the standard for scRNAseq: it is a flexible and scalable technique for clustering even the largest scRNA-seq datasets. We first build a graph where each node is a cell that is connected by edges to its nearest neighbors in the high-dimensional space. Edges are weighted based on the similarity between the cells involved, with higher weight given to cells that are more closely related.
- Compute graph-based clustering of the PBMC dataset.
- What are the main parameters to choose? How do they impact the clustering?
R
graph2 <- scran::buildSNNGraph(
sce,
k = 50,
use.dimred = 'PCA'
)
sce_clust2 <- igraph::cluster_louvain(graph2)$membership
table(sce_clust, sce_clust2)9.2.2 Dimensional reduction for clustering visualization
PCA is a powerful linear approach to compress large datasets into smaller dimensional spaces. However, it struggles at emphasizing the existence of clusters in complex datasets, when visualized in 2D.
scater provides a handy way to perform more complex data embeddings:
- tSNE
- UMAP
- Diffusion Map
- Multi-Dimensional Scaling (MDS)
- Non-negative Matrix Factorization (NMF)
- Explore the different dimensional reduction algorithms, trying different hyperparameters combinations.
When you run these commands, pay attention to how long each command takes to run! While this run, check the Help page for each function (e.g. ?runTSNE)
R
reducedDims(sce)
sce <- scater::runTSNE(sce, subset_row = HVGs)
sce <- scater::runUMAP(sce, subset_row = HVGs)
reducedDims(sce)
reducedDim(sce, 'UMAP')[1:10, ]- Use the
scater::plotReducedDim()function to plot cells in each embedding. Comment.
9.2.3 For the pros of clustering… Compare different clustering approaches
Leveraging the bluster package, different clustering approaches can be performed using a uniformed syntax, to compare their output.
- Using
clusterSweep(), compare the effect of differentkneighbor values when performing graph-based clustering.
R
clusters <- bluster::clusterSweep(
reducedDim(sce, 'PCA'),
BLUSPARAM = bluster::SNNGraphParam(),
k = c(5L, 15L, 25L, 50L),
cluster.fun = c("louvain")
)
colnames(clusters$clusters)
head(clusters$clusters)
clusters$parameters
library(ggraph)
p <- cowplot::plot_grid(
clustree::clustree(
clusters$clusters %>% setNames(1:ncol(.)) %>% as.data.frame(),
prefix = 'X',
edge_arrow=FALSE
),
cowplot::plot_grid(
scater::plotReducedDim(sce, 'TSNE', colour_by = I(clusters$clusters[, 'k.5_cluster.fun.louvain'])) + ggtitle('k = 5'),
scater::plotReducedDim(sce, 'TSNE', colour_by = I(clusters$clusters[, 'k.15_cluster.fun.louvain'])) + ggtitle('k = 15'),
scater::plotReducedDim(sce, 'TSNE', colour_by = I(clusters$clusters[, 'k.25_cluster.fun.louvain'])) + ggtitle('k = 25'),
scater::plotReducedDim(sce, 'TSNE', colour_by = I(clusters$clusters[, 'k.50_cluster.fun.louvain'])) + ggtitle('k = 50')
),
nrow = 2,
rel_heights = c(0.3, 0.7)
)
table(clusters$clusters[, 'k.5_cluster.fun.louvain'])9.3 Cell annotation
9.3.1 Find marker genes
To interpret clustering results, one needs to identify the genes that drive separation between clusters. These marker genes allow to assign biological meaning to each cluster based on their functional annotation. In the most obvious case, the marker genes for each cluster are a priori associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity.
A general strategy is to perform DE tests between pairs of clusters and then combine results into a single ranking of marker genes for each cluster.
R
markers <- scran::findMarkers(sce, groups = sce$clusters_graph)- Find markers strongly overexpressed in each cluster. Check
?scran::findMarkersto find the right options to use.
- Plot average expression of the first marker of the first cluster in UMAP
R
p <- scater::plotReducedDim(sce, 'TSNE', colour_by = markers[[2]][[1]])9.3.2 Automated cell annotation
Many cell type reference databases are available over the Internet. Today, we will use a reference constructed from Blueprint and ENCODE data (Martens and Stunnenberg 2013; The ENCODE Project Consortium 2012). This reference is available as a SummarizedExperiment containing log-normalized gene expression for manually annotated samples.
- Using
scaterandSingleRutilities, visually compare the annotation scores for cells in each cluster. - Did the automated annotation work robuslty? How does it compare to our clustering? Is automated annotation as sensitive as graph-based clustering?
R
p <- SingleR::plotScoreHeatmap(prediction_types)
p <- scater::plotReducedDim(sce, 'TSNE', colour_by = 'annotation') + ggtitle('Automated annotation')
p <- pheatmap::pheatmap(
log2(table(Annotation = sce$annotation, Cluster = sce$clusters_graph)+10),
color = colorRampPalette(c("white", "darkred"))(101)
)9.4 Bonus
Try to fill in the analysis template in bin/prepare_Ernst.R to execute the different processing/analysis steps we covered in the previous exercises and this one. If you prefer using Seurat, don’t hesitate to modify the base template!
9.5 Acknowledgements
This exercise was adapted from Chapts. 7-12 of Orchestrating Single-Cell Analysis with Bioconductor.
9.6 Session info
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.3.0 (2023-04-21)
os macOS Monterey 12.5.1
system aarch64, darwin20
ui X11
language (EN)
collate en_GB.UTF-8
ctype en_GB.UTF-8
tz Europe/Paris
date 2023-06-08
pandoc 2.19.2 @ /opt/homebrew/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
AnnotationDbi * 1.61.0 2022-11-01 [1] Bioconductor
AnnotationHub * 3.7.3 2023-03-01 [1] Bioconductor
Biobase * 2.59.0 2022-11-01 [1] Bioconductor
BiocFileCache * 2.7.2 2023-02-17 [1] Bioconductor
BiocGenerics * 0.45.0 2022-11-01 [1] Bioconductor
BiocManager * 1.30.20 2023-02-24 [1] CRAN (R 4.3.0)
BiocVersion 3.17.1 2022-12-20 [1] Bioconductor
Biostrings 2.67.0 2022-11-01 [1] Bioconductor
bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)
bitops 1.0-7 2021-04-24 [1] CRAN (R 4.3.0)
blob 1.2.3 2022-04-10 [1] CRAN (R 4.3.0)
cachem 1.0.7 2023-02-24 [1] CRAN (R 4.3.0)
callr 3.7.3 2022-11-02 [1] CRAN (R 4.3.0)
cli 3.6.0 2023-01-09 [1] CRAN (R 4.3.0)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.0)
cowplot * 1.1.1 2020-12-30 [1] CRAN (R 4.3.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)
curl 5.0.0 2023-01-12 [1] CRAN (R 4.3.0)
DBI 1.1.3 2022-06-18 [1] CRAN (R 4.3.0)
dbplyr * 2.3.1 2023-02-24 [1] CRAN (R 4.3.0)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.3.0)
digest 0.6.31 2022-12-11 [1] CRAN (R 4.3.0)
dplyr * 1.1.0 2023-01-29 [1] CRAN (R 4.3.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.3.0)
evaluate 0.20 2023-01-17 [1] CRAN (R 4.3.0)
fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
filelock 1.0.2 2018-10-05 [1] CRAN (R 4.3.0)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.0)
fs 1.6.1 2023-02-06 [1] CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.0)
GenomeInfoDb 1.35.15 2023-02-03 [1] Bioconductor
GenomeInfoDbData 1.2.9 2022-11-04 [1] Bioconductor
ggplot2 * 3.4.1 2023-02-10 [1] CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
gtable 0.3.1 2022-09-01 [1] CRAN (R 4.3.0)
hms 1.1.2 2022-08-19 [1] CRAN (R 4.3.0)
htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.1 2023-01-07 [1] CRAN (R 4.3.0)
httpuv 1.6.9 2023-02-14 [1] CRAN (R 4.3.0)
httr 1.4.5 2023-02-24 [1] CRAN (R 4.3.0)
interactiveDisplayBase 1.37.0 2022-11-01 [1] Bioconductor
IRanges * 2.33.0 2022-11-01 [1] Bioconductor
jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.3.0)
KEGGREST 1.39.0 2022-11-01 [1] Bioconductor
knitr 1.42 2023-01-25 [1] CRAN (R 4.3.0)
later 1.3.0 2021-08-18 [1] CRAN (R 4.3.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.3.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.3.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.0)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.0)
pillar 1.8.1 2022-08-19 [1] CRAN (R 4.3.0)
pkgbuild 1.4.0 2022-11-27 [1] CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
pkgload 1.3.2 2022-11-16 [1] CRAN (R 4.3.0)
png 0.1-8 2022-11-29 [1] CRAN (R 4.3.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.3.0)
processx 3.8.0 2022-10-26 [1] CRAN (R 4.3.0)
profvis 0.3.7 2020-11-02 [1] CRAN (R 4.3.0)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.3.0)
ps 1.7.2 2022-10-26 [1] CRAN (R 4.3.0)
purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.3.0)
Rcpp 1.0.10 2023-01-22 [1] CRAN (R 4.3.0)
RCurl 1.98-1.10 2023-01-27 [1] CRAN (R 4.3.0)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.3.0)
rlang 1.0.6 2022-09-24 [1] CRAN (R 4.3.0)
rmarkdown 2.20 2023-01-19 [1] CRAN (R 4.3.0)
RSQLite 2.3.0 2023-02-17 [1] CRAN (R 4.3.0)
S4Vectors * 0.37.4 2023-02-26 [1] Bioconductor
scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
shiny 1.7.4 2022-12-15 [1] CRAN (R 4.3.0)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)
tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.3.0)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.0)
timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.0)
tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.3.0)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.0)
usethis 2.1.6 2022-05-25 [1] CRAN (R 4.3.0)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)
vctrs 0.5.2 2023-01-23 [1] CRAN (R 4.3.0)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)
xfun 0.37 2023-01-31 [1] CRAN (R 4.3.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.0)
XVector 0.39.0 2022-12-20 [1] Bioconductor
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
zlibbioc 1.45.0 2022-12-20 [1] Bioconductor
[1] /Users/jacques/Library/R/arm64/4.3/library
[2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────