Dimensionality reduction compares cells based on their gene expression profiles. The choice of genes to include in this comparison may have a major impact on the performance of downstream methods. Ideally, we want to select only genes that contain useful information about the biology of the system, while removing genes that contain random noise. This aims to preserve the interesting biological structure without the variance that obscures it.
The simplest approach to feature selection is to compute the variance of the log-normalized expression values and select the most variable genes. Modelling of the mean-variance relationship is provided by the modelGeneVar() function from the scran package.
Question
Model gene variance ~ gene average expression. What is the range of biological variance and technical variance?
Answer
# Fit the gene variance as a function of the gene mean expression
pbmc_variance <- scran::modelGeneVar(pbmc)
pbmc_variance
quantile(pbmc_variance$bio, seq(0, 1, 0.1))
quantile(pbmc_variance$tech, seq(0, 1, 0.1))

# Visualizing the mean-variance fit
require(tidyverse)
df <- tibble(
    mean = metadata(pbmc_variance)$mean,
    var = metadata(pbmc_variance)$var,
    trend = metadata(pbmc_variance)$trend(mean)
)
p <- ggplot(df) +
    geom_point(aes(x = mean, y = var), alpha = 0.4) +
    geom_line(aes(x = mean, y = trend), col = 'darkred') +
    theme_minimal() +
    labs(x = 'Gene mean exp. (norm.)', y = 'Gene exp. variance')
Question
Extract the 20% of genes with the highest biological variance.
Plot gene variance ~ gene average expression, coloring genes that are flagged as HVGs.
Answer
# Select the 20% of genes with the highest biological variance
HVGs <- scran::getTopHVGs(pbmc_variance, prop = 0.2)
rowData(pbmc)$isHVG <- rownames(pbmc) %in% HVGs
head(rowData(pbmc))
table(rowData(pbmc)$isHVG)

# Visualizing the mean-variance fit, coloring HVGs
df <- tibble(
    mean = metadata(pbmc_variance)$mean,
    var = metadata(pbmc_variance)$var,
    trend = metadata(pbmc_variance)$trend(mean),
    HVG = rowData(pbmc)$isHVG
)
p <- ggplot(df) +
    geom_point(aes(x = mean, y = var, col = HVG), alpha = 0.4) +
    geom_line(aes(x = mean, y = trend), col = 'darkred') +
    theme_minimal() +
    labs(x = 'Gene mean exp. (norm.)', y = 'Gene exp. variance')
Embedding in a lower dimensional linear space
We now have normalized counts filtered to the top 20% of genes with the greatest biological variance.
Still, that represents a dataset of roughly 1,000 genes × 4,000 cells, which is too large to use reliably with standard clustering approaches. We can compress the dataset further. The most widely used approach is PCA: it computes a small number of “components” (typically 5-50) that optimally summarize the variability of the whole dataset, while retaining the linearity of the underlying numerical data and being computationally quite efficient.
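As a point of reference for the question below, here is a minimal sketch of a standard PCA run with scater::runPCA(), restricted to the HVGs selected above (keeping 50 components is an arbitrary, illustrative choice):

# Standard PCA on the log-normalized counts, restricted to the selected HVGs
library(scater)
pbmc <- scater::runPCA(pbmc, ncomponents = 50, subset_row = HVGs)

# The embedding is stored in the reducedDims slot of the SingleCellExperiment
reducedDimNames(pbmc)
dim(reducedDim(pbmc, 'PCA'))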
Question
Read the scran::denoisePCA() documentation. What is the benefit of this function compared to scater::runPCA()?
Leverage the scran package to compute a PCA embedding of the filtered data that takes the technical variability into account.
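One possible answer (a sketch, assuming the pbmc_variance model and the HVGs vector computed during feature selection above):

# denoisePCA() chooses how many PCs to retain so that the discarded variance
# matches the estimated technical noise, instead of using a fixed number of PCs
pbmc <- scran::denoisePCA(pbmc, technical = pbmc_variance, subset.row = HVGs)

# Number of PCs retained (data-driven rather than fixed in advance)
ncol(reducedDim(pbmc, 'PCA'))

# Quick look at the cells in the first two PCs
scater::plotReducedDim(pbmc, 'PCA')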
Clustering is an unsupervised learning procedure used in scRNA-seq data analysis to empirically define groups of cells with similar expression profiles. Its primary purpose is to summarize the data in a digestible format for human interpretation.
After annotation based on marker genes, the clusters can be treated as proxies for more abstract biological concepts such as cell types or states. Clustering is thus a critical step for extracting biological insights from scRNA-seq data.
Clustering algorithms
Three main approaches can be used:
Hierarchical clustering
k-means clustering
Graph-based clustering
Today, we will focus on graph-based clustering, as it is becoming the standard for scRNA-seq: it is a flexible and scalable technique that can cluster even the largest scRNA-seq datasets. We first build a graph in which each node is a cell, connected by edges to its nearest neighbors in the high-dimensional space. Edges are weighted based on the similarity between the cells involved, with higher weights given to cells that are more closely related.
Question
Compute graph-based clustering of the PBMC dataset.
What are the main parameters to choose? How do they impact the clustering?
Try a non-default value for the k argument. What is the impact on the clustering?
Answer
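The answer below compares a new clustering against an initial one; a sketch of that first step (assuming the PCA embedding computed above and the default k = 10) could be:

# Build a shared nearest-neighbour (SNN) graph in PCA space (default k = 10)
graph <- scran::buildSNNGraph(pbmc, k = 10, use.dimred = 'PCA')

# Detect communities with the Louvain algorithm and store them as a factor
pbmc_clust <- igraph::cluster_louvain(graph)$membership
pbmc$clusters_graph <- factor(pbmc_clust)
table(pbmc$clusters_graph)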
# Re-compute a graph changing the `k` parameter, and identify resulting clusters
graph2 <- scran::buildSNNGraph(pbmc, k = 50, use.dimred = 'PCA')
pbmc_clust2 <- igraph::cluster_louvain(graph2)$membership
pbmc$clusters_graph_2 <- factor(pbmc_clust2)

# Compare original and new clusters
table(pbmc_clust, pbmc_clust2)

# Visually compare original and new clusters
p <- cowplot::plot_grid(
    scater::plotReducedDim(pbmc, 'PCA', colour_by = 'clusters_graph', text_by = 'clusters_graph') +
        ggtitle('SNN-graph clustering (louvain), k = 10'),
    scater::plotReducedDim(pbmc, 'PCA', colour_by = 'clusters_graph_2', text_by = 'clusters_graph_2') +
        ggtitle('SNN-graph clustering (louvain), k = 50')
)
Dimensional reduction for clustering visualization
PCA is a powerful linear approach to compress large datasets into smaller dimensional spaces. However, when visualized in 2D, it often struggles to emphasize the existence of clusters in complex datasets.
scater provides a handy way to compute more complex, non-linear embeddings such as t-SNE and UMAP:
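For example (a sketch: runTSNE() and runUMAP() are scater functions, and the calls here, computed on top of the existing PCA space, are illustrative):

# Non-linear embeddings computed from the existing PCA space
pbmc <- scater::runTSNE(pbmc, dimred = 'PCA')
pbmc <- scater::runUMAP(pbmc, dimred = 'PCA')

# Visualize the graph-based clusters on the UMAP embedding
scater::plotReducedDim(pbmc, 'UMAP', colour_by = 'clusters_graph', text_by = 'clusters_graph')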