Process pancreas scRNAseq datasets from different technologies.
Merge the datasets and analyse them without batch correction.
Use Seurat to perform batch correction using canonical correlation analysis (CCA) and mutual nearest neighbors (MNN).
Use LIGER to perform batch correction using integrative non-negative matrix factorization.
In this lab, we will look at different single cell RNA-seq datasets collected from pancreatic islets. We will look at how different batch correction methods affect our data analysis.
11.1 Read in pancreas expression matrices
Four different datasets are provided in the ~/Share/batch_correction/ directory. These datasets were collected using different single cell RNA-seq technologies.
Question
Import the four datasets into R. What is the size and sparsity of each dataset?
'SeuratObject' was built under R 4.4.0 but the current version is 4.4.1; it is recomended that you reinstall 'SeuratObject' as the ABI for R may have changed
'SeuratObject' was built with package 'Matrix' 1.7.0 but the current version is 1.7.1; it is recomended that you reinstall 'SeuratObject' as the ABI for 'Matrix' may have changed
Attaching package: 'SeuratObject'
The following objects are masked from 'package:base':
intersect, t
Warning: Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.
Now we will subsample each dataset to only have 500 cells. This is to speed up subsequent analyses. Often when we are setting up our analysis, we work with a subset of the data to make iteration across analysis decisions faster, and once we have finalized how we want to do the analysis, we work with the full dataset.
Question
Within metadata dataframe, save what technology each dataset was generated with.
Randomly subsample each full dataset to 500 cells.
Answer
R
celseq[["tech"]]<-"celseq"celseq2[["tech"]]<-"celseq2"fluidigmc1[["tech"]]<-"fluidigmc1"smartseq2[["tech"]]<-"smartseq2"Idents(celseq)<-"tech"Idents(celseq2)<-"tech"Idents(fluidigmc1)<-"tech"Idents(smartseq2)<-"tech"# This code sub-samples the data in order to speed up calculations and not use too much memory.celseq<-subset(celseq, downsample =500, seed =1)celseq2<-subset(celseq2, downsample =500, seed =1)fluidigmc1<-subset(fluidigmc1, downsample =500, seed =1)smartseq2<-subset(smartseq2, downsample =500, seed =1)
11.3 Cluster pancreatic datasets without batch correction
We will first merge all the cells from the four different experiments together, and cluster all the pancreatic islet datasets to see whether there is a batch effect.
Question
Merge the datasets into a single Seurat object.
Answer
R
# Merge Seurat objects. Original sample identities are stored in gcdata[["tech"]].# Cell names will now have the format tech_cellID (smartseq2_cell1...)add.cell.ids<-c("celseq", "celseq2", "fluidigmc1", "smartseq2")gcdata<-merge(x =celseq, y =list(celseq2, fluidigmc1, smartseq2), add.cell.ids =add.cell.ids, merge.data =FALSE)# Examine gcdata. Notice that there are now 4 counts layers, one layer for each technology.gcdata
An object of class Seurat
33633 features across 2000 samples within 1 assay
Active assay: RNA (33633 features, 0 variable features)
4 layers present: counts.1, counts.2, counts.3, counts.4
Warning: Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.
After merging, we normalize the data with LogNormalize method, and identify the 2000 most variable genes for each dataset, using the vst approach. Then we scale the data using ScaleData.
Question
What is the difference between SelectIntegrationFeatures and FindVariableFeatures in Seurat? Take a look at the documentation for both functions.
Run NormalizeData, FindVariableFeatures, and ScaleData on the merged dataset.
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 2000
Number of edges: 59992
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9181
Number of communities: 16
Elapsed time: 0 seconds
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
23:08:01 UMAP embedding parameters a = 0.583 b = 1.334
23:08:01 Read 2000 rows and found 20 numeric columns
23:08:01 Using FNN for neighbor search, n_neighbors = 15
23:08:01 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 15
23:08:01 Initializing from normalized Laplacian + noise (using RSpectra)
23:08:01 Commencing optimization for 500 epochs, with 39324 positive edges
23:08:03 Optimization finished
# Visualize the Leiden clustering and the batches on the UMAP. # Remember, the clustering is stored in @meta.data in column seurat_clusters and the technology is# stored in the column tech. Remember you can also use DimPlotDimPlot(gcdata, reduction ="umap", group.by ="seurat_clusters")
Are you surprised by the results? Compare to your expectations from the PC biplot of PC1 vs PC2.
What explains these results?
We can assess the quality of the clustering by calculating the adjusted rand index (ARI) between the technology and the cluster labels. This goes between 0 (completely dissimilar clustering) to 1 (identical clustering). The adjustment corrects for chance grouping between cluster elements.
We will now use Seurat to see to what extent it can remove potential batch effects.
The first piece of code will identify variable genes that are highly variable in at least 2/4 datasets. We will use these variable genes in our batch correction.
Why would we implement such a requirement?
Question
The integration workflow in Seurat requires the identification of variable genes that are variable across most samples. Use the IntegrateLayers function, with the CCAIntegration method, to identify anchors on the 4 pancreatic islet datasets, commonly shared variable genes across samples, and integrate samples.
Now that the integration anchors have been used to integrate multiple datasets, we will perform the same clustering analysis as before.
Question
Perform analysis in the new integrated.cca dimensional reduction space.
Cluster the cells.
Perform UMAP embedding and visualize clustering results in 2D.
Answer
R
# Re-join the four layers after integration.gcdata[["RNA"]]<-JoinLayers(gcdata[["RNA"]])# Clusteringgcdata<-FindNeighbors(gcdata, reduction ="integrated.cca", dims =1:20, k.param =20)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 2000
Number of edges: 85037
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8427
Number of communities: 9
Elapsed time: 0 seconds
23:08:16 UMAP embedding parameters a = 0.583 b = 1.334
23:08:16 Read 2000 rows and found 30 numeric columns
23:08:16 Using FNN for neighbor search, n_neighbors = 15
23:08:16 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 15
23:08:16 Initializing from normalized Laplacian + noise (using RSpectra)
23:08:16 Commencing optimization for 500 epochs, with 44152 positive edges
23:08:18 Optimization finished
# Visualize the Louvain clustering and the batches on the UMAP. p1<-DimPlot(gcdata, reduction ="umap", group.by ="seurat_clusters")p2<-DimPlot(gcdata, reduction ="umap", group.by ="tech")p1+p2
# Individually Visualize each technology dataset.DimPlot(gcdata, reduction ="umap", split.by ="tech")
Let’s look again to see how the adjusted rand index changed compared to using no batch correction.
R
ari<-dplyr::select(gcdata[[]], tech, seurat_clusters)ari$tech<-plyr::mapvalues(ari$tech, from =c("celseq", "celseq2", "fluidigmc1", "smartseq2"), to =c(0, 1, 2, 3))adj.rand.index(as.numeric(ari$tech), as.numeric(ari$seurat_clusters))
[1] 0.29663
11.4.1 Differential gene expression and visualization
We can also identify conserved marker genes across the batches. Differential gene expression is done across each batch, and the p-values are combined. (requires metap package installation). Note that we use the original expression data in all visualization and DE tests. We should never use values from the integrated data in DE tests, as they violate assumptions of independence among samples that are required for DE tests!
Testing group celseq: (0) vs (3, 2, 1, 4, 5, 7, 6, 8)
For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
(default method for FindMarkers) please install the presto package
--------------------------------------------
install.packages('devtools')
devtools::install_github('immunogenomics/presto')
--------------------------------------------
After installation of presto, Seurat will automatically use the more
efficient implementation (no further action necessary).
This message will be shown once per session
Testing group celseq2: (0) vs (2, 3, 6, 7, 1, 4, 5, 8)
Testing group fluidigmc1: (0) vs (2, 1, 4, 7, 3, 6, 5, 8)
Testing group smartseq2: (0) vs (1, 5, 4, 6, 3, 2, 7)
Here we use integrative non-negative matrix factorization (NMF/LIGER) to see to what extent it can remove potential batch effects.
The important parameters in the batch correction are the number of factors (k), the penalty parameter (lambda), and the clustering resolution. The number of factors sets the number of factors (consisting of shared and dataset-specific factors) used in factorizing the matrix. The penalty parameter sets the balance between factors shared across the batches and factors specific to the individual batches. The default setting of lambda=5.0 is usually used by the Macosko lab. Resolution=1.0 is used in the Louvain clustering of the shared neighbor factors that have been quantile normalized.
R
ob.list<-list("celseq"=celseq, "celseq2"=celseq2, "fluidigmc1"=fluidigmc1, "smartseq2"=smartseq2)# Create a LIGER object with raw counts data from each batch.library(rliger)
Package `rliger` has been updated massively since version 1.99.0, including the object structure which is not compatible with old versions.
We recommand you backup your old analysis before overwriting any existing result.
`readLiger()` is provided for reading an RDS file storing an old object and it converts the object to the up-to-date structure.
# Find variable genes for LIGER analysis. Identify variable genes that are variable across most samples.var.genes<-SelectIntegrationFeatures(ob.list, nfeatures =2000, verbose =TRUE, fvf.nfeatures =2000, selection.method ="vst")
No variable features found for object1 in the object.list. Running FindVariableFeatures ...
Finding variable features for layer counts
No variable features found for object2 in the object.list. Running FindVariableFeatures ...
Finding variable features for layer counts
No variable features found for object3 in the object.list. Running FindVariableFeatures ...
Finding variable features for layer counts
No variable features found for object4 in the object.list. Running FindVariableFeatures ...
Finding variable features for layer counts
Finally, let’s look at the adjusted rand index between the clusters and technology, and also the genes that load the factors used in data integration.
# Let's look to see how the adjusted rand index changed compared to using no batch correction.tech<-cellMeta(data.liger)$datasetclusters<-cellMeta(data.liger)$leiden_clusterari<-data.frame("tech"=tech, "clusters"=clusters)ari$tech<-plyr::mapvalues(ari$tech, from =c("celseq", "celseq2", "fluidigmc1", "smartseq2"), to =c(0, 1, 2, 3))adj.rand.index(as.numeric(ari$tech), as.numeric(ari$clusters))# Identify shared and batch-specific marker genes from liger factorization.# Use the getFactorMarkers function and choose 2 datasets.# Then plot some genes of interest using plotGene functions.markers<-getFactorMarkers(data.liger, dataset1 ="celseq2", dataset2 ="smartseq2")plotGeneDimRed(data.liger, features ="INS")# Look at factor loadings in factor 9, celseq2 vs smartseq2.plotGeneLoadings(data.liger, markerTable =markers, useFactor =9)
11.6 Additional exploration
11.6.1 Regressing out unwanted covariates
Learn how to regress out different technical covariates (number of UMIs, number of genes, percent mitochondrial reads) by studying Seurat’s PBMC tutorial and the ScaleData() function.
11.6.2 kBET
Within your RStudio session, install k-nearest neighbour batch effect test and learn how to use its functionality to quantify batch effects in the pancreatic data.