Process pancreas scRNAseq datasets from different technologies.
Merge the datasets and analyse them without batch correction.
Use Seurat to perform batch correction using canonical correlation analysis (CCA) and mutual nearest neighbors (MNN).
Use LIGER to perform batch correction using integrative non-negative matrix factorization.
In this lab, we will look at different single cell RNA-seq datasets collected from pancreatic islets. We will look at how different batch correction methods affect our data analysis.
11.1 Read in pancreas expression matrices
Four different datasets are provided in the ~/Share/batch_correction/ directory. These datasets were collected using different single cell RNA-seq technologies.
Question
Import the four datasets into R. What is the size and sparsity of each dataset?
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: Seqinfo
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
Now we will subsample each dataset to only have 500 cells. This is to speed up subsequent analyses. Often when we are setting up our analysis, we work with a subset of the data to make iteration across analysis decisions faster, and once we have finalized how we want to do the analysis, we work with the full dataset.
Question
Within metadata dataframe, save what technology each dataset was generated with.
Randomly subsample each full dataset to 500 cells.
Like what we have done in previous labs, we will now normalize, identify variable genes, and perform PCA embedding.
R
process_batch<-function(sce){clusters<-scran::quickCluster(sce)sce<-scran::computeSumFactors(sce, clusters =clusters)sce<-scuttle::logNormCounts(sce)dec<-scran::modelGeneVar(sce)hvgs<-scran::getTopHVGs(dec, n =2000)rowData(sce)$highly_variable<-rownames(sce)%in%hvgssce<-scater::runPCA(sce, subset_row =hvgs, ncomponents =50)return(sce)}celseq<-process_batch(celseq)celseq2<-process_batch(celseq2)fluidigmc1<-process_batch(fluidigmc1)smartseq2<-process_batch(smartseq2)
11.4 Batch correction with MNN
We will now use MNN to see to what extent it can remove potential batch effects. The integration workflow for SingleCellExperiment objects requires the identification of shared variable genes.
Here we use integrative non-negative matrix factorization (NMF/LIGER) to see to what extent it can remove potential batch effects.
The important parameters in the batch correction are the number of factors (k), the penalty parameter (lambda), and the clustering resolution. The number of factors sets the number of factors (consisting of shared and dataset-specific factors) used in factorizing the matrix. The penalty parameter sets the balance between factors shared across the batches and factors specific to the individual batches. The default setting of lambda=5.0 is usually used by the Macosko lab. Resolution=1.0 is used in the Louvain clustering of the shared neighbor factors that have been quantile normalized.
R
ob.list<-list("celseq"=celseq_raw, "celseq2"=celseq2_raw, "fluidigmc1"=fluidigmc1_raw, "smartseq2"=smartseq2_raw)# Create a LIGER object with raw counts data from each batch.library(rliger)data.liger<-createLiger(lapply(ob.list, function(x)as.matrix(assay(x, "counts"))))# Normalize gene expression for each batch.data.liger<-rliger::normalize(data.liger)# Find variable genes for LIGER analysis. Identify variable genes that are variable across most samples.data.liger<-selectGenes(data.liger)
Clustering can now be performed over the integrated data (using merged_mnn here). Here we use the Louvain clustering algorithm.
Then, differential gene expression is done across each batch, and the p-values are combined. (requires metap package installation). Note that we use the original expression data in all visualization and DE tests. We should never use values from the integrated data in DE tests, as they violate assumptions of independence among samples that are required for DE tests!
the design formula contains one or more numeric variables with integer values,
specifying a model with increasing fold change for higher values.
did you mean for this to be a factor? if so, first convert
this variable to a factor using the factor() function