Exercises: scRNAseq analysis with R/Bioconductor (3/3)
Goals:
Perform differential expression to suggest preliminary cell type annotations
Perform automated cell annotation using public reference datasets
Attempt scRNAseq sub-clustering to better resolve single cell heterogeneity
0. Pre-processing PBMC dataset
During the previous day, the homeworks focused on processing 4K PBMC dataset to obtain main cell clusters. Here are the main commands to process this dataset.
set.seed(1000)# Importing 4K PBMC data from 10X Genomics in Rpbmc<-TENxPBMCData::TENxPBMCData('pbmc4k')rownames(pbmc)<-scuttle::uniquifyFeatureNames(rowData(pbmc)$ENSEMBL_ID, rowData(pbmc)$Symbol_TENx)# Remove doubletspbmc<-scDblFinder::scDblFinder(pbmc)pbmc<-pbmc[, pbmc$scDblFinder.class=='singlet']# Recover human genomic, protein-coding gene informationslibrary(plyranges)ah<-AnnotationHub::AnnotationHub()AnnotationHub::query(ah, c('gene annotation', 'ensembl', '102', 'homo_sapiens', 'GRCh38'))gtf<-AnnotationHub::query(ah, c('Homo_sapiens.GRCh38.102.chr.gtf'))[[1]]genes<-gtf%>%filter(type=='gene')%>%filter(gene_biotype=='protein_coding')%>%filter(gene_source=='ensembl_havana')# Annotate genes in PBMC dataset and filter out non-relevant genespbmc<-pbmc[genes$gene_name[genes$gene_name%in%rownames(pbmc)], ]rowRanges(pbmc)<-genes[match(rownames(pbmc), genes$gene_name)]rowData(pbmc)<-rowData(pbmc)[, c('gene_name', 'gene_id')]rownames(pbmc)<-scuttle::uniquifyFeatureNames(rowData(pbmc)$gene_id, rowData(pbmc)$gene_name)# Get preliminary QCs per cell and per genepbmc<-scuttle::addPerCellQCMetrics(pbmc)pbmc<-scuttle::addPerFeatureQCMetrics(pbmc)# Filter out genes not expressed in at least 10 cellspbmc<-pbmc[rowSums(counts(pbmc)>0)>=10, ]# Normalize counts using VSTcnts<-as(SingleCellExperiment::counts(pbmc), 'dgCMatrix')colnames(cnts)<-pbmc$Barcoderownames(cnts)<-rownames(pbmc)pbmc_vst<-sctransform::vst(cnts, return_cell_attr =TRUE)corrected_cnts<-sctransform::correct(pbmc_vst)assay(pbmc, 'corrected_counts', withDimnames =FALSE)<-corrected_cntsassay(pbmc, 'logcounts', withDimnames =FALSE)<-log1p(corrected_cnts)# Computing biological variance of each genepbmc_variance<-scran::modelGeneVar(pbmc)HVGs<-scran::getTopHVGs(pbmc_variance, prop =0.1)rowData(pbmc)$isHVG<-rownames(pbmc)%in%HVGs# Embedding dataset in PCA space and removing technical variancepbmc<-scran::denoisePCA(pbmc, technical =pbmc_variance, subset.row =HVGs, min.rank =15)# Embedding dataset in shared k-nearest neighbors graph for clustering graph<-scran::buildSNNGraph(pbmc, use.dimred ='PCA')# Cluster cells using Louvain community finding algorithmpbmc_clust<-igraph::cluster_louvain(graph)$membershiptable(pbmc_clust)pbmc$clusters_graph<-factor(pbmc_clust)# Embedding dataset in t-SNE space for visualizationpbmc<-scater::runTSNE(pbmc)
1. Differential expression analysis and marker genes
To interpret clustering results, one needs to identify the genes that drive separation between clusters. These marker genes allow to assign biological meaning to each cluster based on their functional annotation. In the most obvious case, the marker genes for each cluster are a priori associated with particular cell types, allowing us to treat the clustering as a proxy for cell type identity.
A general strategy is to perform DE tests between pairs of clusters and then combine results into a single ranking of marker genes for each cluster.
Run the function on the PBMC dataset to find all the markers associated with individual graph-based clusters.
Answer
markers<-scran::findMarkers(pbmc, groups =pbmc$clusters_graph)markers%>%as('list')%>%map(function(x){as_tibble(x, rownames ='gene')%>%filter(Top<=5)})%>%bind_rows(.id ='cluster')
Question
Re-run scran::findMarkers() to only find markers strongly overexpressed in each cluster.
Answer
markers<-scran::findMarkers(pbmc, groups =pbmc$clusters_graph, direction ="up", lfc =1)head(markers[[1]])markers%>%as('list')%>%map(function(x){as_tibble(x, rownames ='gene')%>%filter(Top<=5)})%>%bind_rows(.id ='cluster')
Question
Plot average expression of the first marker of the first cluster in tSNE
Check knwon PBMC markers in the Human Protein Atlas, which compiles a very nice overview of gene expression in different cell types, e.g. here.
Looking at these PBMC markers in the dataset, speculate to propose a label for each cluster in this 4K PBMC dataset.
Answer
markers<-c('FCER1A', # DC markers'GNLY', # NK markers'PPBP', # Platelets markers'MS4A7', # Monocytes markers'MS4A1', # B cell markers'IL7R', # CD4 T cell markers'CD8A', 'CD8B'# CD8 T cell markers)p<-lapply(markers, function(g){scater::plotReducedDim(pbmc, 'TSNE', colour_by =g)+ggtitle(g)+theme(legend.position ='none')+theme_bw()})%>%cowplot::plot_grid(plotlist =.)
2. Automated cell annotation
Many human cell type reference databases are available over the Internet, especially for blood tissue. Today, we will use a reference constructed from Monaco et al., Cell Reports 2019 (doi: 10.1016/j.celrep.2019.01.041). This reference is available as a SummarizedExperiment containing log-normalized gene expression for manually annotated samples.
Question
Import Monaco dataset in R. Inspect its content. The structure of the object should feel familiar: it’s a Summarized Experiment!
Check the publication report. How was each sample (column) obtained? What type of sequencing?
What types of cell annotation are available? Can this reference be useful for the annotation of the PBMC dataset?
Note how cells from cluster 1 and 2 are both robustly identifed as DCs. Yet, they appear in tSNE as 2 well-separated clusters. This discrepancy most likely comes from the fact that at a finer level, they seem to be 2 different types of DCs: plasmacytoid DCs and myeloid DCs.
Question
Check genes preferentially enriched in plasma. DCs vs myeloid DCs
Answer
DCs<-pbmc[ , pbmc$label_hierarchy_1=='DC']markers<-scran::findMarkers(DCs, groups =DCs$label_hierarchy_2, direction ="up", lfc =1)markers[[2]]%>%as_tibble(rownames ='gene')%>%dplyr::filter(summary.logFC>log2(2), FDR<=0.01)
3. Subclustering of T cells
T cells are spatially separated in 2 or 3 broad groups. However, their complexity is much more important than this. Despite the fine annotations obtained from transfer of Monaco data, T cells heterogeneity are poorly resolved.
Question
Subset the T cells and re-process them (variance modelling, PCA embedding, graph-based clustering and tSNE embedding)
Answer
Tcells<-pbmc[ , pbmc$label_hierarchy_1=='T']# Computing biological variance of each geneset.seed(1000)Tcells_variance<-scran::modelGeneVar(Tcells)HVGs<-scran::getTopHVGs(Tcells_variance, prop =0.2)rowData(Tcells)$isHVG<-rownames(Tcells)%in%HVGs# Embedding dataset in PCA space and removing technical varianceset.seed(1000)Tcells<-scran::denoisePCA(Tcells, technical =Tcells_variance, subset.row =HVGs, min.rank =15)# Embedding dataset in shared k-nearest neighbors graph for clustering graph<-scran::buildSNNGraph(Tcells, k =5, use.dimred ='PCA', type ='jaccard')# Cluster cells using Louvain community finding algorithmTcells_clust<-igraph::cluster_louvain(graph)$membershiptable(Tcells_clust)Tcells$subclusters_graph<-factor(Tcells_clust)table(Tcells$subclusters_graph, Tcells$clusters_graph)# Embedding dataset in t-SNE space for visualizationset.seed(1000)Tcells<-scater::runTSNE(Tcells, name ='subTSNE')# Visualize earlier clustering and new clustering p<-cowplot::plot_grid(scater::plotReducedDim(Tcells, dimred ='TSNE', colour_by ='clusters_graph', text_by ='clusters_graph')+ggtitle('T cells, original tSNE, original clusters'),scater::plotReducedDim(Tcells, dimred ='subTSNE', colour_by ='clusters_graph', text_by ='clusters_graph')+ggtitle('T cells, new tSNE, original clusters'),scater::plotReducedDim(Tcells, dimred ='subTSNE', colour_by ='subclusters_graph', text_by ='subclusters_graph')+ggtitle('T cells, new tSNE, new clusters'))
Question
Re-tranfer annotations from Monaco et al. to only T cells
Does re-transferring annotations on a subset of cells change the annotation obtained for each individual cell?
Answer
Tcells_predictions_main<-SingleR::SingleR( test =Tcells, ref =monaco, labels =monaco$label.main)Tcells_predictions_fine<-SingleR::SingleR( test =Tcells, ref =monaco, labels =monaco$label.fine)Tcells$subannotation_hierarchy_1<-Tcells_predictions_main$labelsTcells$subannotation_hierarchy_2<-Tcells_predictions_fine$labelstable(Tcells$subannotation_hierarchy_1, Tcells$annotation_hierarchy_1)table(Tcells$subannotation_hierarchy_2, Tcells$annotation_hierarchy_2)
Question
Compare the subclusters to transferred annotations. Does subclustering help better representing the heterogeneity of T cells?
Propose alternative(s) to better resolve single-cell transcriptomes of T cells.
Answer
# Visualize earlier clustering and new clustering p<-cowplot::plot_grid(scater::plotReducedDim(Tcells, dimred ='subTSNE', colour_by ='subclusters_graph', text_by ='subclusters_graph')+ggtitle('T cells, new tSNE, new clusters'), scater::plotReducedDim(Tcells, dimred ='subTSNE', colour_by ='annotation_hierarchy_2')+ggtitle('T cells, new tSNE, transferred annotations'))# Compare T cells original clusters and new subclusters to transferred fine annotationstable(Tcells$annotation_hierarchy_2, Tcells$clusters_graph)table(Tcells$subannotation_hierarchy_2, Tcells$subclusters_graph)p<-pheatmap::pheatmap(log2(table(Assigned =Tcells$annotation_hierarchy_2, Cluster =Tcells$clusters_graph)+10), color=colorRampPalette(c("white", "darkred"))(101), cluster_rows =FALSE, cluster_cols =FALSE)p<-pheatmap::pheatmap(log2(table(Assigned =Tcells$annotation_hierarchy_2, Cluster =Tcells$subclusters_graph)+10), color=colorRampPalette(c("white", "darkred"))(101), cluster_rows =FALSE, cluster_cols =FALSE)