7 Lab 4 - Single-cell RNA-seq data wrangling
- To give you experience with the analysis of single cell RNA sequencing (scRNA-seq) including performing quality control and identifying cell type subsets.
- To introduce you to scRNA-seq analysis using Bioconductor packages.
7.1 Introduction
Data produced in a single cell RNA-seq experiment has several interesting characteristics that make it distinct from data produced in a bulk population RNA-seq experiment. Two characteristics that are important to keep in mind when working with scRNA-Seq are drop-out (the excessive amount of zeros due to limiting mRNA) and the potential for quality control (QC) metrics to be confounded with biology. This combined with the ability to measure heterogeniety from cells in samples has shifted the field away from the typical analysis in population-based RNA-Seq. Here we demonstrate some approaches to quality control, followed by identifying and analyzing cell subsets.
7.1.1 Load necessary packages
When loading libraries, we are asking R to load code for us written by someone else. It is a convenient way to leverage and reproduce methodology developed by others.
7.1.2 Read in Pancreas counts matrix.
For this tutorial, we will be analyzing a human pancreas scRNAseq dataset. It is freely available from GEO: link. We start by downloading the cell, features and counts matrix.
- Get the downloadable links for each file
R
download.file('https://ftp.ncbi.nlm.nih.gov/geo/series/GSE114nnn/GSE114802/suppl/GSE114802_org4_barcodes.tsv.gz', 'GSE114802_org4_barcodes.tsv.gz')
download.file('https://ftp.ncbi.nlm.nih.gov/geo/series/GSE114nnn/GSE114802/suppl/GSE114802_org4_genes.tsv.gz', 'GSE114802_org4_genes.tsv.gz')
download.file('https://ftp.ncbi.nlm.nih.gov/geo/series/GSE114nnn/GSE114802/suppl/GSE114802_org4_counts.csv.gz', 'GSE114802_org4_counts.csv.gz')- Import each table in
R
R
cells <- read_tsv('~/Share/GSE114802_org4_barcodes.tsv.gz', col_names = FALSE)
genes <- read_tsv('~/Share/GSE114802_org4_genes.tsv.gz', col_names = FALSE)
counts <- read_csv('~/Share/GSE114802_org4_counts.csv.gz', col_names = TRUE)
counts <- counts[, -1]
counts <- as(counts, 'matrix')
counts <- as(counts, 'dgCMatrix')
rownames(counts) <- genes$X1- Transform into a
SingleCellExperimentobject
R
sce <- SingleCellExperiment(
colData = cells,
rowData = genes,
assays = list('counts' = counts)
)- Examine the
SingleCellExperimentobject you’ve just created. Get an idea of the size of the dataset, the different data available, etc.
- How much memory does a sparse matrix take up relative to a dense matrix? (use
object.size()to get the size of an object…)
R
counts <- counts(sce)
object.size(counts) # size in bytes
object.size(as.matrix(counts)) # size in bytes- Compare it to the sparsity of the counts (the % of the counts equal to 0)
7.2 Basic QCs
You can learn a lot about your scRNA-seq data’s quality with simple plotting.
Let’s do some plotting to look at the number of reads per cell, reads per genes, expressed genes per cell (often called complexity), and rarity of genes (cells expressing genes).
- Look at the summary counts for genes and cells
R
counts_per_cell <- Matrix::colSums(counts)
counts_per_gene <- Matrix::rowSums(counts)
genes_per_cell <- Matrix::colSums(counts > 0) # count gene only if it has non-zero reads mapped.
hist(log10(counts_per_cell+1), main = '# of counts per cell', col = 'wheat')
hist(log10(genes_per_cell+1), main = '# of expressed genes per cell', col = 'wheat')
plot(counts_per_cell, genes_per_cell, log = 'xy', col = 'wheat')
title('Counts vs genes per cell')- Can you plot a histogram of counts per gene in log10 scale?
- Plot cells ranked by their number of detected genes
To do that, first sort cells by their library complexity, ie the number of genes detected per cell.
This is a very useful plot as it shows the distribution of library complexity in the sequencing run.
One can use this plot to investigate observations (potential cells) that are actually failed libraries (lower end outliers) or observations that are cell doublets (higher end outliers).
- Several QCs can be automatically computed using
quickPerCellQC(). Try it out and check the results. - What are the
totalanddetectedcolumns?
R
sce <- scuttle::quickPerCellQC(sce)
colData(sce)7.3 Access to stored informations
7.3.1 Assay slots
For typical scRNA-seq experiments, a SingleCellExperiment can have multiple assays, corresponding to different metrics. The most basic one is counts.
Different assays store different ‘transformations’ of the counts(e.g. `logcounts).
- Try to manually compute logcounts from counts and store it in a new slot
7.3.2 Embeddings
Embeddings allow for a representation of large-scale data (N cells x M genes) into smaller dimensions (e.g. 2-50 dimensions). Typical embeddings can be PCA, t-SNE, UMAP, etc… Many embeddings can be computed using run...() functions from Bioconductor packages (e.g. scran, scater, …).
- Compute PCA embedding of the dataset using
runPCA()fromscaterpackage
R
sce <- scater::runPCA(sce)
plotReducedDim(sce, "PCA")- Compute t-SNE embedding of the dataset using
runTSNE()fromscaterpackage
R
sce <- scater::runTSNE(sce)
plotReducedDim(sce, "TSNE")- Compute UMAP embedding of the dataset using
runUMAP()fromscaterpackage
R
sce <- scater::runUMAP(sce)
plotReducedDim(sce, "UMAP", colour_by = 'sum')
plotReducedDim(sce, "UMAP", colour_by = 'detected')7.3.3 Multiple modalities
Alternative ‘modalities’ can be stored in the same SingleCellExperiment object (e.g. if you perform paired single-cell RNA-seq and ATAC-seq). This is done through altExps which can store summarized experiments.
- Try to add an altExp (using
altExp<-function)
R
altExp(sce, "ATAC_counts") <- SummarizedExperiment(matrix(rpois(1000, 5), ncol = ncol(sce)))
swapAltExp(sce, "ATAC_counts", saved = "RNA_counts")Note that features can be different between different altExps.
7.4 Filtering cells and features
7.4.1 Pre-filtering
- Filter the SCE to only include (1) cells that have a complexity of 2000 genes or more and (2) genes that are are expressed in 10 or more cells.
Almost all our analysis will be on this single object, of class SingleCellExperiment. This object contains various “slots” that will store not only the raw count data, but also the results from various computations below. This has the advantage that we do not need to keep track of inidividual variables of interest - they can all be collapsed into a single object as long as these slots are pre-defined.
7.4.2 Filtering low-quality cells: mitochondrial counts
For each cell, we can calculate the percentage of counts mapping on mitochondrial genes and store it in a column percent_mito in our colData().
- Find mitochondrial genes, compute the % of total counts associated with these genes, and store it in
colData
R
rowData(sce_filtered)
mito_genes <- rownames(sce_filtered)[grep(pattern = "^MT-", x = rowData(sce_filtered)$X2)]
mito_genes_counts <- counts(sce_filtered)[mito_genes, ]
percent_mito <- colSums(mito_genes_counts) / sce_filtered$total
hist(percent_mito*100, main = '% of total counts over mitochondrial genes', col = 'wheat')
colData(sce_filtered)$percent_mito <- percent_mito- Remove cells with a % of mitochondrial counts greater than 10%.
R
sce_filtered <- sce_filtered[
,
sce_filtered$percent_mito <= 0.10
]7.4.3 Checking housekeeping genes
Another metric we use is the number of house keeping genes expressed in a cell. These genes reflect commomn processes active in a cell and hence are a good global quality measure. They are also abundant and are usually steadliy expressed in cells, thus less sensitive to the high dropout.
- Compute the number of detected HK genes for each cell and store it in
colData
- Plot (in a boxplot) the relationship between the # of detected housekeeping genes and the total UMI count (or # of detected genes) per cell. Comment
- Remove cells with a # of expressed housekeeping genes greater than 85
R
sce_filtered <- sce_filtered[, sce_filtered$n_expressed_hkgenes <= 85]7.4.4 Checking gene set expression
Sometimes we want to ask what is the expression of a gene / a set of a genes across cells. This set of genes may make up a gene expression program we are interested in. Another benefit at looking at gene sets is it reduces the effects of drop outs.
Let’s look at genes involved in the stress signature upon cell dissociation. We calculate these genes average expression levels on the single cell level.
R
genes_dissoc <- c("ATF3", "BTG2", "CEBPB", "CEBPD", "CXCL3", "CXCL2", "CXCL1", "DNAJA1", "DNAJB1", "DUSP1", "EGR1", "FOS", "FOSB", "HSP90AA1", "HSP90AB1", "HSPA1A", "HSPA1B", "HSPA1A", "HSPA1B", "HSPA8", "HSPB1", "HSPE1", "HSPH1", "ID3", "IER2", "JUN", "JUNB", "JUND", "MT1X", "NFKBIA", "NR4A1", "PPP1R15A", "SOCS3", "ZFP36")
genes_dissoc <- rownames(sce_filtered)[match(genes_dissoc, rowData(sce_filtered)$X2)]
genes_dissoc <- unique(genes_dissoc[!is.na(genes_dissoc)])- Calculate the average gene set expression for each cell
- Plot an embedding of the dataset, using a color scale representing the average expression of genes involved in the stress signature upon cell dissociation. Comment.
R
plotReducedDim(sce_filtered, dimred = 'PCA', colour_by = 'ave_expr_genes_dissoc')7.5 Session info
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.3.0 (2023-04-21)
os macOS Monterey 12.5.1
system aarch64, darwin20
ui X11
language (EN)
collate en_GB.UTF-8
ctype en_GB.UTF-8
tz Europe/Paris
date 2023-06-08
pandoc 2.19.2 @ /opt/homebrew/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
cachem 1.0.7 2023-02-24 [1] CRAN (R 4.3.0)
callr 3.7.3 2022-11-02 [1] CRAN (R 4.3.0)
cli 3.6.0 2023-01-09 [1] CRAN (R 4.3.0)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.3.0)
digest 0.6.31 2022-12-11 [1] CRAN (R 4.3.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.3.0)
evaluate 0.20 2023-01-17 [1] CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
fs 1.6.1 2023-02-06 [1] CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.3.0)
htmlwidgets 1.6.1 2023-01-07 [1] CRAN (R 4.3.0)
httpuv 1.6.9 2023-02-14 [1] CRAN (R 4.3.0)
jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.3.0)
knitr 1.42 2023-01-25 [1] CRAN (R 4.3.0)
later 1.3.0 2021-08-18 [1] CRAN (R 4.3.0)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.3.0)
mime 0.12 2021-09-28 [1] CRAN (R 4.3.0)
miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.0)
pkgbuild 1.4.0 2022-11-27 [1] CRAN (R 4.3.0)
pkgload 1.3.2 2022-11-16 [1] CRAN (R 4.3.0)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.3.0)
processx 3.8.0 2022-10-26 [1] CRAN (R 4.3.0)
profvis 0.3.7 2020-11-02 [1] CRAN (R 4.3.0)
promises 1.2.0.1 2021-02-11 [1] CRAN (R 4.3.0)
ps 1.7.2 2022-10-26 [1] CRAN (R 4.3.0)
purrr 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
Rcpp 1.0.10 2023-01-22 [1] CRAN (R 4.3.0)
remotes 2.4.2 2021-11-30 [1] CRAN (R 4.3.0)
rlang 1.0.6 2022-09-24 [1] CRAN (R 4.3.0)
rmarkdown 2.20 2023-01-19 [1] CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
shiny 1.7.4 2022-12-15 [1] CRAN (R 4.3.0)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)
stringr 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.0)
usethis 2.1.6 2022-05-25 [1] CRAN (R 4.3.0)
vctrs 0.5.2 2023-01-23 [1] CRAN (R 4.3.0)
xfun 0.37 2023-01-31 [1] CRAN (R 4.3.0)
xtable 1.8-4 2019-04-21 [1] CRAN (R 4.3.0)
[1] /Users/jacques/Library/R/arm64/4.3/library
[2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────