7 Lab 4 - Single-cell RNA-seq data wrangling

Aims

To give you experience with the analysis of single cell RNA sequencing (scRNA-seq) including performing quality control and identifying cell type subsets.
To introduce you to scRNA-seq analysis using Bioconductor packages.

7.1 Introduction

Data produced in a single cell RNA-seq experiment has several interesting characteristics that make it distinct from data produced in a bulk population RNA-seq experiment. Two characteristics that are important to keep in mind when working with scRNA-Seq are drop-out (the excessive amount of zeros due to limiting mRNA) and the potential for quality control (QC) metrics to be confounded with biology. This combined with the ability to measure heterogeniety from cells in samples has shifted the field away from the typical analysis in population-based RNA-Seq. Here we demonstrate some approaches to quality control, followed by identifying and analyzing cell subsets.

7.1.1 Load necessary packages

When loading libraries, we are asking R to load code for us written by someone else. It is a convenient way to leverage and reproduce methodology developed by others.

library(tidyverse)
library(SingleCellExperiment)
library(scater)
library(scran)

7.1.2 Read in Pancreas counts matrix.

For this tutorial, we will be analyzing a human pancreas scRNAseq dataset. It is freely available from GEO: link. We start by downloading the cell, features and counts matrix.

Get the downloadable links for each file

download.file('https://ftp.ncbi.nlm.nih.gov/geo/series/GSE114nnn/GSE114802/suppl/GSE114802_org4_barcodes.tsv.gz', 'GSE114802_org4_barcodes.tsv.gz')
download.file('https://ftp.ncbi.nlm.nih.gov/geo/series/GSE114nnn/GSE114802/suppl/GSE114802_org4_genes.tsv.gz', 'GSE114802_org4_genes.tsv.gz')
download.file('https://ftp.ncbi.nlm.nih.gov/geo/series/GSE114nnn/GSE114802/suppl/GSE114802_org4_counts.csv.gz', 'GSE114802_org4_counts.csv.gz')

Import each table in R

cells <- read_tsv('~/Share/GSE114802_org4_barcodes.tsv.gz', col_names = FALSE)
genes <- read_tsv('~/Share/GSE114802_org4_genes.tsv.gz', col_names = FALSE)
counts <- read_csv('~/Share/GSE114802_org4_counts.csv.gz', col_names = TRUE)
counts <- counts[, -1]
counts <- as(counts, 'matrix')
counts <- as(counts, 'dgCMatrix')
rownames(counts) <- genes$X1

Transform into a SingleCellExperiment object

sce <- SingleCellExperiment(
    colData = cells, 
    rowData = genes, 
    assays = list('counts' = counts)
)

Examine the SingleCellExperiment object you’ve just created. Get an idea of the size of the dataset, the different data available, etc.

colData(sce)
rowData(sce)
metadata(sce)
dim(sce)
assays(sce)
counts(sce)[1:10, 1:10]
reducedDims(sce)

How much memory does a sparse matrix take up relative to a dense matrix? (use object.size() to get the size of an object…)

counts <- counts(sce)
object.size(counts) # size in bytes
object.size(as.matrix(counts)) # size in bytes

Compare it to the sparsity of the counts (the % of the counts equal to 0)

sum(counts > 0) / (nrow(sce)*ncol(sce))

7.2 Basic QCs

You can learn a lot about your scRNA-seq data’s quality with simple plotting.
Let’s do some plotting to look at the number of reads per cell, reads per genes, expressed genes per cell (often called complexity), and rarity of genes (cells expressing genes).

Look at the summary counts for genes and cells

counts_per_cell <- Matrix::colSums(counts)
counts_per_gene <- Matrix::rowSums(counts)
genes_per_cell <- Matrix::colSums(counts > 0) # count gene only if it has non-zero reads mapped.
hist(log10(counts_per_cell+1), main = '# of counts per cell', col = 'wheat')
hist(log10(genes_per_cell+1), main = '# of expressed genes per cell', col = 'wheat')
plot(counts_per_cell, genes_per_cell, log = 'xy', col = 'wheat')
title('Counts vs genes per cell')

Can you plot a histogram of counts per gene in log10 scale?

cells_per_gene <- Matrix::rowSums(counts > 0) # only count cells where the gene is expressed
hist(log10(cells_per_gene+1), main = '# of cells expressing each gene', col = 'wheat')

Plot cells ranked by their number of detected genes

To do that, first sort cells by their library complexity, ie the number of genes detected per cell.
This is a very useful plot as it shows the distribution of library complexity in the sequencing run.
One can use this plot to investigate observations (potential cells) that are actually failed libraries (lower end outliers) or observations that are cell doublets (higher end outliers).

plot(sort(genes_per_cell), xlab = 'cell', log = 'y', main = '# of genes per cell (ordered)')

Several QCs can be automatically computed using quickPerCellQC(). Try it out and check the results.
What are the total and detected columns?

sce <- scuttle::quickPerCellQC(sce)
colData(sce)

7.3 Access to stored informations

7.3.1 Assay slots

For typical scRNA-seq experiments, a SingleCellExperiment can have multiple assays, corresponding to different metrics. The most basic one is counts.
Different assays store different ‘transformations’ of the counts(e.g. `logcounts).

Try to manually compute logcounts from counts and store it in a new slot

assay(sce, 'logcounts') <- log10(counts(sce) + 1)

7.3.2 Embeddings

Embeddings allow for a representation of large-scale data (N cells x M genes) into smaller dimensions (e.g. 2-50 dimensions). Typical embeddings can be PCA, t-SNE, UMAP, etc… Many embeddings can be computed using run...() functions from Bioconductor packages (e.g. scran, scater, …).

Compute PCA embedding of the dataset using runPCA() from scater package

sce <- scater::runPCA(sce)
plotReducedDim(sce, "PCA")

Compute t-SNE embedding of the dataset using runTSNE() from scater package

sce <- scater::runTSNE(sce)
plotReducedDim(sce, "TSNE")

Compute UMAP embedding of the dataset using runUMAP() from scater package

sce <- scater::runUMAP(sce)
plotReducedDim(sce, "UMAP", colour_by = 'sum')
plotReducedDim(sce, "UMAP", colour_by = 'detected')

7.3.3 Multiple modalities

Alternative ‘modalities’ can be stored in the same SingleCellExperiment object (e.g. if you perform paired single-cell RNA-seq and ATAC-seq). This is done through altExps which can store summarized experiments.

Try to add an altExp (using altExp<- function)

altExp(sce, "ATAC_counts") <- SummarizedExperiment(matrix(rpois(1000, 5), ncol = ncol(sce)))
swapAltExp(sce, "ATAC_counts", saved = "RNA_counts")

Note that features can be different between different altExps.

7.4 Filtering cells and features

7.4.1 Pre-filtering

Filter the SCE to only include (1) cells that have a complexity of 2000 genes or more and (2) genes that are are expressed in 10 or more cells.

sce_filtered <- sce[
    Matrix::rowSums(counts(sce) > 0) > 10, 
    Matrix::colSums(counts(sce) > 0) > 2000
 ]

Almost all our analysis will be on this single object, of class SingleCellExperiment. This object contains various “slots” that will store not only the raw count data, but also the results from various computations below. This has the advantage that we do not need to keep track of inidividual variables of interest - they can all be collapsed into a single object as long as these slots are pre-defined.

7.4.2 Filtering low-quality cells: mitochondrial counts

For each cell, we can calculate the percentage of counts mapping on mitochondrial genes and store it in a column percent_mito in our colData().

Find mitochondrial genes, compute the % of total counts associated with these genes, and store it in colData

rowData(sce_filtered)
mito_genes <- rownames(sce_filtered)[grep(pattern = "^MT-", x = rowData(sce_filtered)$X2)]
mito_genes_counts <- counts(sce_filtered)[mito_genes, ]
percent_mito <- colSums(mito_genes_counts) / sce_filtered$total
hist(percent_mito*100, main = '% of total counts over mitochondrial genes', col = 'wheat')
colData(sce_filtered)$percent_mito <- percent_mito

Remove cells with a % of mitochondrial counts greater than 10%.

sce_filtered <- sce_filtered[
    , 
    sce_filtered$percent_mito <= 0.10
]

7.4.3 Checking housekeeping genes

Another metric we use is the number of house keeping genes expressed in a cell. These genes reflect commomn processes active in a cell and hence are a good global quality measure. They are also abundant and are usually steadliy expressed in cells, thus less sensitive to the high dropout.

# Load the list of housekeeping genes
hkgenes <- read.table("Share/tirosh_house_keeping.txt", skip = 2)
hkgenes <- as.vector(hkgenes$V1)
hkgenes <- rownames(sce_filtered)[match(hkgenes, rowData(sce_filtered)$X2)]
hkgenes <- hkgenes[!is.na(hkgenes)]

Compute the number of detected HK genes for each cell and store it in colData

colData(sce_filtered)$n_expressed_hkgenes <- Matrix::colSums(counts(sce_filtered)[hkgenes, ] > 0)

Plot (in a boxplot) the relationship between the # of detected housekeeping genes and the total UMI count (or # of detected genes) per cell. Comment

colData(sce_filtered)$n_expressed_hkgenes <- Matrix::colSums(counts(sce_filtered)[hkgenes, ] > 0)
boxplot(colData(sce_filtered)$total ~ colData(sce_filtered)$n_expressed_hkgenes)
boxplot(colData(sce_filtered)$detected ~ colData(sce_filtered)$n_expressed_hkgenes)

Remove cells with a # of expressed housekeeping genes greater than 85

sce_filtered <- sce_filtered[, sce_filtered$n_expressed_hkgenes <= 85]

7.4.4 Checking gene set expression

Sometimes we want to ask what is the expression of a gene / a set of a genes across cells. This set of genes may make up a gene expression program we are interested in. Another benefit at looking at gene sets is it reduces the effects of drop outs.

Let’s look at genes involved in the stress signature upon cell dissociation. We calculate these genes average expression levels on the single cell level.

genes_dissoc <- c("ATF3", "BTG2", "CEBPB", "CEBPD", "CXCL3", "CXCL2", "CXCL1", "DNAJA1", "DNAJB1", "DUSP1", "EGR1", "FOS", "FOSB", "HSP90AA1", "HSP90AB1", "HSPA1A", "HSPA1B", "HSPA1A", "HSPA1B", "HSPA8", "HSPB1", "HSPE1", "HSPH1", "ID3", "IER2", "JUN", "JUNB", "JUND", "MT1X", "NFKBIA", "NR4A1", "PPP1R15A", "SOCS3", "ZFP36")
genes_dissoc <- rownames(sce_filtered)[match(genes_dissoc, rowData(sce_filtered)$X2)]
genes_dissoc <- unique(genes_dissoc[!is.na(genes_dissoc)])

Calculate the average gene set expression for each cell

ave_expr_genes_dissoc <- colMeans(logcounts(sce_filtered[genes_dissoc, ]))
colData(sce_filtered)$ave_expr_genes_dissoc <- ave_expr_genes_dissoc

Plot an embedding of the dataset, using a color scale representing the average expression of genes involved in the stress signature upon cell dissociation. Comment.

plotReducedDim(sce_filtered, dimred = 'PCA', colour_by = 'ave_expr_genes_dissoc')

7.5 Session info

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.0 (2023-04-21)
 os       macOS Monterey 12.5.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_GB.UTF-8
 ctype    en_GB.UTF-8
 tz       Europe/Paris
 date     2023-06-08
 pandoc   2.19.2 @ /opt/homebrew/bin/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 cachem        1.0.7   2023-02-24 [1] CRAN (R 4.3.0)
 callr         3.7.3   2022-11-02 [1] CRAN (R 4.3.0)
 cli           3.6.0   2023-01-09 [1] CRAN (R 4.3.0)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.3.0)
 devtools      2.4.5   2022-10-11 [1] CRAN (R 4.3.0)
 digest        0.6.31  2022-12-11 [1] CRAN (R 4.3.0)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.3.0)
 evaluate      0.20    2023-01-17 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 fs            1.6.1   2023-02-06 [1] CRAN (R 4.3.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
 htmltools     0.5.4   2022-12-07 [1] CRAN (R 4.3.0)
 htmlwidgets   1.6.1   2023-01-07 [1] CRAN (R 4.3.0)
 httpuv        1.6.9   2023-02-14 [1] CRAN (R 4.3.0)
 jsonlite      1.8.4   2022-12-06 [1] CRAN (R 4.3.0)
 knitr         1.42    2023-01-25 [1] CRAN (R 4.3.0)
 later         1.3.0   2021-08-18 [1] CRAN (R 4.3.0)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 memoise       2.0.1   2021-11-26 [1] CRAN (R 4.3.0)
 mime          0.12    2021-09-28 [1] CRAN (R 4.3.0)
 miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.3.0)
 pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.3.0)
 pkgload       1.3.2   2022-11-16 [1] CRAN (R 4.3.0)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.3.0)
 processx      3.8.0   2022-10-26 [1] CRAN (R 4.3.0)
 profvis       0.3.7   2020-11-02 [1] CRAN (R 4.3.0)
 promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.3.0)
 ps            1.7.2   2022-10-26 [1] CRAN (R 4.3.0)
 purrr         1.0.1   2023-01-10 [1] CRAN (R 4.3.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 Rcpp          1.0.10  2023-01-22 [1] CRAN (R 4.3.0)
 remotes       2.4.2   2021-11-30 [1] CRAN (R 4.3.0)
 rlang         1.0.6   2022-09-24 [1] CRAN (R 4.3.0)
 rmarkdown     2.20    2023-01-19 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 shiny         1.7.4   2022-12-15 [1] CRAN (R 4.3.0)
 stringi       1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
 stringr       1.5.0   2022-12-02 [1] CRAN (R 4.3.0)
 urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.3.0)
 usethis       2.1.6   2022-05-25 [1] CRAN (R 4.3.0)
 vctrs         0.5.2   2023-01-23 [1] CRAN (R 4.3.0)
 xfun          0.37    2023-01-31 [1] CRAN (R 4.3.0)
 xtable        1.8-4   2019-04-21 [1] CRAN (R 4.3.0)

 [1] /Users/jacques/Library/R/arm64/4.3/library
 [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────