Exercises: Trajectory inference and RNA velocity

Goals:


0. Prepare data from scratch

Because RNA velocity reflects the balance between immature and mature transcript content in each cell, one need to count the reads overlapping both spliced regions and unspliced regions. These counts are still generally not available when using public datasets. A way to generate them is to:

  1. Get a cellranger-generated bam file of a scRNAseq experiment
  2. Get the corresponding gene annotation file as a gtf file
  3. Run velocyto to count reads mapped to introns or to exons

Let’s do this on a dataset published by Guo et al., Cell Res. 2018 (doi: 10.1038/s41422-018-0099-2). There are 6 bam files corresponding to human male testis single-cell RNA-seq profiling (GSE: GSE112013).

Question
  • Download bam files from GEO
mkdir data/Guo_testis/
ffq -t GSE "GSE112013" | grep 'ftp://' | sed 's,.*ftp:,ftp:,' | sed 's,".*,,' > data/Guo_testis/GSE112013_bam-list.txt
for FILE in `cat data/Guo_testis/GSE112013_bam-list.txt | sed '$d'`
do
    echo $FILE
    curl ${FILE} -o data/Guo_testis/`basename ${FILE}`
done

To run velocyto, one needs to know where introns and exons are located in the genome reference used to process reads. In our case, GRCh38 genome reference was used.

Question
  • Read cellranger instructions on how to generate a gtf file corresponding to GRCh38 genome reference here.
  • Create GRCh38 gene annotation gtf file following Cellranger recommendations
mkdir data/Guo_testis/genome
curl http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.primary_assembly.annotation.gtf.gz -o data/Guo_testis/genome/gencode.v32.primary_assembly.annotation.gtf.gz
gunzip data/Guo_testis/genome/gencode.v32.primary_assembly.annotation.gtf.gz
ID="(ENS(MUS)?[GTE][0-9]+)\.([0-9]+)"
cat data/Guo_testis/genome/gencode.v32.primary_assembly.annotation.gtf \
    | sed -E 's/gene_id "'"$ID"'";/gene_id "\1"; gene_version "\3";/' \
    | sed -E 's/transcript_id "'"$ID"'";/transcript_id "\1"; transcript_version "\3";/' \
    | sed -E 's/exon_id "'"$ID"'";/exon_id "\1"; exon_version "\3";/' \
    > data/Guo_testis/genome/gencode.v32.primary_assembly.annotation_modified.gtf
cat data/Guo_testis/genome/gencode.v32.primary_assembly.annotation_modified.gtf \
    | awk '$3 == "transcript"' \
    | grep -E "$GENE_PATTERN" \
    | grep -E "$TX_PATTERN" \
    | grep -Ev "$READTHROUGH_PATTERN" \
    | grep -Ev "$PAR_PATTERN" \
    | sed -E 's/.*(gene_id "[^"]+").*/\1/' \
    | sort \
    | uniq \
    > data/Guo_testis/genome/gene_allowlist
grep -E "^#" data/Guo_testis/genome/gencode.v32.primary_assembly.annotation_modified.gtf > data/Guo_testis/genome/gencode.v32.primary_assembly.annotation_filtered.gtf
grep -Ff data/Guo_testis/genome/gene_allowlist data/Guo_testis/genome/gencode.v32.primary_assembly.annotation_modified.gtf \
    | sed -E 's,^chr,,' \
    | sed -E 's,^M\t,MT\t,' \
    >> data/Guo_testis/genome/gencode.v32.primary_assembly.annotation_filtered.gtf

To make the velocyto step faster, one can only use reads from bam files originating from cell-containing droplets. One way to do so is to extract cell barcodes from the already filtered scRNAseq dataset, and use them in the -b argument of velocyto.

Question
  • (Optional) Get cell barcodes from cellranger (this requires pre-processed scRNAseq data, available from GEO)
library(tidyverse) 
vroom::vroom('data/Guo_testis/GSE112013_Combined_UMI_table.txt') %>% 
    colnames() %>% 
    str_replace_all('Donor.-', '') %>% 
    tail(-1) %>% 
    unique() %>%
    writeLines('data/Guo_testis/testis_cell-barcodes.txt')

By now, we have obtained 3 different files:

  1. BAM files of scRNAseq data mapped onto GRCh38
  2. GTF file of GRCh38 gene annotations
  3. A barcode file for BAM file pre-filtering
Question
  • Run velocyto on each sample
mkdir data/Guo_testis/velocyto
for FILE in `cat data/Guo_testis/GSE112013_bam-list.txt`
do
    curl ${FILE} -o data/Guo_testis/`basename ${FILE}`
    velocyto run \
        -b data/Guo_testis/testis_cell-barcodes.txt \
        -o data/Guo_testis/velocyto/ \
        --samtools-threads 15 \
        -vv \
        data/Guo_testis/`basename ${FILE}` \
        data/Guo_testis/genome/gencode.v32.primary_assembly.annotation_filtered.gtf
    rm data/Guo_testis/`basename ${FILE}`
done

We have obtained six loom files, but ideally we want them merged as a single SingleCellExperiment object readable in R.

Question
  • Read about LoomExperiment package. Is there an easy (and reliable) way to import loom files in R?
  • Merge loom files directly in R and save the resulting object as a rds binary file.
library(tidyverse)
library(LoomExperiment)
looms <- list.files('data/Guo_testis/velocyto/', full.names = TRUE) %>% 
    lapply(LoomExperiment::import) %>% 
    do.call(cbind, .)
looms$sample <- str_replace_all(looms$CellID, ':.*', '')
looms$Barcode <- str_replace_all(looms$CellID, '.*:', '') %>% str_replace('x', '')
# Some additional tidying up... 
# testis_2 <- testis[rownames(testis)[rownames(testis) %in% rowData(looms)$Gene], ]
# genes <- rownames(testis_2)
# bcs <- testis_2$Barcode
# looms <- looms[match(genes, rowData(looms)$Gene), ]
# looms <- scuttle::aggregateAcrossCells(looms, looms$Barcode, use.assay.type = c('spliced', 'unspliced'))
# looms <- looms[, match(bcs, looms$Barcode)]
#
saveRDS(looms, 'data/Guo_testis/testis_velocity-counts.rds')

1. Process testis data in R

The same workflow than previous days can be reused here.

Question
  • Import testis dataset in R, filter cells and genes, normalize counts, embed data and cluster cells.
library(SingleCellExperiment)
library(tidyverse)
download.file(
    'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE112nnn/GSE112013/suppl/GSE112013_Combined_UMI_table.txt.gz', 
    'data/Guo_testis/GSE112013_Combined_UMI_table.txt.gz'
)
system('gunzip data/Guo_testis/GSE112013_Combined_UMI_table.txt.gz')
x <- vroom::vroom('data/Guo_testis/GSE112013_Combined_UMI_table.txt')
cnts <- as.matrix(x[, -1])
gData <- as.data.frame(x[, 1])
cData <- data.frame(cellid = colnames(x[, -1]))
testis <- SingleCellExperiment(
    assays = list(counts = cnts), 
    colData = cData, 
    rowData = gData
)
testis$Barcode <- str_replace(testis$cellid, 'Donor.-', '') %>% str_replace('-.', '')
testis <- testis[, !duplicated(testis$Barcode)]
testis$donor <- str_replace(testis$cellid, '-.*', '')
testis$replicate <- str_replace(testis$cellid, '.*-', '')
rownames(testis) <- rowData(testis)$Gene
set.seed(1000)

# Remove doublets
testis <- scDblFinder::scDblFinder(testis)
testis <- testis[, testis$scDblFinder.class == 'singlet']

# Recover human genomic, protein-coding gene informations
library(plyranges)
ah <- AnnotationHub::AnnotationHub()
AnnotationHub::query(ah, c('gene annotation', 'ensembl', '102', 'homo_sapiens', 'GRCh38'))
gtf <- AnnotationHub::query(ah, c('Homo_sapiens.GRCh38.102.chr.gtf'))[[1]]
genes <- gtf %>% 
    filter(type == 'gene') %>% 
    filter(gene_biotype == 'protein_coding') %>% 
    filter(gene_source == 'ensembl_havana') 
genes <- genes[!duplicated(genes$gene_name)]

# Annotate genes in testis dataset and filter out non-relevant genes
testis <- testis[genes$gene_name[genes$gene_name %in% rownames(testis)], ]
rowRanges(testis) <- genes[match(rownames(testis), genes$gene_name)]
rowData(testis) <- rowData(testis)[, c('gene_name', 'gene_id')]
rownames(testis) <- scuttle::uniquifyFeatureNames(rowData(testis)$gene_id, rowData(testis)$gene_name)

# Get preliminary QCs per cell and per gene
testis <- scuttle::addPerCellQCMetrics(testis)
testis <- scuttle::addPerFeatureQCMetrics(testis)

# Filter out genes not expressed in at least 10 cells
testis <- testis[rowSums(assay(testis, 'counts') > 0) >= 10, ]

# Normalize counts using VST
cnts <- as(assay(testis, 'counts'), 'dgCMatrix')
colnames(cnts) <- testis$cellid
rownames(cnts) <- rownames(testis)
testis_vst <- sctransform::vst(cnts, return_cell_attr = TRUE)
corrected_cnts <- sctransform::correct(testis_vst)
assay(testis, 'corrected_counts', withDimnames = FALSE) <- corrected_cnts
assay(testis, 'logcounts', withDimnames = FALSE) <- log1p(corrected_cnts)

# Computing biological variance of each gene
testis_variance <- scran::modelGeneVar(testis)
HVGs <- scran::getTopHVGs(testis_variance, prop = 0.1)
rowData(testis)$isHVG <- rownames(testis) %in% HVGs

# Embedding dataset in PCA space, correcting for batch effect
mergedBatches <- batchelor::fastMNN(
    testis, 
    batch = testis$donor, 
    subset.row = HVGs, 
    BPPARAM = BiocParallel::MulticoreParam(workers = 12)
)
reducedDim(testis, 'corrected') <- reducedDim(mergedBatches, 'corrected')

# Embedding dataset in shared k-nearest neighbors graph for clustering 
graph <- scran::buildSNNGraph(testis, use.dimred = 'corrected')

# Cluster cells using Louvain community finding algorithm
testis_clust <- igraph::cluster_louvain(graph)$membership
table(testis_clust)
testis$cluster <- factor(testis_clust)

# Embedding dataset in TSNE space for visualization
set.seed(10)
testis <- scater::runTSNE(testis, dimred = 'corrected')
p <- cowplot::plot_grid(
    scater::plotReducedDim(testis, 'corrected', colour_by = 'donor'),
    scater::plotReducedDim(testis, 'corrected', colour_by = 'cluster'),
    scater::plotReducedDim(testis, 'TSNE', colour_by = 'donor'),
    scater::plotReducedDim(testis, 'TSNE', colour_by = 'cluster')
)
Question
  • Load HPA data from internet. Try to format it as a SummarizedExperiment. What celltypes are profiled?
  • Use these cell type profiles to annotate cell types in the testis dataset.
  • How do the annotations look like? Can you find a reason why the label transfer worked so well?
download.file(
    'https://www.proteinatlas.org/download/rna_single_cell_type.tsv.zip', 
    'data/Guo_testis/rna_single_cell_type.tsv.zip'
)
system('unzip data/Guo_testis/rna_single_cell_type.tsv.zip')
system('mv rna_single_cell_type.tsv data/Guo_testis/')
HPA <- vroom::vroom('data/Guo_testis/rna_single_cell_type.tsv') %>% 
    pivot_wider(names_from = `Cell type`, values_from = 'NX') %>% 
    dplyr::select(-Gene) %>% 
    distinct(`Gene name`, .keep_all = TRUE) %>% 
    column_to_rownames('Gene name') %>% 
    SummarizedExperiment::SummarizedExperiment(assays = list('logcounts' = .))

# Transfer annotations to `testis`
predictions <- SingleR::SingleR(
    test = testis, 
    ref = HPA, 
    labels = colnames(HPA)
)
table(predictions$labels, testis$cluster)
labels <- table(predictions$labels, testis$cluster) %>% 
    data.matrix() %>% 
    apply(2, which.max) %>% 
    sort(unique(predictions$labels))[.]
names(labels) <- 1:length(labels)
testis$annotation <- labels[testis$cluster]
p <- cowplot::plot_grid(
    scater::plotReducedDim(testis, dimred = 'corrected', colour_by = 'cluster', text_by = 'cluster') + ggtitle('Testis data, PCA, graph-based clusters'), 
    scater::plotReducedDim(testis, dimred = 'corrected', colour_by = 'annotation', text_by = 'annotation') + ggtitle('PCA, Annotations transferred from HPA'),
    scater::plotReducedDim(testis, dimred = 'TSNE', colour_by = 'cluster', text_by = 'cluster') + ggtitle('Testis data, tSNE, graph-based clusters'), 
    scater::plotReducedDim(testis, dimred = 'TSNE', colour_by = 'annotation', text_by = 'annotation') + ggtitle('tSNE, Annotations transferred from HPA')
)

2. Trajectory inference (TI) in scRNAseq

An important question in scRNAseq field of research is: how to identify a cell trajectory from high-dimensional expression data and map individual cells onto it? A large number of methods have currently emerged, each one with their own specificities, assumptions, and strengths. A nice breakdown (from 2019, so already very outdated!) is available from Saelens et al., Nat. Biotech. 2018 (doi: 10.1038/s41587-019-0071-9):

Slingshot

Slingshot is perhaps one of the most widely used algorithms for users who want to focus on R-based approaches.

Question
  • Read Slingshot documentation to understand how to identify lineages in a scRNAseq dataset in R
  • Infer lineages in the testis dataset
  • Why is it recommended to infer lineages from PCA space rather than t-SNE or UMAP space, even though these spaces do “reveal” an obvious trajectory in 2D?
testis_slingshot <- slingshot::slingshot(testis, reducedDim = 'corrected')
testis_slingshot
slingshot::slingLineages(testis_slingshot)
Question
  • Check the inferred trajectory(ies) in 2D projection. You can use the embedCurves() to embed the curves in any given dimensional space. Do they fit your expectations?
pca_curve <- slingCurves(testis_slingshot, as.df = TRUE)
colnames(pca_curve) <- paste0('PC', 1:ncol(pca_curve))
tsne_curve <- slingshot::embedCurves(testis_slingshot, 'TSNE', smoother = 'loess', span = 0.1) %>% slingCurves(as.df = TRUE)
tsne_curve <- tsne_curve[order(tsne_curve$Order), ]
colnames(tsne_curve)[1:2] <- paste0('TSNE', 1:ncol(tsne_curve))
df <- tibble(
    PC1 = reducedDim(testis, 'corrected')[,1], 
    PC2 = reducedDim(testis, 'corrected')[,2], 
    TSNE1 = reducedDim(testis, 'TSNE')[,1], 
    TSNE2 = reducedDim(testis, 'TSNE')[,2], 
    cluster = testis$cluster
)
p <- cowplot::plot_grid(
    df %>% 
        ggplot() + 
        geom_point(aes(PC1, PC2, col = cluster)) + 
        geom_path(data = pca_curve, aes(x = PC1, y = PC2)) + 
        theme_bw() + 
        coord_fixed(),
    df %>% 
        ggplot() + 
        geom_point(aes(TSNE1, TSNE2, col = cluster)) + 
        geom_path(data = tsne_curve, aes(x = TSNE1, y = TSNE2)) + 
        theme_bw() + 
        coord_fixed()
)
Question
  • Filter the testis dataset to only germinal cells.
  • Re-infer lineages, using cluster annotations as information to build the MST. Note that you will first need to remove the 50th PCA dimension for slingshot to work (bug reported).
  • What do you observe? Discuss.
germcells <- testis[, testis$annotation %in% c("Spermatogonia", "Spermatocytes", "Early spermatids", "Late spermatids")]
reducedDim(germcells, 'corrected_2') <- reducedDim(germcells, 'corrected')[, 1:49]
germcells_slingshot <- slingshot::slingshot(
    germcells, 
    reducedDim = 'corrected_2', 
    clusterLabels = germcells$cluster
)
germcells$pseudotime <- slingshot::slingPseudotime(germcells_slingshot)[, 'Lineage1']

pca_curve <- slingCurves(germcells_slingshot, as.df = TRUE)
colnames(pca_curve) <- paste0('PC', 1:ncol(pca_curve))
tsne_curve <- slingshot::embedCurves(germcells_slingshot, 'TSNE', smoother = 'loess', span = 0.1) %>% slingCurves(as.df = TRUE)
tsne_curve <- tsne_curve[order(tsne_curve$Order), ]
colnames(tsne_curve) <- paste0('TSNE', 1:ncol(tsne_curve))
df <- tibble(
    PC1 = reducedDim(germcells, 'corrected')[,1], 
    PC2 = reducedDim(germcells, 'corrected')[,2], 
    TSNE1 = reducedDim(germcells, 'TSNE')[,1], 
    TSNE2 = reducedDim(germcells, 'TSNE')[,2], 
    cluster = germcells$cluster, 
    pseudotime = germcells$pseudotime
)
p <- cowplot::plot_grid(
    df %>% 
        ggplot() + 
        geom_point(aes(PC1, PC2, col = cluster)) + 
        geom_path(data = pca_curve, aes(x = PC1, y = PC2)) + 
        theme_bw() + 
        coord_fixed(),
    df %>% 
        ggplot() + 
        geom_point(aes(TSNE1, TSNE2, col = cluster)) + 
        geom_path(data = tsne_curve, aes(x = TSNE1, y = TSNE2)) + 
        theme_bw() + 
        coord_fixed(),
    df %>% 
        ggplot() + 
        geom_point(aes(PC1, PC2, col = pseudotime)) + 
        geom_path(data = pca_curve, aes(x = PC1, y = PC2)) + 
        theme_bw() + 
        coord_fixed(),
    df %>% 
        ggplot() + 
        geom_point(aes(TSNE1, TSNE2, col = pseudotime)) + 
        geom_path(data = tsne_curve, aes(x = TSNE1, y = TSNE2)) + 
        theme_bw() + 
        coord_fixed()
)

Pseudotime inference and expression modelling

The pseudotime is a metric describing the relative position of a cell in the trajectory, where cells with larger values are consider to be “after” their counterparts with smaller values. In trajectories describing time-dependent processes like differentiation, a cell’s pseudotime value is generally used as a proxy for its relative age.

Pseudotime inference

Question
  • Extract the pseudotime values automatically computed by slingshot.
  • Check the distribution of pseudotime values across the different cell clusters. What do you observe? Where you expecting this?
p <- tibble(
    annotation = factor(germcells$annotation, c("Spermatogonia", "Spermatocytes", "Early spermatids", "Late spermatids")), 
    pseudotime = germcells$pseudotime
) %>% 
    ggplot(aes(x = annotation, y = pseudotime, fill = annotation)) + 
    geom_violin(scale = 'width') + 
    geom_boxplot(outlier.shape = NULL, width = 0.1, fill = 'white') + 
    theme_bw() + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) 
Question
  • Correct pseudotime values as you would expect it to be.
germcells$pseudotime <- scales::rescale((-1 * slingshot::slingPseudotime(germcells_slingshot)[, 'Lineage1']), c(0, 1))

3. Ordering trajectory with RNA velocity

As we saw earlier, TI does not necessarily know which direction is right for a given trajectory. This can be safely estimated using RNA velocity. For a given gene, a high ratio of unspliced to spliced transcripts indicates that that gene is being actively upregulated. Conversely, a low ratio indicates that the gene is being downregulated as the rate of production and processing of pre-mRNAs cannot compensate for the degradation of mature transcripts. Thus, we can infer that cells with high and low ratios are moving towards a high- and low-expression state, respectively, allowing us to assign directionality to trajectories or even individual cells.

Question
  • Read velociraptor documentation. What do you need to compute RNA velocity scores in R?
  • Import spliced and unspliced counts computed with velocyto in R.
  • Try and compute RNA velocity (on germcells only). What do you see?
looms <- readRDS('data/Guo_testis/testis_velocity-counts.rds')
assays(looms)
rownames(looms) <- rowData(looms)$Gene
testis <- testis[rownames(looms), ]
assay(testis, 'spliced', withDimnames=FALSE) <- assay(looms, 'spliced')
assay(testis, 'unspliced', withDimnames=FALSE) <- assay(looms, 'unspliced')
germcells <- testis[, testis$annotation %in% c("Spermatogonia", "Spermatocytes", "Early spermatids", "Late spermatids")]
velo_out <- velociraptor::scvelo(
    germcells, 
    assay.X = "counts", 
    use.dimred = "corrected", 
    subset.row = scran::getTopHVGs(scran::modelGeneVar(germcells), prop = 0.1), 
    mode = 'dynamical'
)
embedded_velo <- velociraptor::embedVelocity(reducedDim(germcells, "TSNE"), velo_out)
grid.df <- velociraptor::gridVectors(reducedDim(germcells, "TSNE"), embedded_velo, resolution = 30)
p <- scater::plotReducedDim(germcells, 'TSNE', colour_by = "annotation", point_alpha = 0.5) +
    geom_segment(
        data = grid.df, 
        mapping = aes(x = start.1, y = start.2, xend = end.1, yend = end.2), 
        arrow = arrow(length = unit(0.05, "inches"), type = "closed")
    )
Back to top