Homework 3

Goals

Import, filter, normalize and plot cells from human hematopoiesis scATACseq dataset
Compute gene activity scores and check known markers
Run chromVAR to identify TF activity in each cell cluster

1. Process human hematopoiesis dataset

Download data

The data comes from Satpathy et al., Nat. Biotechnol. 2019 (DOI: 10.1038/s41587-019-0206-z). Counts data is available from GEO (GSE: GSE129785).

Download the files related to scATACseq of all human hematopoiesis cells.

mkdir -p data/Homework_3/scATAC/
curl https://ftp.ncbi.nlm.nih.gov/geo/series/GSE129nnn/GSE129785/suppl/GSE129785_scATAC-Hematopoiesis-All.cell_barcodes.txt.gz -o data/Homework_3/scATAC/barcodes.tsv.gz
curl https://ftp.ncbi.nlm.nih.gov/geo/series/GSE129nnn/GSE129785/suppl/GSE129785_scATAC-Hematopoiesis-All.mtx.gz -o data/Homework_3/scATAC/matrix.mtx.gz
curl https://ftp.ncbi.nlm.nih.gov/geo/series/GSE129nnn/GSE129785/suppl/GSE129785_scATAC-Hematopoiesis-All.peaks.txt.gz -o data/Homework_3/scATAC/features.tsv.gz

Process data wih Signac

Import data

Notice how the count matrix is in a different format than previously seen. You will frequently encounter this type of situation. Analysis packages often provide import functions for the different input formats. Check Seurat documentation if needed, to identify which function can be used to import a mtx format.

cnts <- Seurat::Read10X('data/Homework_3/scATAC/')

Does it work? Why? The most likely reason is that the count matrix is not structured exactly like the one which comes straight out of a cellranger count pipeline. It is a very frequent issue occurring when one attempts to analyse public data.

The alternative, longer route is to import each file separately, make sure they are compatible with each other, and merge them manually into a SeuratChromatinAssay.

## - Load required libraries 
library(tidyverse)
library(plyranges)
library(Seurat)
library(Signac)

## - Import counts, barcodes and features separately
barcode <- file.path('data/Homework_3/scATAC/', "barcodes.tsv.gz")
features <- file.path('data/Homework_3/scATAC/', "features.tsv.gz")
matrix <- file.path('data/Homework_3/scATAC/', "matrix.mtx.gz")
cnts <- Matrix::readMM('data/Homework_3/scATAC/matrix.mtx.gz')
barcodes <- vroom::vroom('data/Homework_3/scATAC/barcodes.tsv.gz') %>% as.data.frame()
features <- readLines('data/Homework_3/scATAC/features.tsv.gz')[-1] %>% 
    str_replace('_', ':') %>% 
    str_replace('_', '-') %>% 
    GRanges() %>% 
    mutate(peak = as.character(.) %>% str_replace(':', '-'))

## - Make sure that the row/colnames for the count matrix match those of 
#       barcodes and features
colnames(cnts) <- rownames(barcodes) <- barcodes$Group_Barcode
rownames(cnts) <- names(features) <- features$peak

## - Get human gene annotations (hg19/GRCh37) to feed it into the future `SeuratChromatinAssay`
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75)
seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- "hg19"

## - Create Chromatin Assay
assay <- Signac::CreateChromatinAssay(
    counts = cnts,
    ranges = features,
    genome = "hg19",
    min.cells = 10, 
    min.features = 10, 
    annotation = annotations
)
hemato <- Seurat::CreateSeuratObject(
    counts = assay,
    assay = 'ATAC',
    meta.data = barcodes
)

You will find that CreateChromatinAssay() function takes a long time to run. This is because it internally computes metrics from the count matrix. You can check the cell metadata to see which ones were calculated (on top of the ones which come from the barcodes data frame).

head(hemato[[]])
colnames(hemato[[]])[!colnames(hemato[[]]) %in% colnames(barcodes)]

Process data

Is the fragments file (which should exist for any scATACseq experiment) available for this experiment? Check the related GEO webpage.

For which analysis are the fragments essential, exactly? Can we still perform normalization/clustering/annotation without them? And motif enrichment analysis?

Since we don’t have the fragments file at hand, most of the QC steps are not available (e.g. TSSEnrichment, NucleosomeSignal or fragment size distribution).

(Note that the fragments file is actually available as raw data from GEO. Feel free to download it and perform further QC).

Still, we do have other pre-computed metrics (e.g. FRIP and depth). Check these metrics and filter the Seurat object (cells and features) as deemed appropriate.

## - Filter data 
quantile(hemato$depth, seq(0, 1, 0.1))
quantile(hemato$nFeature_ATAC, seq(0, 1, 0.1))
quantile(hemato$nCount_ATAC, seq(0, 1, 0.1))
quantile(hemato$FRIP, seq(0, 1, 0.1))
hemato <- subset(hemato, subset = depth > 5000 & depth < 200000) 
hemato <- subset(hemato, subset = nCount_ATAC > 3000 & nCount_ATAC < 100000) 
hemato <- subset(hemato, subset = nFeature_ATAC > 1000 & nFeature_ATAC < 20000) 
hemato <- subset(hemato, subset = FRIP > 0.30 & FRIP < 20000) 

## - Remove peaks with low coverage
hemato <- hemato[rowSums(GetAssayData(hemato, slot = "counts")) > 10, ]
hemato <- hemato[rowSums(GetAssayData(hemato, slot = "counts") > 0) > 10, ]

Now that the dataset is filtered, one can normalize (by using TF-IDF approach) then further reduce the dimensionality for visualization purposes.

## - Normalize data 
hemato <- Signac::RunTFIDF(hemato) 

## - Reduce dimensionality
hemato <- Signac::FindTopFeatures(hemato, min.cutoff = 'q50') 
hemato <- Signac::RunSVD(hemato) 

## - Label clusters according to the original publication
clusters <- c(
    'Cluster1' = 'HSC',
    'Cluster2' = 'MEP',
    'Cluster3' = 'CMP',
    'Cluster4' = 'LMPP',
    'Cluster5' = 'CLP',
    'Cluster6' = 'Pro-B',
    'Cluster7' = 'Pre-B',
    'Cluster8' = 'GMP',
    'Cluster9' = 'MDP',
    'Cluster10' = 'pDC',
    'Cluster11' = 'cDC',
    'Cluster12' = 'Mono.',
    'Cluster13' = 'Mono.',
    'Cluster14' = 'Nai. B',
    'Cluster15' = 'Mem. B',
    'Cluster16' = 'Plasma',
    'Cluster17' = 'Baso.',
    'Cluster18' = 'Imm. NK',
    'Cluster19' = 'Mat. NK',
    'Cluster20' = 'Mat. NK',
    'Cluster21' = 'Nai. CD4',
    'Cluster22' = 'Nai. CD4',
    'Cluster23' = 'Nai. Treg',
    'Cluster24' = 'Mem. CD4',
    'Cluster25' = 'Treg',
    'Cluster26' = 'Nai. CD8',
    'Cluster27' = 'Nai. CD8',
    'Cluster28' = 'Nai. CD8',
    'Cluster29' = 'Mem. CD8',
    'Cluster30' = 'Mem. CD8',
    'Cluster31' = 'Gamm. Del. T'
)
hemato$renamed_clusters <- clusters[hemato$Clusters]

## - Visualize data 
hemato <- Seurat::RunUMAP(hemato, reduction = 'lsi', dims = 2:30)
p <- DimPlot(hemato, group.by = 'renamed_clusters') + coord_fixed(ratio = 1)

What can you observe in the UMAP projection of the dataset? Comment on the separation of some cell types into different spatially-resolved clusters.

2. Compute gene activity scores

Here again, the absence of fragments complicates the computation of gene activity scores. Indeed, Signac’s GeneActivity() function does require scATACseq fragment information.

Without actual fragments, can you think of a way to still compute a ‘gene activity’-like score, approximating ~ the sum of scATACseq fragments overlapping a gene?

One way to do this is to:

Extend gene annotations to cover 2000bp upstream of TSS
For each gene, link it to all the peaks overlapping it to create a gene <-> peak relationship table
Create a joint data table containing [ gene | peak | cell | count ]
Group the data table by cell + gene and sum up the peak counts for each pair
Transform this count table into a SparseMatrix
Create the new Seurat assay containing “gene activity”-like scores and normalize it

Step 1:

## - Extend all the genes in hg19 so that it covers 1000 bp upstream of the genebody 
hg_genes_extended <- ensembldb::genes(EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75) %>% 
    filter(gene_biotype == 'protein_coding') %>% 
    keepStandardChromosomes(pruning.mode = 'coarse') %>% 
    filter(seqnames != 'MT') %>% 
    resize(width = width(.) + 2000, fix = 'start') %>% 
    shift(1000 * ifelse(strand(.) == '+', -1, 1)) %>%
    mutate(id = 1:length(.)) %>% 
    mutate(gene_name = make.unique(gene_name))
seqlevelsStyle(hg_genes_extended) <- 'UCSC'

Step 2:

## - Find overlaps between all the scATAC peaks and the extended genes
ov <- findOverlaps(query = granges(hemato), subject = hg_genes_extended)

## - For each peak-gene pair, add peak name and gene name
ov <- as_tibble(ov) %>% 
    dplyr::rename(peaknb = queryHits, genenb = subjectHits) %>% 
    mutate(gene_name = hg_genes_extended$gene_name[genenb]) %>% 
    mutate(row = granges(hemato)$peak[peaknb])

Step 3:

This step requires to transform the original counts SparseMatrix into a tidy data.table, to enable fast left_join between (1) peak counts / cell and (2) gene <-> peak relationship table (ov)

## - Extract counts for the gene-associated peaks
cnts <- GetAssayData(hemato, 'counts')[unique(ov$peaknb), ]

## - Transform the cnts SparseMatrix into a tidy dense long data.frame
tidy_cnts <- broom::tidy(cnts) 

## - Transform the dense long data.frame into a data.table for faster computation
tidy_cnts <- dtplyr::lazy_dt(tidy_cnts)

## - For each peak/cell count value, associate the corresponding gene
joint_data <- left_join(tidy_cnts, ov, by = 'row')
class(joint_data)
## WATCH OUT: You cannot see what is stored into `joint_data` as it 
## is a deferred data.table, i.e. it will only execute once you pipe it to 
## `as_tibble()`. Once you pipe it, the computation starts and takes a lot of time.

Step 4:

## - Group the joint_table by gene + cell and sum up all the counts
gene_averaged_cnts <- group_by(joint_data, column, gene_name) %>% 
    summarize(n = sum(value)) %>% 
    as_tibble()

Step 5:

## - To transform the data.table into SparseMatrix, it needs to be numerical
gene_averaged_cnts_num <- data.matrix(gene_averaged_cnts)

## - Transform into sparse matrix 
gene_averaged_cnts_sparse <- Matrix::sparseMatrix(
    i = gene_averaged_cnts_num[, 'gene_name'],
    j = gene_averaged_cnts_num[, 'column'],
    x = gene_averaged_cnts_num[, 'n'],
)

## - Add back the actual cell and genes names
colnames(gene_averaged_cnts_sparse) <- sort(unique(gene_averaged_cnts$column))
rownames(gene_averaged_cnts_sparse) <- sort(unique(gene_averaged_cnts$gene_name))

Step 6:

## - Create new assay in hemato
hemato[['GENE']] <- CreateAssayObject(counts = gene_averaged_cnts_sparse)

## - Normalize this new assay with SCTransform
hemato <- SCTransform(hemato, assay = 'GENE')

## - Save the resulting object!
saveRDS(hemato, 'data/Homework_3/scATAC/hemato.rds')

Now the Seurat object contains 3 assays, the first one corresponding to the original scATACseq counts, and the second one containing gene activity pseudocounts, and the third one containing SCT-normalized data. One can now perform “gene differential expression”-like analysis using the SCT assay!

## - Reduce dimensionality to visualize cells in 2D
hemato <- RunPCA(hemato) %>% RunUMAP(reduction = 'pca', dims = 1:50)

## - Find gene markers for Basophils
DefaultAssay(hemato) <- "SCT"
p <- FeaturePlot(hemato, features = 'CD14', reduction = "umap") # Or NCR1 (NKp46), CD8A, CD14, MS4A1
Idents(hemato) <- hemato$renamed_clusters
basophils_genes <- Seurat::FindMarkers(
    object = hemato,
    assay = 'GENE', 
    ident.1 = 'Baso.',
    ident.2 = unique(Idents(hemato)[Idents(hemato) != 'Baso.']),
    min.pct = 0.15, 
    only.pos = TRUE
)

3. chromVAR analysis

Get a SummarizedExperiment of scATACseq counts over peaks

chromVAR works with raw counts stored as a RangedSummarizedExperiment. By now, you should be able to easily extract the raw counts from a specific Seurat assay. Store it in a RangedSummarizedExperiment (~ equivalent to Seurat assay but in Bioconductor).

What are the dimensions of the RangedSummarizedExperiment generated? What are the features and what are the columns?

library(SummarizedExperiment)
sumExp <- SummarizedExperiment(
    assays = list('counts' = GetAssayData(hemato, assay = 'ATAC', slot = 'counts')), 
    rowRanges = GetAssayData(hemato, assay = 'ATAC', slot = 'ranges'), 
    colData = hemato[[]]
)
dim(sumExp)

Add GC bias to peaks

This step is important to correct GC bias associated with each peak.

sumExp <- chromVAR::addGCBias(sumExp, genome = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19)

Map motifs over peaks

Using motifmatchr package, one can map a list of motifs (e.g. from JASPAR database) over a RangedSummarizedExperiment (or anything which can be coerced to a GRanges, actually).

First, you need to import motifs from a public database in R. This can be done (among other ways) with the TFBSTools pacakge.

## - Import motifs from JASPAR database
motifs <- TFBSTools::getMatrixSet(
    JASPAR2020::JASPAR2020,
    list(species = 9606, all_versions = FALSE)
)
names(motifs) <- TFBSTools::name(motifs)

Now map the subset of motifs of interest to the filtered peaks, using motifmatchr::matchMotifs().

Read matchMotifs() documentation and run it on your sumExp. What is the class of the generated object? Its dimensions? What are the features and what are the columns?

library(BiocParallel)
register(MulticoreParam(workers = 16, progressbar = TRUE))
motif_mappings <- motifmatchr::matchMotifs(motifs, sumExp, genome = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19)
class(motif_mappingss)
dim(motif_mappings)

Because there are so many motifs and so many peaks, and in interest of time, we will perform chromVAR analysis on a subset of motifs only. You can manually filter the motifs list to motifs that may be relevant in our context (e.g. EOMES, or GATA2, …).

## - Filter to interesting motifs to keep analysis relatively quick
#       Feel free to select any TF of interest!
motif_mappings_sub <- motif_mappings[, c('EOMES', 'GATA2', 'IRF8', 'EBF1', 'SPI1', 'TCF3', 'CEBPA')]

Search for motifs with high deviation of mapping compared to background

chromVAR’s computeDeviations() function combines (1) the peak counts / cell (stored here in sumExp) and (2) the TF mapping over each peak (stored here in motif_mappings). to assess the mapping deviation for each TF over each cell compared to other cells.

Run computeDeviations() on the set of peaks and motifs. What is the class of the generated object? Its dimensions? What are the features and what are the columns?

## - Find background signal
bg <- chromVAR::getBackgroundPeaks(object = sumExp)
## - For each motif, compute its mapping deviation over the filtered peaks
motif_deviations <- chromVAR::computeDeviations(object = sumExp, annotations = motif_mappings_sub, background_peaks = bg)

Check TF motif enrichment in different cell types

The motif_deviations object can be added as a new assay to the hemato Seurat object. This way, one can rely on Seurat-based plotting functions to plot cells in their preferred dimensional space, and color them using motif deviation scores computed with chromVAR.

hemato[['MOTIF']] <- Seurat::CreateAssayObject(counts = chromVAR::deviationScores(motif_deviations))
DefaultAssay(hemato) <- 'MOTIF'
list_p <- lapply(rownames(hemato), function(motif) {
    FeaturePlot(hemato, features = motif, reduction = "umap") + 
        scale_colour_gradientn(
            colors = c('#190886', '#6F07F8', '#F954A5', '#FF9D66', '#edf118'),
            limits = c(-5, 5), oob = scales::squish
        ) + 
        theme(aspect.ratio = 1)
})
list_p[[length(list_p) + 1]] <- DimPlot(hemato, group.by = 'renamed_clusters', reduction = "umap") + theme(aspect.ratio = 1)
p <- cowplot::plot_grid(plotlist = list_p)

Compare gene activity and TF activity

Plot side-by-side:

A UMAP projection showing where cells along the B lineage are located
A UMAP projection showing EBF1 gene activity per cell
A UMAP projection showing EBF1 motif enrichment (i.e. TF activity) per cell

labels <- c(
    'HSC', 
    'MPP',
    'LMPP', 
    'CLP', 
    'Pro-B', 
    'Pre-B', 
    'Nai. B', 
    'Mem. B', 
    'Plasma'
)
hemato$b_cells_annots <- ifelse(hemato$renamed_clusters %in% labels, hemato$renamed_clusters, NA)
p1 <- DimPlot(hemato, group.by = 'b_cells_annots', reduction = "umap") + theme(aspect.ratio = 1) + scale_colour_discrete(na.value = '#aaaaaa')
DefaultAssay(hemato) <- 'SCT'
p2 <- FeaturePlot(hemato, features = 'EBF1', reduction = "umap") + 
        scale_colour_gradientn(
            colors = c('#dadad8', '#f1eca3', '#e0c067', '#db8035', '#a3160c'),
            limits = c(0, 3), oob = scales::squish
        ) + 
        theme(aspect.ratio = 1)
DefaultAssay(hemato) <- 'MOTIF'
p3 <- FeaturePlot(hemato, features = 'EBF1', reduction = "umap") + 
        scale_colour_gradientn(
            colors = c('#190886', '#6F07F8', '#F954A5', '#FF9D66', '#edf118'),
            limits = c(-5, 5), oob = scales::squish
        ) + 
        theme(aspect.ratio = 1)
p <- cowplot::plot_grid(p1, p2, p3, ncol = 3)