Homework 2

Goals

Import counts from mouse E18 brain scATACseq (5K) (provided by 10X Genomics) in R and process them

Processed data for mouse E18 brain scATACseq can be obtained directly from 10X Genomics

Transfer annotations from mouse E18 brain scRNAseq (5K) to mouse E18 brain scATACseq

Processed data for mouse E18 brain scRNAseq can be obtained directly from 10X Genomics

Find genes associated to DA peaks and run GO over-enrichment / gene set enrichment analysis

1. Process mouse E18 brain 5K scATACseq data

Download data

scATACseq for mouse E18 5K brain can be downloaded from 10X Genomics.

mkdir -p data/Homework_2/scATAC
curl https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_v1_E18_brain_fresh_5k/atac_v1_E18_brain_fresh_5k_analysis.tar.gz -o data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_analysis.tar.gz
curl https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_v1_E18_brain_fresh_5k/atac_v1_E18_brain_fresh_5k_filtered_peak_bc_matrix.h5 -o data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_filtered_peak_bc_matrix.h5
curl https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_v1_E18_brain_fresh_5k/atac_v1_E18_brain_fresh_5k_filtered_tf_bc_matrix.h5 -o data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_filtered_tf_bc_matrix.h5
curl https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_v1_E18_brain_fresh_5k/atac_v1_E18_brain_fresh_5k_fragments.tsv.gz -o data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_fragments.tsv.gz
curl https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_v1_E18_brain_fresh_5k/atac_v1_E18_brain_fresh_5k_fragments.tsv.gz.tbi -o data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_fragments.tsv.gz.tbi
curl https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_v1_E18_brain_fresh_5k/atac_v1_E18_brain_fresh_5k_peak_annotation.tsv -o data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_peak_annotation.tsv
curl https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_v1_E18_brain_fresh_5k/atac_v1_E18_brain_fresh_5k_peaks.bed -o data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_peaks.bed
curl https://cg.10xgenomics.com/samples/cell-atac/1.2.0/atac_v1_E18_brain_fresh_5k/atac_v1_E18_brain_fresh_5k_possorted_bam.bam -o data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_possorted_bam.bam
curl https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_v1_E18_brain_fresh_5k/atac_v1_E18_brain_fresh_5k_possorted_bam.bam.bai -o data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_possorted_bam.bam.bai
curl https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_v1_E18_brain_fresh_5k/atac_v1_E18_brain_fresh_5k_singlecell.csv -o data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_singlecell.csv
curl https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_v1_E18_brain_fresh_5k/atac_v1_E18_brain_fresh_5k_web_summary.html -o data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_web_summary.html
curl https://cf.10xgenomics.com/samples/cell-atac/1.2.0/atac_v1_E18_brain_fresh_5k/atac_v1_E18_brain_fresh_5k_cloupe.cloupe -o data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_cloupe.cloupe

Import data in R

First, the count matrix has to be imported. Use Seurat to do import the filtered peak count matrix. How many peaks were found?

library(Signac)
library(Seurat)
library(tidyverse)
library(BiocParallel)
library(plyranges)

## -- Read brain ATAC-seq counts
brain_assay <- Signac::CreateChromatinAssay(
    counts = Seurat::Read10X_h5("data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_filtered_peak_bc_matrix.h5"),
    fragments = 'data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_fragments.tsv.gz',
    sep = c(":", "-"),
    genome = "mm10",
    min.cells = 1
)
brain_assay
dim(brain_assay)

Cell metadata also needs to be read. Metadata is stored in a regular csv file. You can import it in R using read.csv() from base R, but vroom package wil take care of finding the best and fastest way to read it. It is especially useful for reading large rectangular data in R.

How many cell barcode are there in total, in the *_singlecell.csv file? Should we take this into consideration and filter some out?

## -- Read metadata
metadata <- vroom::vroom(
    file = "data/Homework_2/scATAC/atac_v1_E18_brain_fresh_5k_singlecell.csv",
    col_names = TRUE
) %>% 
    filter(barcode != 'NO_BARCODE') %>% 
    mutate(cell = barcode) %>%
    filter(cell %in% colnames(brain_assay)) %>%
    column_to_rownames('barcode') %>% 
    as.data.frame()

Now that cells and counts over peaks / cell are created, one can bind them together into a Seurat object. Try and create a standard Seurat object containing the chromatinAssay object and corresponding cell metadata.

## -- Create Seurat object
brain <- Seurat::CreateSeuratObject(
    counts = brain_assay,
    assay = 'peaks',
    project = 'ATAC',
    meta.data = metadata
)

## -- Add genome annotations
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79::EnsDb.Mmusculus.v79)
seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- "mm10"
Annotation(brain) <- annotations

QC data

Nucleosome signal

Few QC controls can be done. The nucleosome signal metric estimates the ratio of nucleosome-spanning fragments vs mononucleosomal fragments, for each cell. A low ratio (e.g. < 4) indicates enrichment of fragments over an open chromatin locus.

## -- Nucleosome signal 
brain <- Signac::NucleosomeSignal(object = brain)
brain$nucleosome_group <- ifelse(brain$nucleosome_signal > 4, 'NS > 4', 'NS < 4')
range(brain$nucleosome_signal)
table(brain$nucleosome_group)

Check the number of fragments / cell overlapping a known DHS locus (check brain pre-computed metrics to find the corresponding metric), for cells with varying nucleosome_signal scores.

p<- tibble(dnase = brain$DNase_sensitive_region_fragments, nuc_signal = brain$nucleosome_signal) %>% 
    mutate(
        nuc_group = round(nuc_signal / 5)*5 + 1, 
        nuc_group = factor(nuc_group, seq(1, max(nuc_group)))
    ) %>% 
    ggplot(aes(x = nuc_group, y = dnase)) + 
    geom_boxplot() + 
    labs(x = 'Av. nucleosome signal')

Fragment size distribution

We have covered this already in previous exercises focusing on bulk ATAC. It is a metric that one can also evaluate in scATACseq. For this, one can import the fragments (from *_fragments.tsv.gz file) in R. Here again, we highly recommend the use of vroom package, as the fragment file can be several Gb big!

Try plotting the fragment width for the first 10M fragments.

## -- Frag. size distribution
GetFragmentData(Fragments(brain)[[1]])
frags <- vroom::vroom(
        GetFragmentData(Fragments(brain)[[1]]), 
        col_names = FALSE, 
        n_max = 10000000
    ) %>% 
    setNames(c('seqnames', 'start', 'end', 'cell', 'cluster'))
p <-frags %>% 
    mutate(width = end - start) %>%
    ggplot(aes(x = width)) + 
    geom_bar() + 
    xlim(c(0, 800))

TSSES

This metric has also already been previously mentionned. It essentially represents the signal-to-noise ratio at TSSs. It can be estimated from a Seurat object using the Signac::TSSEnrichment() function.

## -- TSS enrichment
brain <- TSSEnrichment(brain, fast = TRUE)
quantile(brain$TSS.enrichment, probs = seq(0, 1, 0.1))

We can also plot the average coverage at TSSs. Let’s plot it for the first 10M fragments

p<- as_granges(frags) %>% 
    resize(1, fix = 'start') %>% 
    coverage() %>%
    '['(target) %>% 
    data.matrix() %>%
    as_tibble() %>% 
    setNames(seq(-999.5, 999.5, length.out = 2000)) %>%
    mutate(peak = seq(1, nrow(.)), strand = as.character(strand(target))) %>% 
    pivot_longer(-c(peak, strand), names_to = 'distance', values_to = 'cnt') %>% 
    mutate(distance = as.numeric(distance)) %>% 
    mutate(distance = ifelse(strand == '-', -distance, distance)) %>%
    select(-peak, -strand) %>%
    group_by(distance) %>%
    tally(cnt) %>% 
    ggplot(aes(x = distance, y = n)) + 
    geom_line() +
    geom_smooth(method = 'loess', span = 0.05) + 
    theme_minimal() + 
    theme(legend.position = 'none') + 
    labs(title = 'Fragment "cut" sites', x = 'Distance from TSS', y = '# of cut sites')

Subsetting data

Check the distribution of values for important metrics related to scATAC quality. Notably, the # and % of fragments overlapping identified peaks or blacklist regions.

quantile(brain$peak_region_fragments, probs = seq(0, 1, 0.1))
quantile(brain$blacklist_region_fragments, probs = seq(0, 1, 0.1))
brain$FRiP <- brain$peak_region_fragments / brain$passed_filters * 100
quantile(brain$FRiP, probs = seq(0, 1, 0.1))
brain$FRiBl <- brain$blacklist_region_fragments / brain$peak_region_fragments
quantile(brain$FRiBl, probs = seq(0, 1, 0.1))

Set sensible thresholds to remove cells which do not seem to be high-quality, then subset the brain Seurat object using these thresholds.

brain <- subset(
    brain,
    peak_region_fragments > 2000 & # Keep cells with high number of fragments in peaks
    peak_region_fragments < 100000 & # Remove cells with number of fragments in peaks too high
    FRiP > 50 & # Keep cells with high FRiP 
    FRiBl < 0.05 & # Keep cells with low FRiBl
    nucleosome_signal < 4 & # Remove cells with high nucleosome signal
    TSS.enrichment > 2 # Keep cells with good TSSES
)

Don’t forget to also subset some peaks (which may have been identified from the low-quality cells).

brain <- brain[
    rowSums(GetAssayData(brain, slot = "counts") > 0) > 10, # Remove peaks detected in less than 10 cells
]

Latent semantic indexing: normalizing and reduce dimensionality data with ‘TF-IDF’ and ‘SVD’

Normalization is done using a TF-IDF approach, rather than traditional log-normalization, due to excessive sparsity of the scATAC data.

GetAssayData(brain, 'counts')[1:10, 1:10]
GetAssayData(brain, 'data')[1:10, 1:10]
brain <- Signac::RunTFIDF(brain)
GetAssayData(brain, 'counts')[1:10, 1:10]
GetAssayData(brain, 'data')[1:10, 1:10]

Dimensionality reduction can then be performed using the normalized data. By using a singular value decomposition approach, one effectively recapitulates the latent semantic indexing approach.

brain <- Signac::FindTopFeatures(brain, min.cutoff = 'q50') # Find variable features
brain <- Signac::RunSVD(object = brain)

Clustering

Now that cells are embedded in a lower dimensional space, they can be graph-clustered just like “regular” cells from scRNAseq.

Try to perform graph-based clustering on the Seurat object, using the lsi embedded space (skipping the first dimension).

brain <- FindNeighbors(object = brain, reduction = 'lsi', dims = 2:30)
brain <- FindClusters(object = brain, algorithm = 3, resolution = 1.2, verbose = FALSE)

Why was the first dimension skipped here? It appears that in LSI, the first dimension is frequently (anti-)correlated with depth, so we exclude it in further analysis.

Visualization

Now that the data has been normalized and dimensionality is linearly reduced (through SVD), we can attempt to visualize the cluster locations in 2D.

Try computing tSNE and UMAP embeddings of the normalized LSI data, using Seurat-based functions.

brain <- RunTSNE(brain, reduction = 'lsi', dims = 2:30)
brain <- RunUMAP(brain, reduction = 'lsi', dims = 2:30)
p1 <- Seurat::DimPlot(brain, reduction = 'tsne')

One can also leverage Signac-based functions for visualization. Notably, the CoveragePlot() function is quite useful: it can plot in R the coverage signal over a genomic region, aggregated per cell cluster. Try and plot the aggregated signal over individual markers known to be specific to certain brain cell types.

p2 <- CoveragePlot(
    object = brain,
    region = "Tjp1", # Or Gad2 (interneurons), Cq1a (Microglia), Gfap (astrocytes), Tubb3 (neurons)
    annotation = TRUE,
    peaks = FALSE
)

2. Integrate scATACseq and scRNAseq

Download data

scRNAseq for mouse E18 5K brain can be downloaded from 10X Genomics.

mkdir data/Homework_2/scRNA
curl https://cf.10xgenomics.com/samples/cell-exp/6.0.0/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K_web_summary.html -o data/Homework_2/scRNA/SC3_v3_Neurons_5K_web_summary.html
curl https://cf.10xgenomics.com/samples/cell-exp/6.0.0/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K_metrics_summary.csv -o data/Homework_2/scRNA/SC3_v3_Neurons_5K_metrics_summary.csv
curl https://cf.10xgenomics.com/samples/cell-exp/6.0.0/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K_count_cloupe.cloupe -o data/Homework_2/scRNA/SC3_v3_Neurons_5K_count_cloupe.cloupe
curl https://cf.10xgenomics.com/samples/cell-exp/6.0.0/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K_count_sample_alignments.bam -o data/Homework_2/scRNA/SC3_v3_Neurons_5K_count_sample_alignments.bam
curl https://cf.10xgenomics.com/samples/cell-exp/6.0.0/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K_count_sample_alignments.bam.bai -o data/Homework_2/scRNA/SC3_v3_Neurons_5K_count_sample_alignments.bam.bai
curl https://cf.10xgenomics.com/samples/cell-exp/6.0.0/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K_count_sample_barcodes.csv -o data/Homework_2/scRNA/SC3_v3_Neurons_5K_count_sample_barcodes.csv
curl https://cf.10xgenomics.com/samples/cell-exp/6.0.0/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K_count_sample_feature_bc_matrix.h5 -o data/Homework_2/scRNA/SC3_v3_Neurons_5K_count_sample_feature_bc_matrix.h5
curl https://cf.10xgenomics.com/samples/cell-exp/6.0.0/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K_count_sample_feature_bc_matrix.tar.gz -o data/Homework_2/scRNA/SC3_v3_Neurons_5K_count_sample_feature_bc_matrix.tar.gz
curl https://cf.10xgenomics.com/samples/cell-exp/6.0.0/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K_count_sample_molecule_info.h5 -o data/Homework_2/scRNA/SC3_v3_Neurons_5K_count_sample_molecule_info.h5
curl https://cf.10xgenomics.com/samples/cell-exp/6.0.0/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K/SC3_v3_NextGem_DI_Neurons_5K_SC3_v3_NextGem_DI_Neurons_5K_count_analysis.tar.gz -o data/Homework_2/scRNA/SC3_v3_Neurons_5K_count_analysis.tar.gz

Process data

Process the scRNAseq data with Seurat using a pipeline approach to limit the generation of intermediate objects. Why is such approach sometimes useful in R?

library(tidyverse)
library(Seurat)
brain_rna <- Read10X_h5("data/Homework_2/scRNA/SC3_v3_Neurons_5K_count_sample_feature_bc_matrix.h5") %>% 
    CreateSeuratObject() %>% 
    SCTransform() %>% 
    RunPCA() %>% 
    FindNeighbors() %>% 
    FindClusters() %>% 
    RunTSNE(reduction = 'pca', dims = 1:50) %>% 
    RunUMAP(reduction = 'pca', dims = 1:50)
p <- DimPlot(brain_rna, reduction = "umap")

Annotate scRNAseq data with public databases

Leverage the scRNAseq package to recover pre-processed and annotated scRNAseq data from Zeisel et al., 2018. We will use this to annotate cell identity of the mouse E18 5K brain scRNAseq unlabelled data from 10X Genomics.

Which class is the newly created object? How different from a Seurat object this one is?

ref <- scRNAseq::ZeiselBrainData()
class(ref)

Don’t forget to normalize the new counts in R! This can be achieved using the scuttle package. One could also coerce the ref object as a Seurat object to apply SCT normalization!

ref <- scuttle::logNormCounts(ref)

With the SingleR package, we can annotate cells using (1) the normalized count SparseMatrix from the mouse E18 5K brain scRNAseq and (2) the SingleCellExperiment ref object and its annotations. Check SingleR documentation to see how this can be done.

Are the scRNAseq new annotations corroborating with the clusters found in the scRNAseq data?

ref$level1class[ref$level1class %in% c('pyramidal CA1', 'pyramidal SS')] <- 'pyramidal'
pred <- SingleR::SingleR(
    test = GetAssayData(brain_rna, 'data'), 
    ref = ref, 
    labels = ref$level1class, 
    BPPARAM = BiocParallel::MulticoreParam(workers = 12)
)
brain_rna$annotation <- pred$pruned.labels
p1 <- DimPlot(brain_rna, reduction = "umap", group.by = 'annotation')
p2 <- DimPlot(brain_rna, reduction = "umap", group.by = 'annotation')

Transfer annotations from scRNAseq to scATACseq datasets

A commonly used trick to transfer annotation from scRNAseq to scATACseq is to process the scATACseq data so that it recapitulates “gene activity”. With Signac, this can be done with the GeneActivity() function. Read its documentation to understand what it does!

Using this function, we can create a second assay, containing “gene activity” metrics for eeach gene in each cell. Now, the features are not accessible peaks anymore, but “gene activity”.

## - Compute an "activity score" for each gene in each cell (based on ATAC counts) and store it as a new assay
gene_activities <- GeneActivity(brain, features = VariableFeatures(brain_rna))
brain@assays["gene_activity"] <- CreateAssayObject(counts = gene_activities)

Don’t forget that these “gene activity” scores are obtained from raw scATACseq counts. One needs to normalize them! In scRNAseq processed in Seurat, the SCT method is typically used to normalize data.

Once you have ran SCTransform(), check the number of assays for the brain object. How many are they? What does each assay contain exactly? Which one is the default one now?

## - Normalize this new assay with SCTransform
GetAssayData(brain, 'counts')[1:10, 1:10]
brain <- SCTransform(brain, assay = 'gene_activity')

Using Seurat-based FindTransferAnchors() and TransferData() functions, find anchors between scRNAseq and scATACseq data, then transfer the annotations from scRNAseq to scATACseq.

## - Find anchors between scRNAseq and "gene activity" from scATACseq. 
anchors <- FindTransferAnchors(
    reference = brain_rna, 
    query = brain, 
    features = VariableFeatures(object = brain_rna),
    reference.assay = "SCT", 
    query.assay = "SCT", 
    reduction = "cca"
)

## - Transfer annotations from scRNA to scATAC
pred_labels <- TransferData(
    anchorset = anchors, 
    refdata = brain_rna$annotation,
    weight.reduction = brain[['lsi']], 
    dims = 2:30
)

## - Save transferred annotations
brain$transferred_annotation <- pred_labels$predicted.id
brain$transferred_annotation[pred_labels$prediction.score.max < 0.8] <- NA

Check the transferred annotations: do they overlap what you would have thought from earlier exploration of known markers?

p <- Seurat::DimPlot(brain, reduction = 'tsne', group.by = 'transferred_annotation')

BONUS QUESTION: have we already used a non-Seurat approach to transfer annotations from one dataset to another? Try and apply it to transfer annotations from scRNAseq to scATACseq data. Does this work?

3. Functional analysis of DA peaks

Extract peaks enriched in microglia vs. others

Seurat implements differential abundance statistical tests for stored assays. Try to use FindMarkers() to find peaks that are over-accessible in microglia cells compared to the rest of the cells.

## - Find microglia-specific peaks 
Idents(brain) <- brain$transferred_annotation
microglia_peaks <- Seurat::FindMarkers(
    object = brain,
    assay = 'peaks', 
    ident.1 = 'microglia',
    ident.2 = unique(Idents(brain)[Idents(brain) != 'microglia']),
    min.pct = 0.15, 
    only.pos = TRUE
)

Find nearest gene for each microglia-specific peak

A key step in scATACseq analysis is to link each peak to the gene it regulates. The most basic way to do this is to associate a peak to the nearest gene. Although this is a rather crude way of doing, it works reasonably well as a “quick-and-dirty” first step to identify genes which are potentially undergoing differential regulation between cell clusters.

## - Link microglia peaks to nearest genes
mm_genes <- ensembldb::genes(EnsDb.Mmusculus.v79::EnsDb.Mmusculus.v79) %>% filter(gene_biotype == 'protein_coding')
seqlevelsStyle(mm_genes) <- 'UCSC'
microglia_genes <- microglia_peaks %>% 
    filter(p_val_adj <= 0.05, avg_log2FC > 1, pct.1 > 0.2) %>% 
    rownames() %>% 
    str_replace('-', ':') %>% 
    GRanges() %>% 
    join_nearest(mm_genes) %>% 
    as_tibble() %>% pull(symbol) %>% 
    unique()

Perform GO over-representation analysis

The easiest way to perform GO over-representation analysis in R is to use gprofiler2 package. It sends the query to the web-based application, which is well maintained, frequently updated and supports a large number of species.

Check whether there are enriched GO annotations associated with the genes undergoing microglia-specific regulation.

## - Run GO analysis
goa <- gprofiler2::gost(microglia_genes, organism = 'mmusculus')
goa$result %>% as_tibble() %>% filter(p_value <= 0.05, grepl('GO:', source)) %>% arrange(p_value) %>% print(n = 40)