14 Lab 5: Comparing sets of ChIP-seq peaks

Aims

Import a dozen of ChIP-seq peak sets in R
Check distribution of peaks compared to genomic features
Check peak occurrence over tissue-specific regulatory elements

Datasets

modENCODE/modERN TF ChIP-seq database, available here

14.1 Download peaks from ENCODE database

Check out the modENCODE/modERN TF ChIP-seq database.

Inspect the available datasets, by filtering e.g. using the following criteria:

Assay title: TF ChIP-seq
Organism: C. elegans
Genome assembly: ce11
Project: modENCODE

Peak files for 12 ChIP-seq datasets have been already downloaded from this database.

14.2 Import all peaks in R

In R, list all the bed files available for the peaks.
Import each file as a GRanges object.

Answer

library(rtracklayer)
library(tidyverse)
bedfiles <- list.files('Share/peaks/modENCODE/', pattern = '*bed')
peaks <- map(bedfiles, ~ import(file.path('Share/peaks/modENCODE/', .x)))
names(peaks) <- str_replace(bedfiles, '.bed', '')

14.3 Define genomic features in `ce11`

Genomic features can be easily annotated from a set of gene features. ChIPseeker facilitates the annotation of ChIP-seq peaks using gene annotations directly provided in R by TxDb gene annotation packages, e.g. TxDb.Celegans.UCSC.ce11.ensGene

Install the TxDb.Celegans.UCSC.ce11.ensGene package in R
Install ChIPseeker

Answer

library(ChIPseeker)
library(TxDb.Celegans.UCSC.ce11.ensGene)

14.4 ChIP-seq peak overlaps with genomic features

Use ChIPseeker to annotate ChIP-seq peaks for a single ChIP-seq experiment.
Extract

Answer

ceh48 <- peaks[[1]]
annots_ceh48 <- annotatePeak(ceh48, TxDb = TxDb.Celegans.UCSC.ce11.ensGene)
getAnnoStat(annots_ceh48)

Iterate over all the peak sets to compile their annotations across C. elegans genome. To iterate over each set of peaks in the peaks list and return an aggregate data.frame, use the imap_dfr.

Answer

annots <- imap_dfr(peaks, ~ annotatePeak(.x, TxDb = TxDb.Celegans.UCSC.ce11.ensGene) |> 
    getAnnoStat() |> 
    mutate(TF = .y)
)
annots

Generate a barplot to represent the % of peaks in each type of genomic features, for the 12 TFs.

Answer

library(ggplot2)
p <- ggplot(annots, aes(x = TF, y = Frequency, fill = Feature)) + 
    geom_col() + 
    guides(x =  guide_axis(angle = 90))

Comment on the distribution of the 12 sets of peaks over genomic features.

14.5 Peaks occurrence over tissue-specific regulatory elements

Import regulatory elements annotated in C. elegans in R, as seen in previous Lab.

Answer

library(readxl)
download.file('https://genome.cshlp.org/content/suppl/2020/11/16/gr.265934.120.DC1/Supplemental_Table_S2.xlsx', 'WBcel235_REs.xlsx')
REs <- read_xlsx('WBcel235_REs.xlsx', skip = 2, col_names = TRUE)
REs <- makeGRangesFromDataFrame(
    REs, seqnames.field = 'chrom_ce11', 
    start.field = 'start_ce11', end.field = 'end_ce11',
    keep.extra.columns = TRUE
)
REs

Check for overlap of each regulatory element with xnd-1 peaks.
Check the overlap of germline-specific regulatory element with xnd-1 peaks.
Check the enrichment of germline-specific regulatory element over xnd-1 peaks.

Answer

table(REs %over% peaks[['xnd-1']])
table(REs %over% peaks[['xnd-1']], REs$Annotation)
table(REs %over% peaks[['xnd-1']], REs$Annotation == 'Germline') 
table(REs %over% peaks[['xnd-1']], REs$Annotation == 'Germline') |> fisher.test()

For each tissue (Germline, Neurons, Muscle, Hypod. and Intest.), check whether the tissue-specific REs are enriched in xnd-1 peak.

Tip

You can iterate over each tissue and return an aggregated data.frame using map_dfr.
For each iteration, you can transform the result of fisher.test() into a tibble with the glance function from broom package.

Answer

tissues <- c("Germline", "Neurons", "Muscle", "Hypod.", "Intest.")
df <- map_dfr(tissues, function(tissue) {
    fisher.test(
        REs %over% peaks[['xnd-1']], 
        REs$Annotation == tissue
    ) |> 
        broom::glance() |> 
        mutate(tissue = tissue, TF = 'xnd-1') |> 
        dplyr::select(tissue, TF, estimate, p.value)
})
df

Perform the same operation by iterating over each peak set in the list of imported peaks.

Tip

You can use two nested map_dfr, iterating first over each TF then over each tissue.

Answer

TFs <- names(peaks)
tissues <- c("Germline", "Neurons", "Muscle", "Hypod.", "Intest.")
df <- map_dfr(TFs, function(TF) {
    map_dfr(tissues, function(tissue) {
        fisher.test(
            REs %over% peaks[[TF]], 
            REs$Annotation == tissue
        ) |> 
            broom::glance() |> 
            mutate(tissue = tissue, TF = TF) |> 
            dplyr::select(tissue, TF, estimate, p.value)
    })
})
df

For each TF, filter the tissues in which it is preferentially enriched (odds ratio >= 2, p.value <= 0.05)
Find TFs enriched over Intestine REs

Answer

filter(df, estimate >= 2, p.value <= 0.05)
filter(df, estimate >= 2, p.value <= 0.05, tissue == "Intest.")

Check the STRING DB website to assess whether these TFs have been shown to interact together. Comment.

14.1 Download peaks from ENCODE database

14.2 Import all peaks in R

14.3 Define genomic features in ce11

14.4 ChIP-seq peak overlaps with genomic features

14.5 Peaks occurrence over tissue-specific regulatory elements

14.3 Define genomic features in `ce11`