13 Lab 3 - ChIP-seq downstream analysis

Aims

Find motifs enriched in a set of ChIP-seq peaks
Import a dozen of ChIP-seq peak sets in R
Check distribution of peaks comapred to genomic features
Check peak occurrence over tissue-specific regulatory elements

Datasets

modENCODE/modERN TF ChIP-seq database, available here

13.1 Download peaks from ENCODE database

Check out the modENCODE/modERN TF ChIP-seq database.

Inspect the available datasets, by filtering e.g. using the following criteria:

Assay title: TF ChIP-seq
Organism: C. elegans
Genome assembly: ce11
Project: modENCODE

Peak files for 12 ChIP-seq datasets have been already downloaded from this database.

13.2 Find motifs enriched in `xnd-1` ChIP-seq

Check the meme website to identify which tool is best suited to identify a motif de novo in a set of peaks from a ChIP-seq experiment.
What do you need to run xstreme on a set of peaks?

13.2.1 Preparing `meme` input

Import xnd-1 peaks in R as a GRanges object.
Recover ce11 genome sequence using the BSgenome.Celegans.UCSC.ce11 package.
Extract sequence over xnd-1 peaks with the Biostrings package.
Export the sequences as a fasta file.

Answer

library(rtracklayer)
library(Biostrings)
library(BSgenome.Celegans.UCSC.ce11)
xnd1 <- import('data/peaks/modENCODE/xnd-1.bed')
genome <- BSgenome.Celegans.UCSC.ce11
xnd1_seq <- getSeq(genome, xnd1)
names(xnd1_seq) <- as.character(xnd1)
writeXStringSet(xnd1_seq, 'data/peaks/modENCODE/xnd-1.fa')

13.2.2 Running `meme`

Identify motifs enriched in xnd-1 peaks with xstreme using the following options:
- Zero or one occurence per sequence at most for meme
- At most 3 motifs
- A min motif width of 6 for meme
- A max motif width of 15 for meme
- No streme motifs
- Over multiple processors in parallel

Answer

sh

xstreme \
    --p data/peaks/modENCODE/xnd-1.fa \
    --o data/meme/ \
    -minw 12 \
    -maxw 16 \
    --meme-nmotifs 3 \
    --meme-p 12 \
    --meme-mod zoops \
    --streme-nmotifs 0

Check the results. Do they make sense in the light of recent publications? (see here)

13.3 Compare all peaks to genomic features

13.3.1 Import all peaks in R

In R, list all the bed files available for the peaks.
Import each file as a GRanges object.

Answer

library(rtracklayer)
library(tidyverse)
bedfiles <- list.files('data/peaks/modENCODE/', pattern = '*bed')
peaks <- map(bedfiles, ~ import(file.path('data/peaks/modENCODE/', .x)))
names(peaks) <- str_replace(bedfiles, '.bed', '')

13.3.2 Define genomic features in `ce11`

Genomic features can be easily annotated from a set of gene features. ChIPseeker facilitates the annotation of ChIP-seq peaks using gene annotations directly provided in R by TxDb gene annotation packages, e.g. TxDb.Celegans.UCSC.ce11.ensGene

Install the TxDb.Celegans.UCSC.ce11.ensGene package in R
Install ChIPseeker

Answer

BiocManager::install('TxDb.Celegans.UCSC.ce11.ensGene')
BiocManager::install('ChIPseeker')
library(ChIPseeker)
library(TxDb.Celegans.UCSC.ce11.ensGene)

13.3.3 ChIP-seq peak overlaps with genomic features

Use ChIPseeker to annotate ChIP-seq peaks for a single ChIP-seq experiment.
Extract

Answer

ceh48 <- peaks[[1]]
annots_ceh48 <- annotatePeak(ceh48, TxDb = TxDb.Celegans.UCSC.ce11.ensGene)
getAnnoStat(annots_ceh48)

Iterate over all the peak sets to compile their annotations across C. elegans genome. To iterate over each set of peaks in the peaks list and return an aggregate data.frame, use the imap_dfr.

Answer

annots <- imap_dfr(peaks, ~ annotatePeak(.x, TxDb = TxDb.Celegans.UCSC.ce11.ensGene) |> 
    getAnnoStat() |> 
    mutate(TF = .y)
)
annots

Generate a barplot to represent the % of peaks in each type of genomic features, for the 12 TFs.

Answer

ggplot(annots, aes(x = TF, y = Frequency, fill = Feature)) + 
    geom_col() + 
    guides(x =  guide_axis(angle = 90))

Comment on the distribution of the 12 sets of peaks over genomic features.

13.3.4 Peaks occurrence over tissue-specific regulatory elements

Import regulatory elements annotated in C. elegans in R, as seen in previous Lab.

Answer

library(readxl)
download.file('https://genome.cshlp.org/content/suppl/2020/11/16/gr.265934.120.DC1/Supplemental_Table_S2.xlsx', 'data/WBcel235_REs.xlsx')
REs <- read_xlsx('data/WBcel235_REs.xlsx', skip = 2, col_names = TRUE)
REs <- makeGRangesFromDataFrame(
    REs, seqnames.field = 'chrom_ce11', 
    start.field = 'start_ce11', end.field = 'end_ce11',
    keep.extra.columns = TRUE
)
REs

Check for overlap of each regulatory element with xnd-1 peaks.
Check the overlap of germline-specific regulatory element with xnd-1 peaks.
Check the enrichment of germline-specific regulatory element over xnd-1 peaks.

Answer

table(REs %over% peaks[['xnd-1']])
table(REs %over% peaks[['xnd-1']], REs$Annotation)
table(REs %over% peaks[['xnd-1']], REs$Annotation == 'Germline') 
table(REs %over% peaks[['xnd-1']], REs$Annotation == 'Germline') |> fisher.test()

For each tissue (Germline, Neurons, Muscle, Hypod. and Intest.), check whether the tissue-specific REs are enriched in xnd-1 peak.

Tip

You can iterate over each tissue and return an aggregated data.frame using map_dfr.
For each iteration, you can transform the result of fisher.test() into a tibble with the glance function from broom package.

Answer

tissues <- c("Germline", "Neurons", "Muscle", "Hypod.", "Intest.")
df <- map_dfr(tissues, function(tissue) {
    fisher.test(
        REs %over% peaks[['xnd-1']], 
        REs$Annotation == tissue
    ) |> 
        broom::glance() |> 
        mutate(tissue = tissue, TF = 'xnd-1') |> 
        dplyr::select(tissue, TF, estimate, p.value)
})

Perform the same operation by iterating over each peak set in the list of imported peaks.

Tip

You can use two nested map_dfr, iterating first over each TF then over each tissue.

Answer

TFs <- names(peaks)
tissues <- c("Germline", "Neurons", "Muscle", "Hypod.", "Intest.")
df <- map_dfr(TFs, function(TF) {
    map_dfr(tissues, function(tissue) {
        fisher.test(
            REs %over% peaks[[TF]], 
            REs$Annotation == tissue
        ) |> 
            broom::glance() |> 
            mutate(tissue = tissue, TF = TF) |> 
            dplyr::select(tissue, TF, estimate, p.value)
    })
})

For each TF, filter the tissues in which it is preferentially enriched (odds ratio >= 2, p.value <= 0.05)
Find TFs enriched over Intestine REs

Answer

filter(df, estimate >= 2, p.value <= 0.05)
filter(df, estimate >= 2, p.value <= 0.05, tissue == "Intest.")

Check the STRING DB website to assess whether these TFs have been shown to interact together. Comment.

13.1 Download peaks from ENCODE database

13.2 Find motifs enriched in xnd-1 ChIP-seq

13.2.1 Preparing meme input

13.2.2 Running meme

13.3 Compare all peaks to genomic features

13.3.1 Import all peaks in R

13.3.2 Define genomic features in ce11

13.3.3 ChIP-seq peak overlaps with genomic features

13.3.4 Peaks occurrence over tissue-specific regulatory elements

13.2 Find motifs enriched in `xnd-1` ChIP-seq

13.2.1 Preparing `meme` input

13.2.2 Running `meme`

13.3.2 Define genomic features in `ce11`