4 Lab 1 - Processing of MNase-seq data

Aims

Fetching an MNase-seq dataset from GEO
Indexing a genome with bowtie2
Map paired-end reads with bowtie2
Generate sequencing-depth normalized track
Generate nucleosomes track
Check the relevance of filtering out duplicates

Datasets

We will process a data from the Henikoff lab, generated in 2011 and published in PNAS.

GSM754391

4.1 Getting data

4.1.1 Downloading reads from internet

We can download the paired-end reads (R1 and R2 fastq files) directly from the internet.

Find the download links associated with the SRR3193263 SRR ID. You can go to SRA-explorer to easily recover links.
Download the two fastq files for the SRR3193263 SRR ID.

Answer

sh

cd ~/EpigenomicsDataAnalysis_2023
mkdir data/
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/003/SRR3193263/SRR3193263_1.fastq.gz -o data/MNase_2.5min_R1.fq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/003/SRR3193263/SRR3193263_2.fastq.gz -o data/MNase_2.5min_R2.fq.gz

4.1.2 FastQC reads

Run fastqc on each fastq file individually.

Answer

sh

mkdir data/fastqc
fastqc \
    --outdir data/fastqc \
    --noextract \
    --threads 12 \
    data/*fq.gz 1>&2

4.2 Pre-process reads

4.2.1 Trimming reads with `trim_galore`

Did you find any adapter contamination in the two original fastq files?
If so, proceed to fastq file trimming with cutadapt. Read its doc to see how to automatically run fastqc after trimming reads.

Answer

sh

trim_galore \
    --cores 8 \
    --length 20 \
    --gzip \
    --paired \
    --fastqc \
    --fastqc_args "--outdir data/fastqc --noextract --threads 12" \
    --output_dir data/trimmed/ \
    data/MNase_2.5min_R1.fq.gz data/MNase_2.5min_R2.fq.gz

4.3 Align reads to a genome reference

4.3.1 Indexing sacCer3 genome

Genome references for model systems can be fetched from iGenomes.

Build S. cerevisiae genome (R64-1-1) bowtie2 index.

Answer

sh

## - Fetching data from iGenomes
mkdir data/genome
curl -L http://igenomes.illumina.com.s3-website-us-east-1.amazonaws.com/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae_Ensembl_R64-1-1.tar.gz -o data/genome/R64-1-1.tar.gz

## - Unpacking data 
tar -C data/genome/ -xzf data/genome/R64-1-1.tar.gz
cp data/genome/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/WholeGenomeFasta/genome.fa data/genome/R64-1-1.fa
cp data/genome/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.gtf data/genome/R64-1-1.gtf

## - Indexing genome
bowtie2-build data/genome/R64-1-1.fa data/genome/R64-1-1

4.3.2 Mapping paired-end trimmed reads

Map paired-end reads with bowtie2.

Answer

sh

## - Mapping fastq files to reference genome
mkdir data/mapping
bowtie2 \
    --threads 16 \
    -x data/genome/R64-1-1 \
    -1 data/trimmed/MNase_2.5min_R1_val_1.fq.gz \
    -2 data/trimmed/MNase_2.5min_R2_val_2.fq.gz \
    > data/mapping/MNase_2.5min_R64-1-1.sam

4.3.3 Filtering mapped fragments

Filter mapped pairs using the following procedure:
- Fixing mates
- Sorting reads
- Removing duplicates
- Filtering pairs:
  - Only keep paired reads (0x001)
  - Only keep reads mapped in proper pair (0x002)
  - No unmapped reads (0x004)
  - No reads with unmapped mate (0x008)
  - Reads mapped with a MAPQ >= 20
- Sorting reads
- Indexing reads

To better understand the combinations of information described by the SAM “flag”, check the Decoding SAM flags page.

Answer

sh

## - Fixing mates
#| -m: "Add mate score tag"
samtools fixmate \
    -@ 16 --output-fmt bam \
    -m \
    data/mapping/MNase_2.5min_R64-1-1.sam data/mapping/MNase_2.5min_R64-1-1.bam

## - Sorting read pairs
samtools sort \
    -@ 16 --output-fmt bam \
    data/mapping/MNase_2.5min_R64-1-1.bam \
    -o data/mapping/MNase_2.5min_R64-1-1_sorted.bam 

## - Removing PCR & optical duplicates
#| -s: "Report stats"
#| -r: "Remove duplicate reads"
samtools markdup \
    -@ 16 --output-fmt bam \
    -s -r \
    data/mapping/MNase_2.5min_R64-1-1_sorted.bam \
    data/mapping/MNase_2.5min_R64-1-1_sorted_noDups.bam 

## - Filter read pairs
#| -f 0x001: "Keep read paired"
#| -f 0x002: "Keep read mapped in proper pair"
#| -F 0x004: "Remove read unmapped"
#| -F 0x008: "Remove mate unmapped"
#| -q 20: "MAPQ >= 20"
#| --fast: "Use fast bam compression"
samtools view \
    -@ 16 --output-fmt bam \
    -f 0x001 -f 0x002 -F 0x004 -F 0x008 -q 20 \
    --fast \
    data/mapping/MNase_2.5min_R64-1-1_sorted_noDups.bam \
    -o data/mapping/MNase_2.5min_R64-1-1_sorted_noDups_filtered.bam

## - Sorting read pairs
#| -l 9: "Use best compression "
samtools sort \
    -@ 16 --output-fmt bam \
    -l 9 \
    data/mapping/MNase_2.5min_R64-1-1_sorted_noDups_filtered.bam \
    -o data/mapping/MNase_2.5min_R64-1-1_sorted_noDups_filtered_sorted.bam

## - Indexing bam file
samtools index -@ 16 data/mapping/MNase_2.5min_R64-1-1_sorted_noDups_filtered_sorted.bam

4.3.4 Create coverage track

Create a sequencing depth-normalized track, filling out fragments.

Answer

sh

## - Generate coverage
mkdir data/tracks
bamCoverage \
    --bam data/mapping/MNase_2.5min_R64-1-1_sorted_noDups_filtered_sorted.bam \
    --outFileName data/tracks/MNase_2.5min_R64-1-1_sorted_noDups_filtered_sorted.CPM.bw \
    --binSize 1 \
    --numberOfProcessors 16 \
    --normalizeUsing CPM \
    --skipNonCoveredRegions \
    --extendReads

4.3.5 Create nucleosome track

Create a nucleosome track by keeping the fragments between 130 and 165bp, and extending them to 40bp aligned at their center.

Answer

sh

bamCoverage \
    --bam data/mapping/MNase_2.5min_R64-1-1_sorted_noDups_filtered_sorted.bam \
    --outFileName data/tracks/MNase_2.5min_R64-1-1_sorted_noDups_filtered_sorted.135-160bp.CPM.bw \
    --binSize 1 \
    --numberOfProcessors 16 \
    --normalizeUsing CPM \
    --skipNonCoveredRegions \
    --extendReads 40 \
    --centerReads \
    --minFragmentLength 130 \
    --maxFragmentLength 165

4.3.6 Create coverage track without removing duplicates

Reprocess the bam file to generate coverage and nucleosome tracks from fragments without removing the read duplicates.
Compare the different tracks generated in IGV. Comment.

Answer

sh

## - Filter read pairs
#| -f 0x001: "Keep read paired"
#| -f 0x002: "Keep read mapped in proper pair"
#| -F 0x004: "Remove read unmapped"
#| -F 0x008: "Remove mate unmapped"
#| -q 10: "MAPQ > 20"
#| --fast: "Use fast bam compression"
samtools view \
    -@ 16 --output-fmt bam \
    -f 0x001 -f 0x002 -F 0x004 -F 0x008 -q 20 \
    --fast \
    data/mapping/MNase_2.5min_R64-1-1_sorted.bam \
    -o data/mapping/MNase_2.5min_R64-1-1_sorted_filtered.bam

## - Sorting read pairs
#| -l 9: "Use best compression "
samtools sort \
    -@ 16 --output-fmt bam \
    -l 9 \
    data/mapping/MNase_2.5min_R64-1-1_sorted_filtered.bam \
    -o data/mapping/MNase_2.5min_R64-1-1_sorted_filtered_sorted.bam

## - Indexing bam file
samtools index -@ 16 data/mapping/MNase_2.5min_R64-1-1_sorted_filtered_sorted.bam

## - Generate coverage
mkdir data/tracks
bamCoverage \
    --bam data/mapping/MNase_2.5min_R64-1-1_sorted_filtered_sorted.bam \
    --outFileName data/tracks/MNase_2.5min_R64-1-1_sorted_filtered_sorted.CPM.bw \
    --binSize 1 \
    --numberOfProcessors 16 \
    --normalizeUsing CPM \
    --skipNonCoveredRegions \
    --extendReads

4.1 Getting data

4.1.1 Downloading reads from internet

4.1.2 FastQC reads

4.2 Pre-process reads

4.2.1 Trimming reads with trim_galore

4.3 Align reads to a genome reference

4.3.1 Indexing sacCer3 genome

4.3.2 Mapping paired-end trimmed reads

4.3.3 Filtering mapped fragments

4.3.4 Create coverage track

4.3.5 Create nucleosome track

4.3.6 Create coverage track without removing duplicates

4.2.1 Trimming reads with `trim_galore`