3  Demo 1 - Processing of MNase-seq data

Aims
  • Fetching an MNase-seq dataset from GEO
  • Indexing a genome with bowtie2
  • Map paired-end reads with bowtie2
  • Generate sequencing-depth normalized track
  • Generate nucleosomes track
Datasets

We will process a data from the Henikoff lab, generated in 2011 and published in PNAS.

3.1 Getting data

3.1.1 Downloading reads from internet

We can download the paired-end reads (R1 and R2 fastq files) directly from the internet.

sh
cd ~/EpigenomicsDataAnalysis_2023
mkdir data/
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/002/SRR3193262/SRR3193262_1.fastq.gz -o data/MNase_R1.fq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/002/SRR3193262/SRR3193262_2.fastq.gz -o data/MNase_R2.fq.gz

3.1.2 FastQC reads

fastqc program will run quick QCs on each fastq file separately.

sh
mkdir data/fastqc
fastqc \
    --outdir data/fastqc \
    --noextract \
    --threads 12 \
    --adapters /home/rsg/repos/autobcl2fastq/adapters.txt \
    data/*fq.gz 1>&2

3.2 Pre-process reads

3.2.1 Trimming reads with trim_galore

sh
trim_galore \
    --cores 8 \
    --length 20 \
    --gzip \
    --paired \
    --output_dir data/trimmed/ \
    data/MNase_R1.fq.gz data/MNase_R2.fq.gz

3.2.2 FastQC on trimmed reads

sh
fastqc \
    --outdir data/fastqc \
    --noextract \
    --threads 12 \
    --adapters /home/rsg/repos/autobcl2fastq/adapters.txt \
    data/trimmed/*fq.gz 1>&2

3.3 Align reads to a genome reference

3.3.1 Indexing sacCer3 genome

Genome references for model systems can be fetched from iGenomes.

sh
## - Fetching data from iGenomes
mkdir data/genome
curl -L http://igenomes.illumina.com.s3-website-us-east-1.amazonaws.com/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae_Ensembl_R64-1-1.tar.gz -o data/genome/R64-1-1.tar.gz

## - Unpacking data 
tar -C data/genome/ -xzf data/genome/R64-1-1.tar.gz
cp data/genome/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/WholeGenomeFasta/genome.fa data/genome/R64-1-1.fa
cp data/genome/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.gtf data/genome/R64-1-1.gtf

## - Indexing genome
bowtie2-build data/genome/R64-1-1.fa data/genome/R64-1-1

3.3.2 Mapping paired-end trimmed reads

sh
## - Mapping fastq files to reference genome
mkdir data/mapping
bowtie2 \
    --threads 16 \
    -x data/genome/R64-1-1 \
    -1 data/trimmed/MNase_R1_val_1.fq.gz \
    -2 data/trimmed/MNase_R2_val_2.fq.gz \
    > data/mapping/MNase_R64-1-1.sam

3.3.3 Filtering mapped fragments

sh
## - Fixing mates
#| -m: "Add mate score tag"
samtools fixmate \
    -@ 16 --output-fmt bam \
    -m \
    data/mapping/MNase_R64-1-1.sam data/mapping/MNase_R64-1-1.bam

## - Sorting read pairs
samtools sort \
    -@ 16 --output-fmt bam \
    data/mapping/MNase_R64-1-1.bam \
    -o data/mapping/MNase_R64-1-1_sorted.bam 

## - Removing PCR & optical duplicates
#| -s: "Report stats"
#| -r: "Remove duplicate reads"
samtools markdup \
    -@ 16 --output-fmt bam \
    -s -r \
    data/mapping/MNase_R64-1-1_sorted.bam \
    data/mapping/MNase_R64-1-1_sorted_noDups.bam 

## - Filter read pairs
#| -f 0x001: "Keep read paired"
#| -f 0x002: "Keep read mapped in proper pair"
#| -F 0x004: "Remove read unmapped"
#| -F 0x008: "Remove mate unmapped"
#| -q 10: "MAPQ > 20"
#| --fast: "Use fast bam compression"
samtools view \
    -@ 16 --output-fmt bam \
    -f 0x001 -f 0x002 -F 0x004 -F 0x008 -q 20 \
    --fast \
    data/mapping/MNase_R64-1-1_sorted_noDups.bam \
    -o data/mapping/MNase_R64-1-1_sorted_noDups_filtered.bam

## - Sorting read pairs
#| -l 9: "Use best compression "
samtools sort \
    -@ 16 --output-fmt bam \
    -l 9 \
    data/mapping/MNase_R64-1-1_sorted_noDups_filtered.bam \
    -o data/mapping/MNase_R64-1-1_sorted_noDups_filtered_sorted.bam

## - Indexing bam file
samtools index -@ 16 data/mapping/MNase_R64-1-1_sorted_noDups_filtered_sorted.bam

3.3.4 Create coverage track

sh
## - Generate coverage
mkdir data/tracks
bamCoverage \
    --bam data/mapping/MNase_R64-1-1_sorted_noDups_filtered_sorted.bam \
    --outFileName data/tracks/MNase_R64-1-1_sorted_noDups_filtered_sorted.CPM.bw \
    --binSize 1 \
    --numberOfProcessors 16 \
    --normalizeUsing CPM \
    --skipNonCoveredRegions \
    --extendReads

3.3.5 Create nucleosome track

sh
bamCoverage \
    --bam data/mapping/MNase_R64-1-1_sorted_noDups_filtered_sorted.bam \
    --outFileName data/tracks/MNase_R64-1-1_sorted_noDups_filtered_sorted.135-160bp.CPM.bw \
    --binSize 1 \
    --numberOfProcessors 16 \
    --normalizeUsing CPM \
    --skipNonCoveredRegions \
    --extendReads 40 \
    --centerReads \
    --minFragmentLength 130 \
    --maxFragmentLength 165