4  Lab 1 - Processing of MNase-seq data

Aims
  • Fetching an MNase-seq dataset from GEO
  • Indexing a genome with bowtie2
  • Map paired-end reads with bowtie2
  • Generate sequencing-depth normalized track
  • Generate nucleosomes track
  • Check the relevance of filtering out duplicates
Datasets

We will process a data from the Henikoff lab, generated in 2011 and published in PNAS.

4.1 Getting data

4.1.1 Downloading reads from internet

We can download the paired-end reads (R1 and R2 fastq files) directly from the internet.

  • Find the download links associated with the SRR3193263 SRR ID. You can go to SRA-explorer to easily recover links.
  • Download the two fastq files for the SRR3193263 SRR ID.
sh
cd ~/EpigenomicsDataAnalysis_2023
mkdir data/
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/003/SRR3193263/SRR3193263_1.fastq.gz -o data/MNase_2.5min_R1.fq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR319/003/SRR3193263/SRR3193263_2.fastq.gz -o data/MNase_2.5min_R2.fq.gz

4.1.2 FastQC reads

  • Run fastqc on each fastq file individually.
sh
mkdir data/fastqc
fastqc \
    --outdir data/fastqc \
    --noextract \
    --threads 12 \
    data/*fq.gz 1>&2

4.2 Pre-process reads

4.2.1 Trimming reads with trim_galore

  • Did you find any adapter contamination in the two original fastq files?
  • If so, proceed to fastq file trimming with cutadapt. Read its doc to see how to automatically run fastqc after trimming reads.
sh
trim_galore \
    --cores 8 \
    --length 20 \
    --gzip \
    --paired \
    --fastqc \
    --fastqc_args "--outdir data/fastqc --noextract --threads 12" \
    --output_dir data/trimmed/ \
    data/MNase_2.5min_R1.fq.gz data/MNase_2.5min_R2.fq.gz

4.3 Align reads to a genome reference

4.3.1 Indexing sacCer3 genome

Genome references for model systems can be fetched from iGenomes.

  • Build S. cerevisiae genome (R64-1-1) bowtie2 index.
sh
## - Fetching data from iGenomes
mkdir data/genome
curl -L http://igenomes.illumina.com.s3-website-us-east-1.amazonaws.com/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae_Ensembl_R64-1-1.tar.gz -o data/genome/R64-1-1.tar.gz

## - Unpacking data 
tar -C data/genome/ -xzf data/genome/R64-1-1.tar.gz
cp data/genome/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Sequence/WholeGenomeFasta/genome.fa data/genome/R64-1-1.fa
cp data/genome/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Annotation/Genes/genes.gtf data/genome/R64-1-1.gtf

## - Indexing genome
bowtie2-build data/genome/R64-1-1.fa data/genome/R64-1-1

4.3.2 Mapping paired-end trimmed reads

  • Map paired-end reads with bowtie2.
sh
## - Mapping fastq files to reference genome
mkdir data/mapping
bowtie2 \
    --threads 16 \
    -x data/genome/R64-1-1 \
    -1 data/trimmed/MNase_2.5min_R1_val_1.fq.gz \
    -2 data/trimmed/MNase_2.5min_R2_val_2.fq.gz \
    > data/mapping/MNase_2.5min_R64-1-1.sam

4.3.3 Filtering mapped fragments

  • Filter mapped pairs using the following procedure:

    • Fixing mates

    • Sorting reads

    • Removing duplicates

    • Filtering pairs:

      • Only keep paired reads (0x001)
      • Only keep reads mapped in proper pair (0x002)
      • No unmapped reads (0x004)
      • No reads with unmapped mate (0x008)
      • Reads mapped with a MAPQ >= 20
    • Sorting reads

    • Indexing reads

To better understand the combinations of information described by the SAM “flag”, check the Decoding SAM flags page.

sh
## - Fixing mates
#| -m: "Add mate score tag"
samtools fixmate \
    -@ 16 --output-fmt bam \
    -m \
    data/mapping/MNase_2.5min_R64-1-1.sam data/mapping/MNase_2.5min_R64-1-1.bam

## - Sorting read pairs
samtools sort \
    -@ 16 --output-fmt bam \
    data/mapping/MNase_2.5min_R64-1-1.bam \
    -o data/mapping/MNase_2.5min_R64-1-1_sorted.bam 

## - Removing PCR & optical duplicates
#| -s: "Report stats"
#| -r: "Remove duplicate reads"
samtools markdup \
    -@ 16 --output-fmt bam \
    -s -r \
    data/mapping/MNase_2.5min_R64-1-1_sorted.bam \
    data/mapping/MNase_2.5min_R64-1-1_sorted_noDups.bam 

## - Filter read pairs
#| -f 0x001: "Keep read paired"
#| -f 0x002: "Keep read mapped in proper pair"
#| -F 0x004: "Remove read unmapped"
#| -F 0x008: "Remove mate unmapped"
#| -q 20: "MAPQ >= 20"
#| --fast: "Use fast bam compression"
samtools view \
    -@ 16 --output-fmt bam \
    -f 0x001 -f 0x002 -F 0x004 -F 0x008 -q 20 \
    --fast \
    data/mapping/MNase_2.5min_R64-1-1_sorted_noDups.bam \
    -o data/mapping/MNase_2.5min_R64-1-1_sorted_noDups_filtered.bam

## - Sorting read pairs
#| -l 9: "Use best compression "
samtools sort \
    -@ 16 --output-fmt bam \
    -l 9 \
    data/mapping/MNase_2.5min_R64-1-1_sorted_noDups_filtered.bam \
    -o data/mapping/MNase_2.5min_R64-1-1_sorted_noDups_filtered_sorted.bam

## - Indexing bam file
samtools index -@ 16 data/mapping/MNase_2.5min_R64-1-1_sorted_noDups_filtered_sorted.bam

4.3.4 Create coverage track

  • Create a sequencing depth-normalized track, filling out fragments.
sh
## - Generate coverage
mkdir data/tracks
bamCoverage \
    --bam data/mapping/MNase_2.5min_R64-1-1_sorted_noDups_filtered_sorted.bam \
    --outFileName data/tracks/MNase_2.5min_R64-1-1_sorted_noDups_filtered_sorted.CPM.bw \
    --binSize 1 \
    --numberOfProcessors 16 \
    --normalizeUsing CPM \
    --skipNonCoveredRegions \
    --extendReads

4.3.5 Create nucleosome track

  • Create a nucleosome track by keeping the fragments between 130 and 165bp, and extending them to 40bp aligned at their center.
sh
bamCoverage \
    --bam data/mapping/MNase_2.5min_R64-1-1_sorted_noDups_filtered_sorted.bam \
    --outFileName data/tracks/MNase_2.5min_R64-1-1_sorted_noDups_filtered_sorted.135-160bp.CPM.bw \
    --binSize 1 \
    --numberOfProcessors 16 \
    --normalizeUsing CPM \
    --skipNonCoveredRegions \
    --extendReads 40 \
    --centerReads \
    --minFragmentLength 130 \
    --maxFragmentLength 165

4.3.6 Create coverage track without removing duplicates

  • Reprocess the bam file to generate coverage and nucleosome tracks from fragments without removing the read duplicates.
  • Compare the different tracks generated in IGV. Comment.
sh
## - Filter read pairs
#| -f 0x001: "Keep read paired"
#| -f 0x002: "Keep read mapped in proper pair"
#| -F 0x004: "Remove read unmapped"
#| -F 0x008: "Remove mate unmapped"
#| -q 10: "MAPQ > 20"
#| --fast: "Use fast bam compression"
samtools view \
    -@ 16 --output-fmt bam \
    -f 0x001 -f 0x002 -F 0x004 -F 0x008 -q 20 \
    --fast \
    data/mapping/MNase_2.5min_R64-1-1_sorted.bam \
    -o data/mapping/MNase_2.5min_R64-1-1_sorted_filtered.bam

## - Sorting read pairs
#| -l 9: "Use best compression "
samtools sort \
    -@ 16 --output-fmt bam \
    -l 9 \
    data/mapping/MNase_2.5min_R64-1-1_sorted_filtered.bam \
    -o data/mapping/MNase_2.5min_R64-1-1_sorted_filtered_sorted.bam

## - Indexing bam file
samtools index -@ 16 data/mapping/MNase_2.5min_R64-1-1_sorted_filtered_sorted.bam

## - Generate coverage
mkdir data/tracks
bamCoverage \
    --bam data/mapping/MNase_2.5min_R64-1-1_sorted_filtered_sorted.bam \
    --outFileName data/tracks/MNase_2.5min_R64-1-1_sorted_filtered_sorted.CPM.bw \
    --binSize 1 \
    --numberOfProcessors 16 \
    --normalizeUsing CPM \
    --skipNonCoveredRegions \
    --extendReads