10 Lab 1: Hands-on processing of RNA-seq

Aims

Perform raw data QC with FastQC
Manually process RNA-seq reads
Generate stranded RNA-seq tracks with bamCoverage
Inspect tracks with IGV

Datasets

We will use the SRR9929264 RNA-seq data published in Nuño-Cabanes et al., Scientific Data 2020

10.1 Checking quality of raw data with `FastQC`

fastq files have already been downloaded from SRA, using the following command:

bash

curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR992/004/SRR9929264/SRR9929264_1.fastq.gz -o Share/reads/Ctl_rep1_R1.fq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR992/004/SRR9929264/SRR9929264_2.fastq.gz -o Share/reads/Ctl_rep1_R2.fq.gz

Checkout the Share/reads folder where the reads are stored (for Ctl_rep1 at least).
What type of sequencing data (paired-end, single-end) is it?
How many reads are there in each file?
How long are the reads?

Answer

bash

zcat Share/reads/Ctl_rep1_R1.fq.gz | wc -l
zcat Share/reads/Ctl_rep1_R1.fq.gz | head -n 8

Check FastQC help to understand the different options available.
Run FastQC to generate a QC report of the quality of raw data. Think about the Share/adapters.txt file that you may provide as an option…

Answer

bash

fastqc --help
mkdir qc/
fastqc --outdir qc/ \
    --noextract \
    --threads 2 \
    --adapters Share/adapters.txt \
    Share/reads/Ctl_rep1_R1.fq.gz

Open the Ctl_rep1_R1_fastqc.html file in your browser.
What is the overall quality of the reads?
What is the GC content of the reads? Which species could these reads come from?
Are there any overrepresented sequences? What are they?

10.2 Map RNA-seq reads with `bowtie2`

Check the STAR help to understand the different options available.
What are the required options to run STAR?
Do we have the yeast genome available and indexed by STAR somewhere?

Answer

bash

STAR --help
ls Share/genome/R64-1-1*

Since the genome is not already indexed by STAR, generate it.

Answer

bash

mkdir -p genomes/R64-1-1/
STAR \
    --runMode genomeGenerate \
    --genomeDir genomes/R64-1-1/STAR/ \
    --genomeFastaFiles Share/genome/R64-1-1.fa \
    --sjdbGTFfile Share/genome/R64-1-1.gtf \
    --genomeSAindexNbases 10

Check the genomes/R64-1-1/STAR folder. What files are generated?
Run STAR to map the Ctl_rep1 against the yeast genome.
Use the --runThreadN option to speed up the mapping.

Answer

bash

STAR \
    --genomeDir genomes/R64-1-1/STAR/ \
    --readFilesIn Share/reads/Ctl_rep1_R1.fq.gz Share/reads/Ctl_rep1_R2.fq.gz \
    --readFilesCommand zcat \
    --runThreadN 2 \
    --outFileNamePrefix Ctl_rep1. \
    --outSAMtype BAM Unsorted \
    --outSAMunmapped None \
    --outSAMattributes Standard

Check the mapping report generated by STAR.
How many reads were mapped? How many reads were discarded? What is the mapping rate?
How many reads were mapped to multiple locations?

Answer

bash

cat Ctl_rep1.Log.final.out

What is the output format of the mapped reads?
Check the content of the output file with samtools view. Don’t forget to check the -h option to include the header!
How many reads are there in the Ctl_rep1.Aligned.out.bam file?
How many reads map to each chromosome?

Answer

bash

ls -lh Ctl_rep1.Aligned.out.bam
samtools view -h Ctl_rep1.Aligned.out.bam | head -n 30
samtools view Ctl_rep1.Aligned.out.bam | cut -f 3 | sort | uniq -c

Get some statistics about the mapped reads with samtools stats. The summary numbers can be extracted with grep ^SN.
What is the average insert size? How is it different from the read size?

Answer

bash

samtools stats Ctl_rep1.Aligned.out.bam | grep ^SN

10.3 Generate RNA-seq track with `bamCoverage`

Check the bamCoverage help to understand the different options available.
What are the required options to run bamCoverage?
What is the output format of the generated track?
What is the default bin size?
What are the paired-end-specific options?

Answer

bash

bamCoverage --help

Run bamCoverage to generate a track from the Ctl_rep1 mapped reads.
Use the --binSize option to set the bin size to 10 bp.
Choose a suitable normalization method.

Answer

bash

samtools sort --write-index -o Ctl_rep1_sorted.bam Ctl_rep1.Aligned.out.bam
bamCoverage \
    --bam Ctl_rep1_sorted.bam \
    --outFileName Ctl_rep1.bw \
    --outFileFormat bigwig \
    --binSize 10 \
    --numberOfProcessors 2 \
    --normalizeUsing CPM \
    --extendReads

Why should RNA-seq tracks be strand-oriented?
Run bamCoverage again to generate two stranded tracks. Use the --filterRNAstrand option to set the strand orientation.

Answer

bash

bamCoverage \
    --bam Ctl_rep1_sorted.bam \
    --outFileName Ctl_rep1.fwd.bw \
    --outFileFormat bigwig \
    --binSize 10 \
    --normalizeUsing CPM \
    --extendReads \
    --filterRNAstrand forward
bamCoverage \
    --bam Ctl_rep1_sorted.bam \
    --outFileName Ctl_rep1.rev.bw \
    --outFileFormat bigwig \
    --binSize 10 \
    --normalizeUsing CPM \
    --extendReads \
    --filterRNAstrand reverse

10.4 Visually inspect output tracks

Download the tracks and load them in IGV.
Compare unstranded and stranded tracks, and check the strand orientation of the reads compared to the gene annotations.

10.1 Checking quality of raw data with FastQC

10.2 Map RNA-seq reads with bowtie2

10.3 Generate RNA-seq track with bamCoverage

10.4 Visually inspect output tracks

10.1 Checking quality of raw data with `FastQC`

10.2 Map RNA-seq reads with `bowtie2`

10.3 Generate RNA-seq track with `bamCoverage`