10  Lab 1: Hands-on processing of RNA-seq

Aims
  • Perform raw data QC with FastQC
  • Manually process RNA-seq reads
  • Generate stranded RNA-seq tracks with bamCoverage
  • Inspect tracks with IGV
Datasets

We will use the SRR9929264 RNA-seq data published in Nuño-Cabanes et al., Scientific Data 2020

10.1 Checking quality of raw data with FastQC

fastq files have already been downloaded from SRA, using the following command:

bash
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR992/004/SRR9929264/SRR9929264_1.fastq.gz -o Share/reads/Ctl_rep1_R1.fq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR992/004/SRR9929264/SRR9929264_2.fastq.gz -o Share/reads/Ctl_rep1_R2.fq.gz
  • Checkout the Share/reads folder where the reads are stored (for Ctl_rep1 at least).
  • What type of sequencing data (paired-end, single-end) is it?
  • How many reads are there in each file?
  • How long are the reads?
bash
zcat Share/reads/Ctl_rep1_R1.fq.gz | wc -l
zcat Share/reads/Ctl_rep1_R1.fq.gz | head -n 8
  • Check FastQC help to understand the different options available.
  • Run FastQC to generate a QC report of the quality of raw data. Think about the Share/adapters.txt file that you may provide as an option…
bash
fastqc --help
mkdir qc/
fastqc --outdir qc/ \
    --noextract \
    --threads 2 \
    --adapters Share/adapters.txt \
    Share/reads/Ctl_rep1_R1.fq.gz
  • Open the Ctl_rep1_R1_fastqc.html file in your browser.
  • What is the overall quality of the reads?
  • What is the GC content of the reads? Which species could these reads come from?
  • Are there any overrepresented sequences? What are they?

10.2 Map RNA-seq reads with bowtie2

  • Check the STAR help to understand the different options available.
  • What are the required options to run STAR?
  • Do we have the yeast genome available and indexed by STAR somewhere?
bash
STAR --help
ls Share/genome/R64-1-1*
  • Since the genome is not already indexed by STAR, generate it.
bash
mkdir -p genomes/R64-1-1/
STAR \
    --runMode genomeGenerate \
    --genomeDir genomes/R64-1-1/STAR/ \
    --genomeFastaFiles Share/genome/R64-1-1.fa \
    --sjdbGTFfile Share/genome/R64-1-1.gtf \
    --genomeSAindexNbases 10
  • Check the genomes/R64-1-1/STAR folder. What files are generated?

  • Run STAR to map the Ctl_rep1 against the yeast genome.

  • Use the --runThreadN option to speed up the mapping.

bash
STAR \
    --genomeDir genomes/R64-1-1/STAR/ \
    --readFilesIn Share/reads/Ctl_rep1_R1.fq.gz Share/reads/Ctl_rep1_R2.fq.gz \
    --readFilesCommand zcat \
    --runThreadN 2 \
    --outFileNamePrefix Ctl_rep1. \
    --outSAMtype BAM Unsorted \
    --outSAMunmapped None \
    --outSAMattributes Standard
  • Check the mapping report generated by STAR.
  • How many reads were mapped? How many reads were discarded? What is the mapping rate?
  • How many reads were mapped to multiple locations?
bash
cat Ctl_rep1.Log.final.out
  • What is the output format of the mapped reads?
  • Check the content of the output file with samtools view. Don’t forget to check the -h option to include the header!
  • How many reads are there in the Ctl_rep1.Aligned.out.bam file?
  • How many reads map to each chromosome?
bash
ls -lh Ctl_rep1.Aligned.out.bam
samtools view -h Ctl_rep1.Aligned.out.bam | head -n 30
samtools view Ctl_rep1.Aligned.out.bam | cut -f 3 | sort | uniq -c
  • Get some statistics about the mapped reads with samtools stats. The summary numbers can be extracted with grep ^SN.
  • What is the average insert size? How is it different from the read size?
bash
samtools stats Ctl_rep1.Aligned.out.bam | grep ^SN

10.3 Generate RNA-seq track with bamCoverage

  • Check the bamCoverage help to understand the different options available.
  • What are the required options to run bamCoverage?
  • What is the output format of the generated track?
  • What is the default bin size?
  • What are the paired-end-specific options?
bash
bamCoverage --help
  • Run bamCoverage to generate a track from the Ctl_rep1 mapped reads.
  • Use the --binSize option to set the bin size to 10 bp.
  • Choose a suitable normalization method.
bash
samtools sort --write-index -o Ctl_rep1_sorted.bam Ctl_rep1.Aligned.out.bam
bamCoverage \
    --bam Ctl_rep1_sorted.bam \
    --outFileName Ctl_rep1.bw \
    --outFileFormat bigwig \
    --binSize 10 \
    --numberOfProcessors 2 \
    --normalizeUsing CPM \
    --extendReads
  • Why should RNA-seq tracks be strand-oriented?
  • Run bamCoverage again to generate two stranded tracks. Use the --filterRNAstrand option to set the strand orientation.
bash
bamCoverage \
    --bam Ctl_rep1_sorted.bam \
    --outFileName Ctl_rep1.fwd.bw \
    --outFileFormat bigwig \
    --binSize 10 \
    --normalizeUsing CPM \
    --extendReads \
    --filterRNAstrand forward
bamCoverage \
    --bam Ctl_rep1_sorted.bam \
    --outFileName Ctl_rep1.rev.bw \
    --outFileFormat bigwig \
    --binSize 10 \
    --normalizeUsing CPM \
    --extendReads \
    --filterRNAstrand reverse

10.4 Visually inspect output tracks

  • Download the tracks and load them in IGV.
  • Compare unstranded and stranded tracks, and check the strand orientation of the reads compared to the gene annotations.
Back to top