11  Lab 2: Quantification and differential gene expression by RNA-seq

Aims
  • Estimate transcript abundance from RNA-seq samples
  • Import count data in R
  • Perform differential expression analysis with DESeq2
Datasets

RNA-seq data was published in Nuño-Cabanes et al., Scientific Data 2020

  • Control RNA-seq @ 30 C:

    • SRR9929263
    • SRR9929264
    • SRR9929273
    • SRR9929282
  • Heat shock RNA-seq @ 39 C, 20 min: SRR2045248 & SRR2045249

    • SRR9929271
    • SRR9929265
    • SRR9929280
    • SRR9929274

11.1 Count mapped fragments over a set of gene annotations

  • Check the files in the Share/mapping directory.
bash
ls -l Share/mapping/RNAseq*bam
  • Locate the gene annotation file provided for the yeast genome.
  • How many annotations are there in the file?
  • What type of annotations are there?
  • How can you have more start codons than stop codons?
bash
wc -l Share/genome/R64-1-1.gtf 
head Share/genome/R64-1-1.gtf
cut -f3 Share/genome/R64-1-1.gtf | sort | uniq -c
  • Check featureCounts help to understand the different options available.
  • Which level of annotations should you use to count the number of mapped fragments over?
  • What are the featureCounts options specific to paired-end reads? And to stranded RNA-seq?
  • Use featureCounts to count the number of mapped fragments over the set of gene annotations.
bash
featureCounts --help
featureCounts \
    -a Share/genome/R64-1-1.gtf \
    -o RNAseq_counts.txt \
    -t transcript \
    -g gene_name \
    -s 2 \
    -p \
    --countReadPairs \
    -P -B -D 10000 -C \
    -T 2 \
    Share/mapping/RNAseq*R64-1-1.bam

11.2 Import counts into R

  • Create a tibble containing the following information for the four RNA-seq samples:

    • Path to local bam file
    • The biological sample it corresponds to (e.g. WT, HS)
    • The biological replicate it corresponds to (e.g. rep1, rep2, …)
R
library(tidyverse)
samples <- tibble(
    bam = list.files('Share/mapping', pattern = 'RNAseq.*_R64-1-1.bam$', full.names = TRUE), 
    sample = basename(bam), 
) |> separate(sample, sep = '_', into = c(NA, 'condition', 'rep', NA)) |> 
    mutate(sample = paste(condition, rep, sep = '_'))
  • Import gene annotations for yeast.
  • Filter it out to only keep transcripts
R
library(rtracklayer)
genes <- import('Share/genome/R64-1-1.gtf')
genes <- genes[genes$type == 'transcript']
names(genes) <- genes$gene_name
  • Import the counts generated by featureCounts into R.
R
cnts <- read_tsv('RNAseq_counts.txt', col_names = TRUE, comment = "#")
cnts <- cnts |> 
    select(-c(2:6)) |> 
    column_to_rownames('Geneid') |> 
    as.matrix() 
colnames(cnts) <- samples$sample
  • Wrap the counts, the sample metadata and the gene annotations into a SummarizedExperiment object.
R
library(SummarizedExperiment)
se <- SummarizedExperiment(
    assays = list(counts = cnts), 
    rowRanges = genes, 
    colData = samples
)

11.3 Differential expression analysis

  • Read DESeq documentation (?DESeq). What type of object does it require? How can you create one?
  • Create a DESeqDataSet object from the RNA counts. Choose your design formula appropriately.
  • Run DESeq workflow
R
library(DESeq2)
dds <- DESeqDataSet(se, design = ~ condition)
dds <- DESeq(dds)
  • Check the documentation for the results function from DESeq2 package.
  • Extract the results of differential expression analysis, for the appropriate contrast.
  • Recover genes over-expressed in heat-shock vs. control growth (fold-change >= 2, p-value <= 0.01)
R
res <- results(dds, contrast = c('condition', 'HS', 'WT'))
upreg_genes <- res |> 
    as_tibble(rownames = 'gene') |> 
    filter(log2FoldChange > 1, padj <= 0.001) |> 
    pull(gene) |> writeLines()
Back to top