At any time, if you are lost or do not understand how a function in the proposed solution works, type ?<function> in the R console and its help page will appear.
You can also check the help tab in the corresponding quadrant.
1.1 Set up your working environment
Let’s create a project for this course!
In RStudio: File > New Project..., create a project entitled “Bioc-workshop”.
Open the newly created project.
Download files required for the workshop from the following Google Drive Shared folder:
Now that we have two concordant objects, we would like to focus on sets of tissue-specific TSSs.
Question
Re-focus the genes to center them at their TSSs. Choose a window size that you find appropriate (hint: we’ll later focus on the TATA box, so the window has to at least encompass it!)
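One possible way to approach this, sketched on toy data: the gene coordinates below are invented for illustration and the value of WINDOW_SIZE is an assumption, not a recommendation from the workshop. The sketch relies on resize() from GenomicRanges, which is strand-aware for GRanges objects, so fix = "start" picks the 5' end of each gene whatever its strand.

```r
library(GenomicRanges)

# Toy genes, invented for illustration (not the workshop data)
genes <- GRanges(
  seqnames = c("chrI", "chrI"),
  ranges = IRanges(start = c(1000, 5000), end = c(2000, 6000)),
  strand = c("+", "-")
)

# Assumed window size; pick one wide enough to contain the TATA box
WINDOW_SIZE <- 200

# Step 1: shrink each gene to a 1-bp range located at its TSS.
# For "-" strand genes, the TSS is the genomic end coordinate.
tss <- resize(genes, width = 1, fix = "start")
start(tss)
# 1000 6000

# Step 2: grow each TSS into a window centered on it
tss_windows <- resize(tss, width = WINDOW_SIZE, fix = "center")
width(tss_windows)
# 200 200
```

With windows built this way, the TSS sits in the middle of each sequence, which is why subtracting WINDOW_SIZE/2 from a match position gives an approximate distance to the TSS.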
To iterate over a set of values (e.g. a set of tissues…), one should favor the lapply() function over for loops. lapply() returns a list with one element for each element of the input vector. Check ?lapply for more information if you are not familiar with it.
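As a minimal illustration of this pattern, the sketch below uses a small made-up table (the identifiers and tissue labels are hypothetical, not the workshop data) and base R only. lapply() applies the same function to each tissue and returns a list, which is then named after the tissues.

```r
# Made-up table of TSSs and the tissue each one is assigned to
tss_table <- data.frame(
  id = c("tss1", "tss2", "tss3", "tss4"),
  tissue = c("tissueA", "tissueB", "tissueA", "tissueC"),
  stringsAsFactors = FALSE
)

tissues <- unique(tss_table$tissue)

# One list element per tissue: the ids of the TSSs assigned to it
tss_by_tissue <- lapply(tissues, function(tissue) {
  tss_table$id[tss_table$tissue == tissue]
})
names(tss_by_tissue) <- tissues

length(tss_by_tissue)
# 3
tss_by_tissue[["tissueA"]]
# "tss1" "tss3"
```

Naming the list makes it easy to retrieve the result for a given tissue later on.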
Question
Make a list of 6 elements, containing the forward TSSs specific to each of the 5 main tissues, plus the ubiquitous TSSs.
Read ?matchPattern for information on how to find a given sequence in a DNAStringSet. What is the difference between matchPattern() and vmatchPattern()?
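To make the comparison concrete, here is a small sketch on invented sequences (not the workshop data). It assumes the Biostrings package is installed. matchPattern() searches a single sequence, whereas vmatchPattern() searches every sequence of a set and returns one group of matches per sequence.

```r
library(Biostrings)

# Toy sequences, invented for illustration
seqs <- DNAStringSet(c(
  a = "GGTATAAAGG",
  b = "CCCCCCCCCC",
  c = "TATAAATATAAA"
))

# matchPattern(): one pattern against ONE sequence (a DNAString)
single_hits <- matchPattern("TATAAA", seqs[["a"]])
start(single_hits)
# 3

# vmatchPattern(): one pattern against EVERY sequence of a DNAStringSet
all_hits <- vmatchPattern("TATAAA", seqs)

# startIndex() gives one element per sequence (NULL when there is no match)
hits_per_seq <- lengths(startIndex(all_hits))
```

Here hits_per_seq holds the number of matches found in each of the three sequences.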
Question
Is there a TATA box (“TATAAA”) in the first intestine TSS sequence?
Plot the distance between the TATA box and intestine-specific TSSs.
Answer
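library(ggplot2)
# Start positions of all "TATAAA" matches within the intestine TSS sequences
motif_occurences <- vmatchPattern("TATAAA", intest_TSS_seqs)
positions <- unlist(startIndex(motif_occurences))
# WINDOW_SIZE is the window width chosen earlier; sequences are centered on the TSS
df <- data.frame(pos = positions - WINDOW_SIZE/2)
ggplot(df, aes(x = pos)) +
    geom_histogram(binwidth = 10) +
    labs(x = "Distance to TSS", y = "# of motifs")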
1.7 Function wrapping
We now know how to 1) parse GRanges, 2) map a chosen motif (“TATAAA”) over a set of sequences and 3) plot the distance between this motif and the center of sequences.
Question
Create a function which takes a set of sequences and a chosen motif as input, and returns a plot of the distance between this motif and the center of the sequences.
Answer
plotMotifDistance <- function(seqs, motif) {
    motif_occurences <- vmatchPattern(motif, seqs)
    positions <- unlist(startIndex(motif_occurences))
    df <- data.frame(pos = positions - WINDOW_SIZE/2)
    ggplot(df, aes(x = pos)) +
        geom_histogram(binwidth = 10) +
        labs(
            x = "Distance to TSS",
            y = glue::glue("# of {motif} motifs"),
            caption = glue::glue("{length(positions)} motifs found amongst {length(seqs)} sequences")
        )
}
plotMotifDistance(intest_TSS_seqs, "TATAAA")
Question
Add an argument to specify the number of mismatches allowed when looking for the motif. Also adapt the vmatchPattern() call accordingly!
The real TATA box consensus sequence is closer to “TATAWAA”. Can you further adapt the vmatchPattern() call so that it accepts any IUPAC ambiguity code (e.g. W = A/T)?
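A possible starting point is sketched below on toy sequences. The helper findMotifStarts() and its argument names are hypothetical and only illustrate the matching step, not the plotting. It relies on two existing arguments of vmatchPattern(): max.mismatch, which sets the number of tolerated mismatches, and fixed, which, when set to FALSE, lets IUPAC ambiguity codes match any of the letters they stand for (in both the pattern and the subject).

```r
library(Biostrings)

# Hypothetical helper: start positions of a motif in a set of sequences
findMotifStarts <- function(seqs, motif, max_mismatch = 0, fixed = TRUE) {
  hits <- vmatchPattern(
    motif, seqs,
    max.mismatch = max_mismatch, # number of mismatches tolerated
    fixed = fixed                # FALSE: interpret IUPAC codes such as W
  )
  unname(unlist(startIndex(hits)))
}

# Toy sequences, invented for illustration
seqs <- DNAStringSet(c("GGTATAAAAGG", "GGTATATAAGG", "GGTATACAAGG"))

# "W" stands for A or T, so only the first two sequences match
iupac_starts <- findMotifStarts(seqs, "TATAWAA", fixed = FALSE)
iupac_starts
# 3 3

# Exact pattern, but tolerating one mismatch
mismatch_starts <- findMotifStarts(seqs, "TATAAAA", max_mismatch = 1)
```

The same two arguments could then be passed through plotMotifDistance() to its vmatchPattern() call.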