Short DNA sequence motifs provide key information for interpreting the instructions in DNA, for example by providing binding sites for proteins or altering the structure of the double-helix. A less studied but important feature of DNA sequence motifs is their periodicity. A famous example is the 10-bp periodicity of many k-mers in nucleosome positioning (reviewed in Travers et al. 2010 or in Struhl and Segal 2013).
periodicDNA provides a framework to quantify the periodicity of k-mers of interest in DNA sequences. For a chosen k-mer, periodicDNA can identify which periods are statistically enriched in a set of sequences, by using a randomized shuffling approach to compute an empirical p-value. It can also generate continuous linear tracks of k-mer periodicity strength over genomic loci.
To estimate the periodicity strength of a given k-mer in one or several sequences, periodicDNA performs the following steps:
The main goal of periodicDNA is to quantify periodicity of a given k-mer in a set of sequences. For instance, one can assess the periodicity of TT dinucleotides in sequences around TSSs of ubiquitous promoters using getPeriodicity()
.
In the following example, getPeriodicity()
is directly ran using a GRanges object, specifying from which genome this GRanges comes from.
library(ggplot2)
library(magrittr)
library(periodicDNA)
#
data(ce11_TSSs)
periodicity_result <- getPeriodicity(
ce11_TSSs[['Ubiq.']][1:500],
genome = 'BSgenome.Celegans.UCSC.ce11',
motif = 'TT',
BPPARAM = setUpBPPARAM(1)
)
#> - Mapping k-mers.
#> - 523903 pairwise distances measured.
#> - Calculating pairwise distance distribution.
#> - Normalizing distogram vector.
#> - Applying Fast Fourier Transform to the normalized distogram.
The main output of getPeriodicity()
is a table of power spectral density (PSD) values associated with discrete frequencies, computed using a Fast Fourier Transform. For a given frequency, a high PSD score indicates a high periodicity of the k-mer of interest.
In the previous example, TT dinucleotides in sequences around TSSs of ubiquitous promoters are highly periodic, with a periodicity of 10 bp.
head(periodicity_result$PSD)
#> freq period PSD
#> 1 0.005 200.00000 6.256976e-08
#> 2 0.010 100.00000 2.204282e-08
#> 3 0.015 66.66667 2.215522e-09
#> 4 0.020 50.00000 1.108237e-08
#> 5 0.025 40.00000 4.649689e-09
#> 6 0.030 33.33333 2.661198e-08
subset(periodicity_result$periodicityMetrics, Period == 10)
#> Freq Period PSD
#> 20 0.1 10 3.633071e-06
Graphical output of getPeriodicity()
can be obtained using the plotPeriodicityResults()
function:
plotPeriodicityResults(periodicity_result)
The first plot shows the raw distribution of distances between pairs of ‘TT’ in the sequences of the provided GRanges. The second plot shows the decay-normalised distribution. Finally, the third plot shows the PSD scores of the ‘TT’ k-mer, measured from the normalised distribution.
periodicDNA provides an approach to compare the periodicity of a given k-mer in a set of sequences compared to background. For a given k-mer at a period T in a set of input sequences, the fold-change over background of its PSD is estimated by iteratively shuffling the input sequences and estimating the resulting PSD values.
Eventually, the log2 fold-change (l2FC) between the observed PSD and the median of the PSD values measured after shuffling is computed as follows:
l2FC = log2(observed PSD / median(shuffled PSDs)).
periodicity_result <- getPeriodicity(
ce11_TSSs[['Ubiq.']][1:500],
genome = 'BSgenome.Celegans.UCSC.ce11',
motif = 'TT',
n_shuffling = 5
)
#> - Calculating observed PSD
#> - Mapping k-mers.
#> - 523903 pairwise distances measured.
#> - Calculating pairwise distance distribution.
#> - Normalizing distogram vector.
#> - Applying Fast Fourier Transform to the normalized distogram.
#> - Shuffling 1/5
#> - Shuffling 2/5
#> - Shuffling 3/5
#> - Shuffling 4/5
#> - Shuffling 5/5
#> Only 5 shufflings. Cannot compute accurate empirical p-values. To compute empirical p-values, set up n_shuffling to at least 100. Only l2FC values are returned
head(periodicity_result$periodicityMetrics)
#> Freq Period PSD_observed l2FC pval fdr
#> 1 0.005 200.00000 6.26e-08 -1.18584659 NA NA
#> 2 0.010 100.00000 2.20e-08 -0.06313841 NA NA
#> 3 0.015 66.66667 2.22e-09 -2.30747879 NA NA
#> 4 0.020 50.00000 1.11e-08 0.78316580 NA NA
#> 5 0.025 40.00000 4.65e-09 -1.29262003 NA NA
#> 6 0.030 33.33333 2.66e-08 2.09084878 NA NA
subset(periodicity_result$periodicityMetrics, Period == 10)
#> Freq Period PSD_observed l2FC pval fdr
#> 20 0.1 10 3.63e-06 10.33038 NA NA
plotPeriodicityResults(periodicity_result)
If n_shuffling >= 100
, an associated empirical p-value is calculated as well (North et al 2002). This metric indicates, for each individual period T, whether the observed PSD is significantly greater than the PSD values measured after shuffling the input sequences. Note that empirical p-values are only an estimation of the real p-value. Notably, small p-values are systematically under-estimated (North et al 2002).
getPeriodicity()
can also be ran directly on a set of sequences of interest as follows:
data(ce11_proms_seqs)
periodicity_result <- getPeriodicity(
ce11_proms_seqs,
motif = 'TT',
BPPARAM = setUpBPPARAM(1)
)
#> - Mapping k-mers.
#> - 117630 pairwise distances measured.
#> - Calculating pairwise distance distribution.
#> - Normalizing distogram vector.
#> - Applying Fast Fourier Transform to the normalized distogram.
subset(periodicity_result$periodicityMetrics, Period == 10)
#> Freq Period PSD
#> 20 0.1 10 1.16806e-06
The other aim of periodicDNA is to generate continuous linear tracks of k-mer periodicity strength over genomic loci of interest. getPeriodicityTrack()
can be used to generate suck genomic tracks. In the following example,
WW_10bp_track <- getPeriodicityTrack(
genome = 'BSgenome.Celegans.UCSC.ce11',
granges = ce11_proms,
motif = 'WW',
period = 10,
BPPARAM = setUpBPPARAM(1),
bw_file = 'WW-10-bp-periodicity_over-proms.bw'
)
When plotted over sets of ubiquitous, germline or somatic TSSs, the resulting track clearly shows increase of WW 10-bp periodicity above the ubiquitous and germline TSSs, whereas somatic TSSs do not show such increase.
data(ce11_TSSs)
plotAggregateCoverage(
WW_10bp_track,
ce11_TSSs,
xlab = 'Distance from TSS',
ylab = '10-bp periodicity strength (forward proms.)'
)
sessionInfo()
#> R version 4.0.5 (2021-03-31)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.7
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] BSgenome.Celegans.UCSC.ce11_1.4.2 periodicDNA_1.1.1
#> [3] BiocParallel_1.24.1 BSgenome_1.58.0
#> [5] rtracklayer_1.50.0 Biostrings_2.58.0
#> [7] XVector_0.30.0 magrittr_2.0.1
#> [9] ggplot2_3.3.3 GenomicRanges_1.42.0
#> [11] GenomeInfoDb_1.26.7 IRanges_2.24.1
#> [13] S4Vectors_0.28.1 BiocGenerics_0.36.1
#>
#> loaded via a namespace (and not attached):
#> [1] lattice_0.20-41 Rsamtools_2.6.0
#> [3] zoo_1.8-9 rprojroot_2.0.2
#> [5] digest_0.6.27 utf8_1.2.1
#> [7] R6_2.5.0 evaluate_0.14
#> [9] highr_0.9 pillar_1.6.0
#> [11] zlibbioc_1.36.0 rlang_0.4.10
#> [13] Matrix_1.3-2 rmarkdown_2.7
#> [15] pkgdown_1.6.1 labeling_0.4.2
#> [17] textshaping_0.3.3 desc_1.3.0
#> [19] stringr_1.4.0 RCurl_1.98-1.3
#> [21] munsell_0.5.0 DelayedArray_0.16.3
#> [23] compiler_4.0.5 xfun_0.22
#> [25] pkgconfig_2.0.3 systemfonts_1.0.1
#> [27] htmltools_0.5.1.1 SummarizedExperiment_1.20.0
#> [29] tibble_3.1.1 GenomeInfoDbData_1.2.4
#> [31] matrixStats_0.58.0 XML_3.99-0.6
#> [33] fansi_0.4.2 crayon_1.4.1
#> [35] withr_2.4.2 GenomicAlignments_1.26.0
#> [37] bitops_1.0-6 grid_4.0.5
#> [39] gtable_0.3.0 lifecycle_1.0.0
#> [41] scales_1.1.1 stringi_1.5.3
#> [43] cachem_1.0.4 farver_2.1.0
#> [45] fs_1.5.0 ellipsis_0.3.1
#> [47] ragg_1.1.2 vctrs_0.3.7
#> [49] cowplot_1.1.1 tools_4.0.5
#> [51] Biobase_2.50.0 glue_1.4.2
#> [53] MatrixGenerics_1.2.1 fastmap_1.1.0
#> [55] yaml_2.2.1 colorspace_2.0-0
#> [57] memoise_2.0.0 knitr_1.32