Since its early ages, Bioconductor has gone very far. I’ve been on board for the past 7-ish years (I think), I use it every day, and every day, I think about how much I love it.
I love Bioconductor for many reasons, some of them being GenomicRanges
, SummarizedExperiment
, DESeq2
, and of course the amazing team.
But today, one reason which I want to discuss in more depth is the Annotation Hub
. This hub provides access to tons of resources, particularly if you work in human genomics. One reason why it’s so useful in human is because major consortia like Encode and Roadmap Epigenomics Project put their data there. So you end up with tons of stuff available, if you know what you are looking for!
The Roadmap Epigenomics Project (REP) has characterized chromatin states in tens of epigenomes (that’s literally the title of their paper: Integrative analysis of 111 reference human epigenomes!).
And guess what, all the results are available in the AnnotationHub
! Here is a glimpse at some of the data hosted there:
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.3.3 ✔ purrr 0.3.4
## ✔ tibble 3.0.5 ✔ dplyr 1.0.3
## ✔ tidyr 1.1.2 ✔ stringr 1.4.0
## ✔ readr 1.4.0 ✔ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(AnnotationHub)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: BiocFileCache
## Loading required package: dbplyr
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
AnnotationHub() ah <-
## using temporary cache /var/folders/qm/3bz0g1zn5wnb9ck9_lfwhpcm0000gn/T//RtmpGrlDwB/BiocFileCache
## snapshotDate(): 2020-10-27
query(ah, c("Homo sapiens", "EpigenomeRoadMap"))
ah_human_epigenome <-table(ah_human_epigenome$rdataclass)
##
## BigWigFile data.frame GRanges
## 9932 15 8301
head(query(ah_human_epigenome, 'BigWigFile')$title, 50)
## [1] "E001-H3K4me1.fc.signal.bigwig" "E001-H3K4me3.fc.signal.bigwig"
## [3] "E001-H3K9ac.fc.signal.bigwig" "E001-H3K9me3.fc.signal.bigwig"
## [5] "E001-H3K27me3.fc.signal.bigwig" "E001-H3K36me3.fc.signal.bigwig"
## [7] "E002-H3K4me1.fc.signal.bigwig" "E002-H3K4me3.fc.signal.bigwig"
## [9] "E002-H3K9ac.fc.signal.bigwig" "E002-H3K9me3.fc.signal.bigwig"
## [11] "E002-H3K27me3.fc.signal.bigwig" "E002-H3K36me3.fc.signal.bigwig"
## [13] "E003-DNase.fc.signal.bigwig" "E003-H2A.Z.fc.signal.bigwig"
## [15] "E003-H2AK5ac.fc.signal.bigwig" "E003-H2BK5ac.fc.signal.bigwig"
## [17] "E003-H2BK12ac.fc.signal.bigwig" "E003-H2BK15ac.fc.signal.bigwig"
## [19] "E003-H2BK20ac.fc.signal.bigwig" "E003-H2BK120ac.fc.signal.bigwig"
## [21] "E003-H3K4ac.fc.signal.bigwig" "E003-H3K4me1.fc.signal.bigwig"
## [23] "E003-H3K4me2.fc.signal.bigwig" "E003-H3K4me3.fc.signal.bigwig"
## [25] "E003-H3K9ac.fc.signal.bigwig" "E003-H3K9me3.fc.signal.bigwig"
## [27] "E003-H3K14ac.fc.signal.bigwig" "E003-H3K18ac.fc.signal.bigwig"
## [29] "E003-H3K23ac.fc.signal.bigwig" "E003-H3K23me2.fc.signal.bigwig"
## [31] "E003-H3K27ac.fc.signal.bigwig" "E003-H3K27me3.fc.signal.bigwig"
## [33] "E003-H3K36me3.fc.signal.bigwig" "E003-H3K56ac.fc.signal.bigwig"
## [35] "E003-H3K79me1.fc.signal.bigwig" "E003-H3K79me2.fc.signal.bigwig"
## [37] "E003-H4K5ac.fc.signal.bigwig" "E003-H4K8ac.fc.signal.bigwig"
## [39] "E003-H4K20me1.fc.signal.bigwig" "E003-H4K91ac.fc.signal.bigwig"
## [41] "E004-DNase.fc.signal.bigwig" "E004-H2AK5ac.fc.signal.bigwig"
## [43] "E004-H2BK5ac.fc.signal.bigwig" "E004-H2BK15ac.fc.signal.bigwig"
## [45] "E004-H2BK120ac.fc.signal.bigwig" "E004-H3K4ac.fc.signal.bigwig"
## [47] "E004-H3K4me1.fc.signal.bigwig" "E004-H3K4me2.fc.signal.bigwig"
## [49] "E004-H3K4me3.fc.signal.bigwig" "E004-H3K9ac.fc.signal.bigwig"
Wow, almost 10K of bigwig files. Neat!
What about metadata? It is available in the accession AH41830
.
ah_human_epigenome[['AH41830']] epigenome_metadata <-
## downloading 1 resources
## retrieving 1 resource
## loading from cache
1:10, 1:10] epigenome_metadata[
## EID GROUP COLOR MNEMONIC
## 1 E001 ESC #924965 ESC.I3
## 2 E002 ESC #924965 ESC.WA7
## 3 E003 ESC #924965 ESC.H1
## 4 E004 ES-deriv #4178AE ESDR.H1.BMP4.MESO
## 5 E005 ES-deriv #4178AE ESDR.H1.BMP4.TROP
## 6 E006 ES-deriv #4178AE ESDR.H1.MSC
## 7 E007 ES-deriv #4178AE ESDR.H1.NEUR.PROG
## 8 E008 ESC #924965 ESC.H9
## 9 E009 ES-deriv #4178AE ESDR.H9.NEUR.PROG
## 10 E010 ES-deriv #4178AE ESDR.H9.NEUR
## STD_NAME
## 1 ES-I3 Cells
## 2 ES-WA7 Cells
## 3 H1 Cells
## 4 H1 BMP4 Derived Mesendoderm Cultured Cells
## 5 H1 BMP4 Derived Trophoblast Cultured Cells
## 6 H1 Derived Mesenchymal Stem Cells
## 7 H1 Derived Neuronal Progenitor Cultured Cells
## 8 H9 Cells
## 9 H9 Derived Neuronal Progenitor Cultured Cells
## 10 H9 Derived Neuron Cultured Cells
## EDACC_NAME ANATOMY TYPE AGE
## 1 ES-I3_Cell_Line ESC PrimaryCulture CL
## 2 ES-WA7_Cell_Line ESC PrimaryCulture CL
## 3 H1_Cell_Line ESC PrimaryCulture CL
## 4 H1_BMP4_Derived_Mesendoderm_Cultured_Cells ESC_DERIVED ESCDerived CL
## 5 H1_BMP4_Derived_Trophoblast_Cultured_Cells ESC_DERIVED ESCDerived CL
## 6 H1_Derived_Mesenchymal_Stem_Cells ESC_DERIVED ESCDerived CL
## 7 H1_Derived_Neuronal_Progenitor_Cultured_Cells ESC_DERIVED ESCDerived CL
## 8 H9_Cell_Line ESC PrimaryCulture CL
## 9 H9_Derived_Neuronal_Progenitor_Cultured_Cells ESC_DERIVED ESCDerived CL
## 10 H9_Derived_Neuron_Cultured_Cells ESC_DERIVED ESCDerived CL
## SEX
## 1 Female
## 2 Female
## 3 Male
## 4 Male
## 5 Male
## 6 Male
## 7 Male
## 8 Female
## 9 Female
## 10 Female
Cool. Next, we are going to focus on chromatin state segmentation. What’s available there?
query(ah_human_epigenome, c("chromhmmSegmentations", "cell line"))
## AnnotationHub with 5 records
## # snapshotDate(): 2020-10-27
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: GRanges
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH46872"]]'
##
## title
## AH46872 | E017_15_coreMarks_mnemonics.bed.gz
## AH46967 | E114_15_coreMarks_mnemonics.bed.gz
## AH46968 | E115_15_coreMarks_mnemonics.bed.gz
## AH46970 | E117_15_coreMarks_mnemonics.bed.gz
## AH46971 | E118_15_coreMarks_mnemonics.bed.gz
Looks like 5 different epigenomes were segmented into chromatin states. The ones that we are interested in are those from H1 embryonic stem cells (H1) or from Neural Progenitor Cells (NPCs) derived from these H1 cells.
query(ah_human_epigenome, c("chromhmmSegmentations", "ESC.H1"))[[1]] H1_states <-
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## require("rtracklayer")
query(ah_human_epigenome, c("chromhmmSegmentations", "ESDR.H1.NEUR.PROG"))[[1]] NPC_states <-
## downloading 1 resources
## retrieving 1 resource
## loading from cache
resize(H1_states, 1, fix = 'center') # Resizing to make it easier to find overlaps
H1_states <- resize(NPC_states, 1, fix = 'center') # Resizing to make it easier to find overlaps NPC_states <-
Let’s find out how loci are changing between H1 and NPC cultures:
c(
states <-"Active TSS",
"Flanking Active TSS",
"Bivalent/Poised TSS",
"Enhancers",
"Flanking Bivalent TSS/Enh",
"Strong transcription",
"Transcr. at gene 5' and 3'",
"Weak transcription",
"Bivalent Enhancer",
"Genic enhancers",
"Quiescent/Low",
"Heterochromatin",
"Weak Repressed PolyComb",
"Repressed PolyComb",
"ZNF genes & repeats"
) data.frame(
df <-stateInH1 = factor(H1_states$name, levels = states),
stateInNPCs = factor(NPC_states[nearest(H1_states, NPC_states)]$name, levels = states)
)%>%
df group_by(stateInH1, stateInNPCs) %>%
summarize(count = n()) %>%
mutate(label = ifelse(count > 1000, count, '')) %>%
ggplot(aes(x = stateInH1, y = stateInNPCs, fill = count, label = label)) +
geom_tile(col = "black") +
geom_text(size = 1.5) +
theme_bw() +
scale_fill_gradientn(colours = c('white', 'orange', 'darkred')) +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.ticks = element_blank()
+
) labs(
x = "Chromatin states in H1 cells",
y = "Chromatin states in H1-derived\nneural progenitor cells",
fill = '#'
+
) coord_fixed()
## `summarise()` has grouped output by 'stateInH1'. You can override using the `.groups` argument.
Using the tidyverse
to plot the transitions, it’s clear that while most of the segmented genomic loci keep their functional annotation from ESCs to NPCs, there are other loci which change from inactive to activate chromatin states, and vice versa.
For instance, let’s focus on the loci undergoing changes from a Repressed PolyComb
state to a Enhancers
state:
H1_states[df$stateInH1 == 'Repressed PolyComb' & df$stateInNPCs == 'Enhancers']
transitioning_loci <- transitioning_loci
## GRanges object with 4632 ranges and 4 metadata columns:
## seqnames ranges strand | abbr name color_name
## <Rle> <IRanges> <Rle> | <character> <character> <character>
## [1] chr10 552700 * | 13_ReprPC Repressed PolyComb Silver
## [2] chr10 654300 * | 13_ReprPC Repressed PolyComb Silver
## [3] chr10 1782600 * | 13_ReprPC Repressed PolyComb Silver
## [4] chr10 5725500 * | 13_ReprPC Repressed PolyComb Silver
## [5] chr10 6185700 * | 13_ReprPC Repressed PolyComb Silver
## ... ... ... ... . ... ... ...
## [4628] chrX 154027500 * | 13_ReprPC Repressed PolyComb Silver
## [4629] chrX 154029600 * | 13_ReprPC Repressed PolyComb Silver
## [4630] chrY 6758600 * | 13_ReprPC Repressed PolyComb Silver
## [4631] chrY 6761100 * | 13_ReprPC Repressed PolyComb Silver
## [4632] chrY 6765700 * | 13_ReprPC Repressed PolyComb Silver
## color_code
## <character>
## [1] #808080
## [2] #808080
## [3] #808080
## [4] #808080
## [5] #808080
## ... ...
## [4628] #808080
## [4629] #808080
## [4630] #808080
## [4631] #808080
## [4632] #808080
## -------
## seqinfo: 298 sequences (2 circular) from hg19 genome
Now, let’s fetch the genes associated to these loci. This is easily done, again, by retrieving informatin from AnnotationHub
!
I will retrieve the human gene model from AnnotationHub
:
query(ah, c("Homo sapiens", "release-75"))[[1]] human_gtf <-
## downloading 1 resources
## retrieving 1 resource
## loading from cache
seqlevelsStyle(human_gtf) <- "UCSC"
keepStandardChromosomes(human_gtf, pruning.mode = 'coarse')
human_gtf <- human_gtf[human_gtf$type == 'gene' & human_gtf$gene_biotype == 'protein_coding']
human_genes <- resize(human_genes, 1, 'start') human_TSSs <-
I can then retrieve the closest gene for each of the transitioning loci, and filter the genes that are close to these transitioning loci.
$nearest_gene <- human_TSSs[nearest(transitioning_loci, human_TSSs)]$gene_id
transitioning_loci$distance_to_nearest_gene <- mcols(distanceToNearest(transitioning_loci, human_TSSs))$distance
transitioning_loci unique(transitioning_loci$nearest_gene[transitioning_loci$distance_to_nearest_gene <= 1000]) genes <-
Finally, a quick GO over-representation analysis will reveal whether these few genes are indeed involved in any neural process. Once again, I can retrieve ontology from human OrgDb
using AnnotationHub
:
query(ah, c('homo sapiens', 'orgdb'))[[1]] org <-
## downloading 1 resources
## retrieving 1 resource
## loading from cache
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:AnnotationHub':
##
## cache
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:dplyr':
##
## select
clusterProfiler::enrichGO(
res <-
genes, OrgDb = org,
ont = 'CC',
keyType = 'ENSEMBL',
pvalueCutoff = 0.05,
qvalueCutoff = 0.05
)
##
@result %>% dplyr::select(Description, pvalue, p.adjust, Count) %>% arrange(p.adjust) %>% filter(p.adjust < 0.05) res
## Description pvalue p.adjust
## GO:0031256 leading edge membrane 5.725696e-05 0.01691181
## GO:0032590 dendrite membrane 1.218905e-04 0.01691181
## GO:0098978 glutamatergic synapse 1.666276e-04 0.01691181
## GO:0045211 postsynaptic membrane 2.081453e-04 0.01691181
## GO:0097060 synaptic membrane 3.054852e-04 0.01792994
## GO:0150034 distal axon 3.310143e-04 0.01792994
## GO:0044306 neuron projection terminus 4.010680e-04 0.01862101
## GO:0043195 terminal bouton 5.914433e-04 0.02143279
## GO:0099699 integral component of synaptic membrane 6.319324e-04 0.02143279
## GO:0099055 integral component of postsynaptic membrane 6.846992e-04 0.02143279
## GO:0043204 perikaryon 7.386127e-04 0.02143279
## GO:0032589 neuron projection membrane 7.959278e-04 0.02143279
## GO:0098936 intrinsic component of postsynaptic membrane 8.573117e-04 0.02143279
## GO:0043679 axon terminus 9.352729e-04 0.02156744
## GO:0099240 intrinsic component of synaptic membrane 9.954205e-04 0.02156744
## GO:1902711 GABA-A receptor complex 1.220128e-03 0.02434913
## GO:0034702 ion channel complex 1.275144e-03 0.02434913
## GO:0031253 cell projection membrane 1.404759e-03 0.02434913
## GO:1902710 GABA receptor complex 1.423488e-03 0.02434913
## GO:0031252 cell leading edge 1.948625e-03 0.03166515
## GO:1902495 transmembrane transporter complex 2.231752e-03 0.03453901
## GO:1990351 transporter complex 2.656914e-03 0.03924987
## Count
## GO:0031256 10
## GO:0032590 5
## GO:0098978 14
## GO:0045211 12
## GO:0097060 14
## GO:0150034 12
## GO:0044306 8
## GO:0043195 5
## GO:0099699 8
## GO:0099055 7
## GO:0043204 8
## GO:0032589 5
## GO:0098936 7
## GO:0043679 7
## GO:0099240 8
## GO:1902711 3
## GO:0034702 11
## GO:0031253 12
## GO:1902710 3
## GO:0031252 13
## GO:1902495 11
## GO:1990351 11
Clearly, some of the genes undergoing chromatin state transition are involved in neurone cell structures!