8  Data gateways: accessing public Hi-C data portals

Aims

This chapter introduces two important portals hosting public Hi-C datasets: the 4DN Consortium and the DNA Zoo project. Two R packages provide programmatic access to these portals:

  • fourDNData
  • DNAZooData

The Hi-C experimental approach has gained significant traction across multiple fields related to genome biology, and several consortia have developed large-scale programs based on this technique. The fourDNData and DNAZooData R packages were designed to accelerate the investigation of chromatin structure using these public resources.

8.1 4DN data portal

The 4D Nucleome Data Coordination and Integration Center (DCIC) has developed and actively maintains a data portal providing public access to a wealth of resources to investigate 3D chromatin architecture. Notably, 3D chromatin conformation libraries relying on different technologies (“in situ” or “dilution” Hi-C, Capture Hi-C, Micro-C, DNase Hi-C, …), generated by 50+ collaborating labs, were homogeneously processed, yielding more than 350 sets of processed files.

fourDNData (read 4DN-Data) is a package giving programmatic access to these uniformly processed Hi-C contact files.

The fourDNData() function provides a gateway to 4DN-hosted Hi-C files, including contact matrices (in .hic or .mcool format) and other Hi-C derived files such as annotated compartments, domains, insulation scores, or .pairs files.

library(fourDNData)
head(fourDNData())
##  experimentSetAccession     fileType     size organism experimentType details                              dataset                                                       condition               biosource biosourceType             publication                                                                                                                                 URL
##            4DNES18BMU79        pairs 10151.53    mouse   in situ Hi-C   DpnII Hi-C on Mouse Olfactory System cells Mature olfactory sensory neurons with conditional Ldb1 knockout olfactory receptor cell  primary cell Monahan K et al. (2019) https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/49504f97-904e-48c1-8c20-1033680b66da/4DNFIC5AHBPV.pairs.gz
##            4DNES18BMU79          hic  5285.82    mouse   in situ Hi-C   DpnII Hi-C on Mouse Olfactory System cells Mature olfactory sensory neurons with conditional Ldb1 knockout olfactory receptor cell  primary cell Monahan K et al. (2019)      https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/6cd4378a-8f51-4e65-99eb-15f5c80abf8d/4DNFIT4I5C6Z.hic
##            4DNES18BMU79        mcool  6110.75    mouse   in situ Hi-C   DpnII Hi-C on Mouse Olfactory System cells Mature olfactory sensory neurons with conditional Ldb1 knockout olfactory receptor cell  primary cell Monahan K et al. (2019)    https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/01fb704f-2fd7-48c6-91af-c5f4584529ed/4DNFIVPAXJO8.mcool
##            4DNES18BMU79   boundaries     0.12    mouse   in situ Hi-C   DpnII Hi-C on Mouse Olfactory System cells Mature olfactory sensory neurons with conditional Ldb1 knockout olfactory receptor cell  primary cell Monahan K et al. (2019)   https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/5c07cdee-53e2-43e0-8853-cfe5f057b3f1/4DNFIR3XCIMA.bed.gz
##            4DNES18BMU79   insulation     7.18    mouse   in situ Hi-C   DpnII Hi-C on Mouse Olfactory System cells Mature olfactory sensory neurons with conditional Ldb1 knockout olfactory receptor cell  primary cell Monahan K et al. (2019)       https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/d1f4beb9-701f-4188-abe2-6271fe658770/4DNFIXKKNMS7.bw
##            4DNES18BMU79 compartments     0.18    mouse   in situ Hi-C   DpnII Hi-C on Mouse Olfactory System cells Mature olfactory sensory neurons with conditional Ldb1 knockout olfactory receptor cell  primary cell Monahan K et al. (2019)       https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/3d429647-51c8-4e3a-a18b-eec0b1480905/4DNFIN13N8C1.bw
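Because the catalog is printed as a regular table, it can presumably be inspected with base R before anything is downloaded. The sketch below is only an illustration under stated assumptions: it rebuilds three rows of the output above by hand so that it runs offline, it assumes the catalog behaves like a data.frame with the column names shown above, and `files_for()` is a hypothetical helper that is not part of fourDNData.

```r
# Three rows copied from the catalog printed above (a subset of columns only).
# With the package installed, `catalog <- fourDNData()` would be used instead,
# assuming it returns a data.frame with these column names.
catalog <- data.frame(
    experimentSetAccession = c("4DNES18BMU79", "4DNES18BMU79", "4DNES18BMU79"),
    fileType = c("pairs", "mcool", "boundaries"),
    size = c(10151.53, 6110.75, 0.12),
    organism = c("mouse", "mouse", "mouse"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: list the files available for one accession,
# smallest first, so that large downloads can be spotted in advance.
files_for <- function(catalog, accession) {
    hits <- catalog[catalog$experimentSetAccession == accession, ]
    hits[order(hits$size), c("fileType", "size")]
}

available <- files_for(catalog, "4DNES18BMU79")
print(available)
```

Sorting by the size column makes it easy to see which file types are cheap to fetch (here, boundaries) and which are not (here, pairs).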

8.1.1 Querying individual files

The fourDNData() function can be used to directly fetch specific files from the 4DN data portal:

cf <- fourDNData(experimentSetAccession = '4DNES25ABNZ1', type = 'mcool')
##  |===================================|  100%

This effectively downloads and caches the queried file locally.

cf
## [1] "/home/rsg/.cache/R/fourDNData/698470d302f0_4DNFI4988896.mcool"

availableChromosomes(cf)
## Seqinfo object with 24 sequences from an unspecified genome:
##   seqnames seqlengths isCircular genome
##   chr1      248956422       <NA>   <NA>
##   chr2      242193529       <NA>   <NA>
##   chr3      198295559       <NA>   <NA>
##   chr4      190214555       <NA>   <NA>
##   chr5      181538259       <NA>   <NA>
##   ...             ...        ...    ...
##   chr20      64444167       <NA>   <NA>
##   chr21      46709983       <NA>   <NA>
##   chr22      50818468       <NA>   <NA>
##   chrX      156040895       <NA>   <NA>
##   chrY       57227415       <NA>   <NA>

availableResolutions(cf)
## resolutions(13): 1000 2000 ... 5000000 10000000

import(cf, focus = "chr4:10000001-20000000", resolution = 5000)
## `HiCExperiment` object with 14,682 contacts over 2,000 regions
## -------
## fileName: "/home/rsg/.cache/R/fourDNData/29051ff3104c_4DNFINSF15ZM.mcool"
## focus: "chr4:10,000,001-20,000,000"
## resolutions(13): 1000 2000 ... 5000000 10000000
## active resolution: 5000
## interactions: 12016
## scores(2): count balanced
## topologicalFeatures: compartments(0) borders(0) loops(0) viewpoints(0)
## pairsFile: N/A
## metadata(0):
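
The number of regions reported above can be checked by hand from the focus width and the resolution. A quick base R sketch, where `n_bins()` is a hypothetical helper written for this illustration and the focus is assumed to follow the chr:start-end form with 1-based, inclusive coordinates:

```r
# Hypothetical helper: number of genomic bins covering a focus such as
# "chr4:10000001-20000000" at a given resolution (bin width in bp).
n_bins <- function(focus, resolution) {
    coords <- strsplit(sub("^[^:]+:", "", focus), "-")[[1]]
    start <- as.numeric(coords[1])
    end <- as.numeric(coords[2])
    width <- end - start + 1        # 1-based, inclusive coordinates
    ceiling(width / resolution)     # a partial last bin still counts
}

n_bins("chr4:10000001-20000000", 5000)
```

A 10 Mb focus split into 5 kb bins gives 2000 bins, consistent with the 2,000 regions listed in the imported object.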

Different Hi-C related genomic files are provided by the 4DN consortium. The type of file to fetch can be specified with the type argument:

  • type = 'pairs' will fetch the .pairs file generated by the 4DN consortium, i.e. the list of individual contacts that was subsequently binned into a contact matrix. Once fetched from the 4DN data portal, the local file can be imported in R using the import function, which will generate a GInteractions object.
fourDNData(experimentSetAccession = '4DNES25ABNZ1', type = 'pairs') |> 
    import()
## GInteractions object with 13821669 interactions and 3 metadata columns:
##              seqnames1   ranges1     seqnames2   ranges2 |       frag1     frag2  distance
##                  <Rle> <IRanges>         <Rle> <IRanges> | <character> <numeric> <integer>
##          [1]      chr1   3000003 ---      chr1  88307603 |          UU        22  85307600
##          [2]      chr1   3000022 ---      chr1  28227919 |          UU        50  25227897
##          [3]      chr1   3000023 ---      chr1  50187758 |          RU        35  47187735
##          [4]      chr1   3000024 ---      chr1   4090828 |          RU         9   1090804
##          [5]      chr1   3000024 ---      chr1  35080614 |          UU         3  32080590
##          ...       ...       ... ...       ...       ... .         ...       ...       ...
##   [13821665]      chr1  24472292 ---      chr1  24472986 |          UU        60       694
##   [13821666]      chr1  24472292 ---      chr1  24805552 |          RU        60    333260
##   [13821667]      chr1  24472292 ---      chr1  24874144 |          UU        60    401852
##   [13821668]      chr1  24472294 ---      chr1 115668400 |          UU        60  91196106
##   [13821669]      chr1  24472295 ---      chr1  43307467 |          UU        60  18835172
Watch out

.pairs files can be particularly large: they take a long time to download and have a large storage footprint.

  • type = 'insulation' will fetch a .bigwig track file precomputed by the 4DN consortium. This track corresponds to the genome-wide insulation score computed by cooltools as described in Crane et al. (2015). To know more about this, read the excerpt from the 4DN data portal. Once fetched from the 4DN data portal, the local file can be imported in R using the import function, which will generate an RleList object.
fourDNData(experimentSetAccession = '4DNES25ABNZ1', type = 'insulation') |> 
    import(as = 'Rle')
##  |===================================|  100%
##  RleList of length 21
##  $chr1
##  numeric-Rle of length 195471971 with 38145 runs
##    Lengths:      3065000         5000 ...         5000       171971
##    Values :  0.00000e+00  1.01441e-01 ...     0.807009     0.000000
##  
##  $chr10
##  numeric-Rle of length 130694993 with 25100 runs
##    Lengths:     3175000        5000        5000 ...        5000      169993
##    Values :  0.00000000  0.37584546  0.33597839 ...    0.628601    0.000000
##  
##  $chr11
##  numeric-Rle of length 122082543 with 23536 runs
##    Lengths:    3165000       5000       5000 ...       5000     162543
##    Values :  0.0000000 -0.7906257 -0.7930040 ...   0.515919   0.000000
##
##  ...
  • type = 'boundaries' will fetch a .bed file precomputed by the 4DN consortium, listing the annotated borders between topological domains. These borders correspond to local minima identified from the genome-wide insulation track. It can also be imported in R using the import function, which will generate a GRanges object.
fourDNData(experimentSetAccession = '4DNES25ABNZ1', type = 'boundaries') |> 
    import()
##  |===================================|  100%
##  GRanges object with 6103 ranges and 2 metadata columns:
##           seqnames            ranges strand |        name     score
##              <Rle>         <IRanges>  <Rle> | <character> <numeric>
##       [1]     chr1   4380001-4385000      * |      Strong  0.695274
##       [2]     chr1   4760001-4765000      * |        Weak  0.444476
##       [3]     chr1   4910001-4915000      * |        Weak  0.353184
##       [4]     chr1   5180001-5185000      * |      Strong  0.565763
##       [5]     chr1   6170001-6175000      * |      Strong  1.644911
##       ...      ...               ...    ... .         ...       ...
##    [6099]     chrY 89725001-89730000      * |        Weak  0.258094
##    [6100]     chrY 89790001-89795000      * |        Weak  0.442186
##    [6101]     chrY 89895001-89900000      * |        Weak  0.279879
##    [6102]     chrY 90025001-90030000      * |      Strong  0.660699
##    [6103]     chrY 90410001-90415000      * |      Strong  1.160018
  • type = 'compartments' will fetch a .bigwig track file precomputed by the 4DN consortium. This track corresponds to the selected genome-wide eigenvector computed by cooltools and representing A/B compartments. To know more about this, read the excerpt from the 4DN data portal. Once fetched from the 4DN data portal, the local file can be imported in R using the import function, which will generate an RleList object. The score represents the eigenvector values, and by convention a genomic bin with a positive score is associated with the A compartment whereas a genomic bin with a negative score is associated with the B compartment.
fourDNData(experimentSetAccession = '4DNES25ABNZ1', type = 'compartments') |> 
    import(as = 'Rle')
##  |===================================|  100%
##  RleList of length 21
##  $chr1
##  numeric-Rle of length 195471971 with 771 runs
##    Lengths:     3000000      250000      250000 ...      250000      221971
##    Values :         NaN -0.83457172 -0.98202854 ...  0.45792237         NaN
##  
##  $chr10
##  numeric-Rle of length 130694993 with 512 runs
##    Lengths:     3000000      250000      250000 ...      250000      194993
##    Values :         NaN -0.99524581 -0.76405841 ...   0.0583894         NaN
##  
##  $chr11
##  numeric-Rle of length 122082543 with 478 runs
##    Lengths:     3000000      250000      250000 ...      250000       82543
##    Values :         NaN -0.00653325  0.26659977 ...  0.25900587         NaN
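
The sign convention described for the compartment track can be applied directly to eigenvector values. A minimal base R sketch, using the chr1 values printed above; `label_compartment()` is a hypothetical helper, and treating zero or missing scores as unassigned is an assumption made here rather than a rule stated by the 4DN consortium:

```r
# Eigenvector values copied from the chr1 output above (NaN marks bins
# without a compartment score).
ev <- c(NaN, -0.83457172, -0.98202854, 0.45792237, NaN)

# Hypothetical helper applying the sign convention described in the text:
# positive -> "A", negative -> "B", missing or zero -> NA.
label_compartment <- function(score) {
    out <- rep(NA_character_, length(score))
    out[!is.na(score) & score > 0] <- "A"
    out[!is.na(score) & score < 0] <- "B"
    out
}

label_compartment(ev)
```

The same function could in principle be applied to each chromosome of the imported track after converting its runs to a plain numeric vector.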

8.1.2 Querying a complete experiment dataset

Rather than importing the files associated with a single experimentSet accession ID one by one, one can import all of them into a HiCExperiment object at once with the fourDNHiCExperiment() function.

hic <- fourDNHiCExperiment('4DNESSS7VU57')
## Fetching Hi-C contact map from 4DN portal
##   |===================================================================| 100%
## 
## Compartments not found for the provided experimentSet accession.
## Fetching insulation bigwig file from 4DN portal
##   |===================================================================| 100%
## 
## Fetching borders bed file from 4DN portal
##   |===================================================================| 100%

This is a more efficient way to import datasets, as it aggregates the different bits together into a single HiCExperiment object with populated topologicalFeatures and metadata slots.

hic
## `HiCExperiment` object with 544,370,135 contacts over 286 regions
## -------
## fileName: "/home/rsg/.cache/R/fourDNData/392eba3a587_4DNFIZ59STGB.mcool"
## focus: "whole genome"
## resolutions(13): 1000 2000 ... 5000000 10000000
## active resolution: 10000000
## interactions: 40088
## scores(2): count balanced
## topologicalFeatures: compartments(0) borders(7887)
## pairsFile: N/A
## metadata(2): 4DN_info diamond_insulation
metadata(hic)
##  $`4DN_info`
##  experimentSetAccession   fileType    size organism experimentType        details                             dataset               condition      biosource biosourceType                publication  URL
##            4DNESSS7VU57      pairs 9731.58    mouse   in situ Hi-C Arima - A1, A2 Hi-C on mouse somatic gonadal cells  female granulosa cells granulosa cell  primary cell  Lindeman RE et al. (2021)  https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/b7da6d89-9e24-48a7-b3a6-f30a49c843e3/4DNFI2PHVZ5S.pairs.gz
##            4DNESSS7VU57        hic 4160.17    mouse   in situ Hi-C Arima - A1, A2 Hi-C on mouse somatic gonadal cells  female granulosa cells granulosa cell  primary cell  Lindeman RE et al. (2021)  https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/327f091d-6a63-47c4-9752-2dff303a13d9/4DNFI6GFHB6G.hic
##            4DNESSS7VU57      mcool 2863.90    mouse   in situ Hi-C Arima - A1, A2 Hi-C on mouse somatic gonadal cells  female granulosa cells granulosa cell  primary cell  Lindeman RE et al. (2021)  https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/2fed1cf8-b334-4165-a32f-df3f9ae4d6d7/4DNFIZ59STGB.mcool
##            4DNESSS7VU57 insulation    7.25    mouse   in situ Hi-C Arima - A1, A2 Hi-C on mouse somatic gonadal cells  female granulosa cells granulosa cell  primary cell  Lindeman RE et al. (2021)  https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/88e1d2ad-4d59-4c6f-9793-4fd8afc74762/4DNFI65DQZJ7.bw
##            4DNESSS7VU57 boundaries    0.12    mouse   in situ Hi-C Arima - A1, A2 Hi-C on mouse somatic gonadal cells  female granulosa cells granulosa cell  primary cell  Lindeman RE et al. (2021)  https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/2a29eec0-f551-4d50-8c0a-f4d4c2acd0db/4DNFIV519AWN.bed.gz
##  
##  $diamond_insulation
##  RleList of length 20
##      $chr1
##      numeric-Rle of length 195471971 with 37959 runs
##        Lengths:    3085000       5000       5000 ...       5000     186971
##        Values :  0.0000000  0.3967191  0.3961740 ...  0.8223819  0.0000000
##      
##      $chr10
##      numeric-Rle of length 130694993 with 24994 runs
##        Lengths:      3180000         5000 ...         5000       179993
##        Values :  0.00000e+00  5.35871e-01 ...   0.60626638   0.00000000
## 
##      ...

8.2 DNA Zoo

The DNA Zoo Consortium is a collaborative group whose aim is to correct and refine genome assemblies across the tree of life using Hi-C approaches. As of 2023, they have performed Hi-C across more than 300 animal, plant and fungal species.

DNAZooData is a package giving programmatic access to these uniformly processed Hi-C contact files, as well as the refined genome assemblies.

The DNAZooData() function provides a gateway to DNA Zoo-hosted Hi-C files, fetching and caching relevant contact matrices in .hic format. It returns a HicFile object, which can then be imported in memory using import().

library(DNAZooData)
head(DNAZooData())
##                  species                              readme                                                         readme_link original_assembly     new_assembly                                                             new_assembly_link new_assembly_link_status                                                                 hic_link
##         Acinonyx_jubatus        Acinonyx_jubatus/README.json        https://dnazoo.s3.wasabisys.com/Acinonyx_jubatus/README.json           aciJub1      aciJub1_HiC         https://dnazoo.s3.wasabisys.com/Acinonyx_jubatus/aciJub1_HiC.fasta.gz                      200    https://dnazoo.s3.wasabisys.com/Acinonyx_jubatus/aciJub1.rawchrom.hic
##       Acropora_millepora      Acropora_millepora/README.json      https://dnazoo.s3.wasabisys.com/Acropora_millepora/README.json       amil_sf_1.1  amil_sf_1.1_HiC   https://dnazoo.s3.wasabisys.com/Acropora_millepora/amil_sf_1.1_HiC.fasta.gz                      200   https://dnazoo.s3.wasabisys.com/Acropora_millepora/amil_sf_1.1_HiC.hic
##      Addax_nasomaculatus     Addax_nasomaculatus/README.json     https://dnazoo.s3.wasabisys.com/Addax_nasomaculatus/README.json      ASM1959352v1 ASM1959352v1_HiC https://dnazoo.s3.wasabisys.com/Addax_nasomaculatus/ASM1959352v1_HiC.fasta.gz                      200 https://dnazoo.s3.wasabisys.com/Addax_nasomaculatus/ASM1959352v1_HiC.hic
##            Aedes_aegypti           Aedes_aegypti/README.json           https://dnazoo.s3.wasabisys.com/Aedes_aegypti/README.json        AGWG.draft         AaegL5.0               https://dnazoo.s3.wasabisys.com/Aedes_aegypti/AaegL5.0.fasta.gz                      404                                                                     <NA>
##    Aedes_aegypti__AaegL4   Aedes_aegypti__AaegL4/README.json   https://dnazoo.s3.wasabisys.com/Aedes_aegypti__AaegL4/README.json            AaegL3           AaegL4         https://dnazoo.s3.wasabisys.com/Aedes_aegypti__AaegL4/AaegL4.fasta.gz                      200         https://dnazoo.s3.wasabisys.com/Aedes_aegypti__AaegL4/AaegL4.hic
##  Aedes_aegypti__AaegL5.0 Aedes_aegypti__AaegL5.0/README.json https://dnazoo.s3.wasabisys.com/Aedes_aegypti__AaegL5.0/README.json        AGWG.draft         AaegL5.0     https://dnazoo.s3.wasabisys.com/Aedes_aegypti__AaegL5.0/AaegL5.0.fasta.gz                      200     https://dnazoo.s3.wasabisys.com/Aedes_aegypti__AaegL5.0/AaegL5.0.hic
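
Not every catalog entry lists a Hi-C file: in the output above, the Aedes_aegypti row has a status of 404 and no hic_link. The sketch below shows one way to keep only usable entries. It rebuilds three rows of that output by hand so that it runs offline, it assumes the catalog behaves like a data.frame with the column names shown above and that new_assembly_link_status holds an HTTP status code, and `has_hic()` is a hypothetical helper that is not part of DNAZooData.

```r
# Three rows copied from the catalog printed above (a subset of columns only).
# With the package installed, `catalog <- DNAZooData()` would be used instead,
# assuming it returns a data.frame with these column names.
catalog <- data.frame(
    species = c("Acinonyx_jubatus", "Aedes_aegypti", "Aedes_aegypti__AaegL5.0"),
    new_assembly_link_status = c(200, 404, 200),
    hic_link = c(
        "https://dnazoo.s3.wasabisys.com/Acinonyx_jubatus/aciJub1.rawchrom.hic",
        NA,
        "https://dnazoo.s3.wasabisys.com/Aedes_aegypti__AaegL5.0/AaegL5.0.hic"
    ),
    stringsAsFactors = FALSE
)

# Hypothetical helper: keep entries whose assembly link responded with
# status 200 (assumed to be an HTTP code) and that list a .hic file.
has_hic <- function(catalog) {
    ok <- catalog$new_assembly_link_status == 200 & !is.na(catalog$hic_link)
    catalog[ok, ]
}

has_hic(catalog)$species
```

The species names that remain are the ones that can reasonably be passed to the species argument.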

For example, we can directly fetch a Hi-C dataset generated from a tardigrade sample by specifying the right species argument.

hicfile <- DNAZooData(species = 'Hypsibius_dujardini')
##  Fetching Hi-C data from DNAZoo
##  |===================================|  100%
hicfile
##  HicFile object
##  .hic file: /home/rsg/.cache/R/DNAZooData/400d7e2b0145_nHd_3.1_HiC.hic
##  resolution: 5000
##  pairs file:
##  metadata(6): organism draftAssembly ... credits assemblyURL

Here again, the resulting HicFile is populated with metadata parsed from the DNA Zoo data portal.

metadata(hicfile)$organism
##  $vernacular
##  [1] "Tardigrade"
##  
##  $binomial
##  [1] "Hypsibius dujardini"
##  
##  $funFact
##  [1] "<i>Hypsibius dujardini</i> is a species of tardigrade, a tiny microscopic organism. They are also commonly called water bears. This species is found world-wide!"
##  
##  $extraInfo
##  [1] "on BioKIDS website"
##  
##  $extraInfoLink
##  [1] "http://www.biokids.umich.edu/critters/Hypsibius_dujardini/"
##  
##  $image
##  [1] "https://static.wixstatic.com/media/2b9330_82db39c219f24b20a75cb38943aad1fb~mv2.jpg"
##  
##  $imageCredit
##  [1] "By Willow Gabriel, Goldstein Lab - https://www.flickr.com/photos/waterbears/1614095719/ Template:Uploader Transferred from en.wikipedia to Commons., CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curi
##  d=2261992"
##  
##  $isChromognomes
##  [1] "FALSE"
##  
##  $taxonomy
##  [1] "Species:202423-914154-914155-914158-155166-155362-710171-710179-710192-155390-155420"

The HicFile metadata also provides a URL to directly fetch the genome assembly corrected by the DNA Zoo consortium.

metadata(hicfile)$assemblyURL
##  [1] "https://dnazoo.s3.wasabisys.com/Hypsibius_dujardini/nHd_3.1_HiC.fasta.gz"

References

Crane, E., Bian, Q., McCord, R. P., Lajoie, B. R., Wheeler, B. S., Ralston, E. J., Uzawa, S., Dekker, J., & Meyer, B. J. (2015). Condensin-driven remodelling of X chromosome topology during dosage compensation. Nature, 523(7559), 240–244. https://doi.org/10.1038/nature14450