Skip to contents

Introduction

The DNA Zoo Consortium is a collaborative group whose aim is to correct and refine genome assemblies across the tree of life using Hi-C approaches. As of 2023, they have performed Hi-C across more than 300 animal, plant and fungi species.

DNAZooData is a package giving programmatic access to these uniformly processed Hi-C contact files, as well as the refined genome assemblies.

The DNAZooData() function provides a gateway to DNA Zoo-hosted Hi-C files, fetching and caching relevant contact matrices in .hic format, and providing a direct URL to corrcted genome assemblies in fasta format. It returns a HicFile object, which can then be imported in memory using HiCExperiment::import().

HiCFile metadata also points to a URL to directly fetch the genome assembly corrected by the DNA Zoo consortium.

library(DNAZooData)
head(DNAZooData())
#>                   species                              readme
#> 1        Acinonyx_jubatus        Acinonyx_jubatus/README.json
#> 2      Acropora_millepora      Acropora_millepora/README.json
#> 3     Addax_nasomaculatus     Addax_nasomaculatus/README.json
#> 4           Aedes_aegypti           Aedes_aegypti/README.json
#> 5   Aedes_aegypti__AaegL4   Aedes_aegypti__AaegL4/README.json
#> 6 Aedes_aegypti__AaegL5.0 Aedes_aegypti__AaegL5.0/README.json
#>                                                           readme_link
#> 1        https://dnazoo.s3.wasabisys.com/Acinonyx_jubatus/README.json
#> 2      https://dnazoo.s3.wasabisys.com/Acropora_millepora/README.json
#> 3     https://dnazoo.s3.wasabisys.com/Addax_nasomaculatus/README.json
#> 4           https://dnazoo.s3.wasabisys.com/Aedes_aegypti/README.json
#> 5   https://dnazoo.s3.wasabisys.com/Aedes_aegypti__AaegL4/README.json
#> 6 https://dnazoo.s3.wasabisys.com/Aedes_aegypti__AaegL5.0/README.json
#>   original_assembly     new_assembly
#> 1           aciJub1      aciJub1_HiC
#> 2       amil_sf_1.1  amil_sf_1.1_HiC
#> 3      ASM1959352v1 ASM1959352v1_HiC
#> 4        AGWG.draft         AaegL5.0
#> 5            AaegL3           AaegL4
#> 6        AGWG.draft         AaegL5.0
#>                                                               new_assembly_link
#> 1         https://dnazoo.s3.wasabisys.com/Acinonyx_jubatus/aciJub1_HiC.fasta.gz
#> 2   https://dnazoo.s3.wasabisys.com/Acropora_millepora/amil_sf_1.1_HiC.fasta.gz
#> 3 https://dnazoo.s3.wasabisys.com/Addax_nasomaculatus/ASM1959352v1_HiC.fasta.gz
#> 4               https://dnazoo.s3.wasabisys.com/Aedes_aegypti/AaegL5.0.fasta.gz
#> 5         https://dnazoo.s3.wasabisys.com/Aedes_aegypti__AaegL4/AaegL4.fasta.gz
#> 6     https://dnazoo.s3.wasabisys.com/Aedes_aegypti__AaegL5.0/AaegL5.0.fasta.gz
#>   new_assembly_link_status
#> 1                      200
#> 2                      200
#> 3                      200
#> 4                      404
#> 5                      200
#> 6                      200
#>                                                                   hic_link
#> 1    https://dnazoo.s3.wasabisys.com/Acinonyx_jubatus/aciJub1.rawchrom.hic
#> 2   https://dnazoo.s3.wasabisys.com/Acropora_millepora/amil_sf_1.1_HiC.hic
#> 3 https://dnazoo.s3.wasabisys.com/Addax_nasomaculatus/ASM1959352v1_HiC.hic
#> 4                                                                     <NA>
#> 5         https://dnazoo.s3.wasabisys.com/Aedes_aegypti__AaegL4/AaegL4.hic
#> 6     https://dnazoo.s3.wasabisys.com/Aedes_aegypti__AaegL5.0/AaegL5.0.hic
hicfile <- DNAZooData(species = 'Hypsibius_dujardini')
S4Vectors::metadata(hicfile)$organism
#> $vernacular
#> [1] "Tardigrade"
#> 
#> $binomial
#> [1] "Hypsibius dujardini"
#> 
#> $funFact
#> [1] "<i>Hypsibius dujardini</i> is a species of tardigrade, a tiny microscopic organism. They are also commonly called water bears. This species is found world-wide!"
#> 
#> $extraInfo
#> [1] "on BioKIDS website"
#> 
#> $extraInfoLink
#> [1] "http://www.biokids.umich.edu/critters/Hypsibius_dujardini/"
#> 
#> $image
#> [1] "https://static.wixstatic.com/media/2b9330_82db39c219f24b20a75cb38943aad1fb~mv2.jpg"
#> 
#> $imageCredit
#> [1] "By Willow Gabriel, Goldstein Lab - https://www.flickr.com/photos/waterbears/1614095719/ Template:Uploader Transferred from en.wikipedia to Commons., CC BY-SA 2.5, https://commons.wikimedia.org/w/index.php?curid=2261992"
#> 
#> $isChromognomes
#> [1] "FALSE"
#> 
#> $taxonomy
#> [1] "Species:202423-914154-914155-914158-155166-155362-710171-710179-710192-155390-155420"
S4Vectors::metadata(hicfile)$assemblyURL
#> [1] "https://dnazoo.s3.wasabisys.com/Hypsibius_dujardini/nHd_3.1_HiC.fasta.gz"

Installation

DNAZooData package can be installed from Bioconductor using the following command:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DNAZooData")

HiCExperiment and DNAZooData

The HiCExperiment package can be used to import .hic files provided by DNAZooData. Refer to HiCExperiment package documentation for further information.

availableResolutions(hicfile)
#> resolutions(9): 5000 10000 ... 1e+06 2500000
#> 
availableChromosomes(hicfile)
#> Seqinfo object with 5 sequences from an unspecified genome:
#>   seqnames       seqlengths isCircular genome
#>   HiC_scaffold_1   24848249         NA   <NA>
#>   HiC_scaffold_2   19596306         NA   <NA>
#>   HiC_scaffold_3   18943121         NA   <NA>
#>   HiC_scaffold_4   17897025         NA   <NA>
#>   HiC_scaffold_5   16309062         NA   <NA>
x <- import(hicfile, resolution = 10000, focus = 'HiC_scaffold_4')
x
#> `HiCExperiment` object with 435,341 contacts over 1,790 regions 
#> -------
#> fileName: "/github/home/.cache/R/DNAZooData/645479c03f93_nHd_3.1_HiC.hic" 
#> focus: "HiC_scaffold_4" 
#> resolutions(9): 5000 10000 ... 1e+06 2500000
#> active resolution: 10000 
#> interactions: 167314 
#> scores(2): count balanced 
#> topologicalFeatures: compartments(0) borders(0) loops(0) viewpoints(0) 
#> pairsFile: N/A 
#> metadata(6): organism draftAssembly ... credits assemblyURL
interactions(x)
#> GInteractions object with 167314 interactions and 4 metadata columns:
#>                 seqnames1           ranges1          seqnames2
#>                     <Rle>         <IRanges>              <Rle>
#>        [1] HiC_scaffold_4           1-10000 --- HiC_scaffold_4
#>        [2] HiC_scaffold_4           1-10000 --- HiC_scaffold_4
#>        [3] HiC_scaffold_4           1-10000 --- HiC_scaffold_4
#>        [4] HiC_scaffold_4           1-10000 --- HiC_scaffold_4
#>        [5] HiC_scaffold_4           1-10000 --- HiC_scaffold_4
#>        ...            ...               ... ...            ...
#>   [167310] HiC_scaffold_4 17870001-17880000 --- HiC_scaffold_4
#>   [167311] HiC_scaffold_4 17870001-17880000 --- HiC_scaffold_4
#>   [167312] HiC_scaffold_4 17880001-17890000 --- HiC_scaffold_4
#>   [167313] HiC_scaffold_4 17880001-17890000 --- HiC_scaffold_4
#>   [167314] HiC_scaffold_4 17890001-17897025 --- HiC_scaffold_4
#>                      ranges2 |   bin_id1   bin_id2     count  balanced
#>                    <IRanges> | <numeric> <numeric> <numeric> <numeric>
#>        [1]           1-10000 |      6340      6340        20 162.01923
#>        [2]       10001-20000 |      6340      6341         4  10.97758
#>        [3]       20001-30000 |      6340      6342         1   2.82559
#>        [4]       30001-40000 |      6340      6343         1   3.00288
#>        [5]       80001-90000 |      6340      6348         1   2.63953
#>        ...               ... .       ...       ...       ...       ...
#>   [167310] 17880001-17890000 |      8127      8128         4   3.91164
#>   [167311] 17890001-17897025 |      8127      8129         4   7.35998
#>   [167312] 17880001-17890000 |      8128      8128       107  99.46175
#>   [167313] 17890001-17897025 |      8128      8129         8  13.99203
#>   [167314] 17890001-17897025 |      8129      8129        47 154.67041
#>   -------
#>   regions: 1790 ranges and 3 metadata columns
#>   seqinfo: 5 sequences from an unspecified genome
as(x, 'ContactMatrix')
#> class: ContactMatrix 
#> dim: 1790 1790 
#> type: dgCMatrix 
#> rownames: NULL
#> colnames: NULL
#> metadata(0):
#> regions: 1790

Session info

sessionInfo()
#> R version 4.3.0 (2023-04-21)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] DNAZooData_1.1.0    HiCExperiment_1.1.0 BiocStyle_2.29.0   
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.31.0 rjson_0.2.21               
#>  [3] xfun_0.39                   bslib_0.4.2                
#>  [5] rhdf5_2.45.0                Biobase_2.61.0             
#>  [7] lattice_0.21-8              tzdb_0.3.0                 
#>  [9] rhdf5filters_1.13.1         vctrs_0.6.2                
#> [11] tools_4.3.0                 bitops_1.0-7               
#> [13] generics_0.1.3              curl_5.0.0                 
#> [15] stats4_4.3.0                parallel_4.3.0             
#> [17] RSQLite_2.3.1               tibble_3.2.1               
#> [19] fansi_1.0.4                 blob_1.2.4                 
#> [21] pkgconfig_2.0.3             Matrix_1.5-4               
#> [23] dbplyr_2.3.2                desc_1.4.2                 
#> [25] S4Vectors_0.39.0            lifecycle_1.0.3            
#> [27] GenomeInfoDbData_1.2.10     compiler_4.3.0             
#> [29] stringr_1.5.0               textshaping_0.3.6          
#> [31] codetools_0.2-19            InteractionSet_1.29.0      
#> [33] GenomeInfoDb_1.37.0         htmltools_0.5.5            
#> [35] sass_0.4.5                  RCurl_1.98-1.12            
#> [37] yaml_2.3.7                  crayon_1.5.2               
#> [39] pkgdown_2.0.7               pillar_1.9.0               
#> [41] jquerylib_0.1.4             BiocParallel_1.35.0        
#> [43] DelayedArray_0.27.0         cachem_1.0.7               
#> [45] tidyselect_1.2.0            digest_0.6.31              
#> [47] stringi_1.7.12              dplyr_1.1.2                
#> [49] purrr_1.0.1                 bookdown_0.33              
#> [51] rprojroot_2.0.3             fastmap_1.1.1              
#> [53] grid_4.3.0                  cli_3.6.1                  
#> [55] magrittr_2.0.3              utf8_1.2.3                 
#> [57] withr_2.5.0                 filelock_1.0.2             
#> [59] bit64_4.0.5                 httr_1.4.5                 
#> [61] rmarkdown_2.21              XVector_0.41.0             
#> [63] matrixStats_0.63.0          bit_4.0.5                  
#> [65] ragg_1.2.5                  memoise_2.0.1              
#> [67] evaluate_0.20               knitr_1.42                 
#> [69] GenomicRanges_1.53.0        IRanges_2.35.0             
#> [71] BiocIO_1.11.0               BiocFileCache_2.9.0        
#> [73] rlang_1.1.0                 Rcpp_1.0.10                
#> [75] DBI_1.1.3                   glue_1.6.2                 
#> [77] BiocManager_1.30.20         BiocGenerics_0.47.0        
#> [79] vroom_1.6.1                 jsonlite_1.8.4             
#> [81] strawr_0.0.91               R6_2.5.1                   
#> [83] Rhdf5lib_1.23.0             MatrixGenerics_1.13.0      
#> [85] systemfonts_1.0.4           fs_1.6.2                   
#> [87] zlibbioc_1.47.0