HiCool
The HiCool R/Bioconductor package provides an
end-to-end interface to process and normalize Hi-C
paired-end fastq reads into .(m)cool files.
Read mapping and pairs filtering are handled by the hicstuff python
library (https://github.com/koszullab/hicstuff). The cooler (https://github.com/open2c/cooler)
library is used to parse pairs into a multi-resolution, balanced
.mcool file. .(m)cool is a compact, indexed
HDF5 file format specifically tailored for efficiently storing HiC-based
data. The .(m)cool file format was developed by Abdennur
and Mirny and published in
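2019. All these dependencies are managed in a basilisk environment.

The main processing function offered in this package is
HiCool(). To process .fastq reads into
.pairs & .mcool files, one needs to
provide:
- the path to each fastq file (r1 and r2);
- the genome reference, as a path to a .fasta sequence
  file, a path to a pre-computed bowtie2 index or a supported
  ID character string (hg38, mm10, dm6,
  R64-1-1, WBcel235, GRCz10,
  Galgal4);
- the restriction enzyme(s) used to generate the Hi-C library
  (restriction).
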
x <- HiCool(
r1 = '<PATH-TO-R1.fq.gz>',
r2 = '<PATH-TO-R2.fq.gz>',
restriction = '<RE1(,RE2)>',
resolutions = "<resolutions of interest>",
genome = '<GENOME_ID>'
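)

Here is a concrete example of Hi-C data processing.
- Example fastq files are retrieved using the HiContactsData package.
- The .mcool file will have three levels of resolution, from 4000bp
  to 16000bp.
- The genome reference is R64-1-1, the yeast genome reference.
- Output files are written to the ./HiCool/ directory.
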
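Multi-resolution files are generally built by aggregating the finest matrix into coarser ones, so it is worth checking that every requested resolution is an integer multiple of the smallest one before launching a long run. The sketch below does this in base R; check_resolutions() is a hypothetical helper written for illustration and is not part of HiCool.

```r
# Hypothetical helper (not a HiCool function): make sure every requested
# resolution is an integer multiple of the finest one.
check_resolutions <- function(resolutions) {
    resolutions <- sort(as.integer(resolutions))
    base_res <- resolutions[1]
    ok <- resolutions %% base_res == 0L
    if (!all(ok)) {
        stop(
            "Not multiples of ", base_res, ": ",
            paste(resolutions[!ok], collapse = ", ")
        )
    }
    resolutions
}

# Resolutions used in the example below
res <- check_resolutions(c(4000, 8000, 16000))
res
```

The returned vector can then be passed as the resolutions argument.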
library(HiCool)
hcf <- HiCool(
r1 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R1'),
r2 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R2'),
restriction = 'DpnII,HinfI',
resolutions = c(4000, 8000, 16000),
genome = 'R64-1-1',
output = './HiCool/'
)
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
#> HiCool :: Recovering bowtie2 genome index from AWS iGenomes...
#> HiCool :: Initiating processing of fastq files [tmp folder: /tmp/RtmpMpAbUV/WL4DIE]...
#> HiCool :: Mapping fastq files...
#> HiCool :: Removing unwanted chromosomes...
#> HiCool :: Parsing pairs into .cool file...
#> HiCool :: Generating multi-resolution .mcool file...
#> HiCool :: Balancing .mcool file...
#> HiCool :: Tidying up everything for you...
#> HiCool :: .fastq to .mcool processing done!
#> HiCool :: Check ./HiCool/folder to find the generated files
#> HiCool :: Generating HiCool report. This might take a while.
#> HiCool :: Report generated and available @ /__w/HiCool/HiCool/vignettes/HiCool/847264387f3_7833^mapped-R64-1-1^WL4DIE.html
#> HiCool :: All processing successfully achieved. Congrats!
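The mapping step in the log above runs with iterative = TRUE by default: reads are first truncated to 20bp, and those that cannot be mapped are retried at 40bp, and so on. The toy simulation below only illustrates that loop. The aligner is replaced by a made-up rule (the min_unique column), the 20bp increment beyond the second round is an assumption, and iterative_rounds() is a hypothetical name, not a hicstuff or HiCool function.

```r
# Toy reads: full length, and the (made-up) shortest prefix length at which
# each read becomes mappable. NA means the read never maps.
reads <- data.frame(
    id = c("a", "b", "c", "d"),
    len = c(100, 100, 60, 100),
    min_unique = c(20, 35, 60, NA)
)

iterative_rounds <- function(reads, start = 20, step = 20) {
    reads$mapped_at <- NA_real_
    trunc_len <- start
    max_len <- max(reads$len)
    repeat {
        # Only reads still unmapped are retried in this round
        todo <- is.na(reads$mapped_at)
        # Truncate each read to at most `trunc_len` bases
        eff_len <- pmin(reads$len, trunc_len)
        # Stand-in for the real aligner
        hit <- todo & !is.na(reads$min_unique) & eff_len >= reads$min_unique
        reads$mapped_at[hit] <- eff_len[hit]
        # Stop when full-length reads were tried or nothing is left to map
        if (trunc_len >= max_len || !any(is.na(reads$mapped_at))) break
        trunc_len <- trunc_len + step
    }
    reads
}

out <- iterative_rounds(reads)
out[, c("id", "mapped_at")]
```

Reads that map early are never re-aligned, which is why the procedure rescues more pairs at the cost of several alignment rounds.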
hcf
#> CoolFile object
#> .mcool file: ./HiCool//matrices/847264387f3_7833^mapped-R64-1-1^WL4DIE.mcool
#> resolution: 4000
#> pairs file: ./HiCool//pairs/847264387f3_7833^mapped-R64-1-1^WL4DIE.pairs
#> metadata(3): log args stats
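The file names printed above encode the input prefix, the genome and a run hash, separated by "^". A minimal base R sketch for recovering these fields from a path is shown here; parse_hicool_name() is a hypothetical helper, and the path is simply copied from the output above.

```r
# Path copied from the CoolFile object printed above
mcool <- "./HiCool//matrices/847264387f3_7833^mapped-R64-1-1^WL4DIE.mcool"

# Hypothetical helper: split "<prefix>^mapped-<genome>^<hash>.<ext>"
parse_hicool_name <- function(path) {
    stem <- tools::file_path_sans_ext(basename(path))
    # fixed = TRUE so that "^" is treated literally, not as a regex anchor
    parts <- strsplit(stem, "^", fixed = TRUE)[[1]]
    stopifnot(length(parts) == 3L)
    list(
        prefix = parts[1],
        genome = sub("^mapped-", "", parts[2]),
        hash = parts[3]
    )
}

info <- parse_hicool_name(mcool)
str(info)
```

The same hash appears in the log, pairs and matrix file names of a given run, so it can be used to group the outputs of that run.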
S4Vectors::metadata(hcf)
#> $log
#> [1] "./HiCool//logs/847264387f3_7833^mapped-R64-1-1^WL4DIE.log"
#>
#> $args
#> $args$r1
#> [1] "/github/home/.cache/R/ExperimentHub/847264387f3_7833"
#>
#> $args$r2
#> [1] "/github/home/.cache/R/ExperimentHub/84723a7b0539_7834"
#>
#> $args$genome
#> [1] "/tmp/RtmpMpAbUV/R64-1-1"
#>
#> $args$resolutions
#> [1] "4000"
#>
#> $args$resolutions
#> [1] "8000"
#>
#> $args$resolutions
#> [1] "16000"
#>
#> $args$restriction
#> [1] "DpnII,HinfI"
#>
#> $args$iterative
#> [1] TRUE
#>
#> $args$balancing_args
#> [1] " --min-nnz 10 --mad-max 5 "
#>
#> $args$threads
#> [1] 1
#>
#> $args$output
#> [1] "./HiCool/"
#>
#> $args$exclude_chr
#> [1] "Mito|chrM|MT"
#>
#> $args$keep_bam
#> [1] FALSE
#>
#> $args$scratch
#> [1] "/tmp/RtmpMpAbUV"
#>
#> $args$wd
#> [1] "/__w/HiCool/HiCool/vignettes"
#>
#>
#> $stats
#> $stats$nFragments
#> [1] 1e+05
#>
#> $stats$nPairs
#> [1] 73993
#>
#> $stats$nDangling
#> [1] 10027
#>
#> $stats$nSelf
#> [1] 2205
#>
#> $stats$nDumped
#> [1] 83
#>
#> $stats$nFiltered
#> [1] 61678
#>
#> $stats$nDups
#> [1] 719
#>
#> $stats$nUnique
#> [1] 60959
#>
#> $stats$threshold_uncut
#> [1] 7
#>
#> $stats$threshold_self
#> [1] 7

Extra optional arguments can be passed to the hicstuff
workhorse library:
- iterative (default: TRUE): By
  default, hicstuff first truncates your set of reads to 20bp
  and attempts to align the truncated reads, then moves on to aligning
  40bp-truncated reads for those which could not be mapped, etc. This
  procedure is longer than a traditional mapping but allows for more pairs
  to be rescued. Set to FALSE if you want to perform standard
  alignment of fastq files without iterative alignment;
- balancing_args (default: " --min-nnz 10 --mad-max 5 "): Specify here
  any balancing argument to be used by cooler when normalizing the binned
  contact matrices. Full list of options available at cooler
  documentation website;
- threads (default: 1L): Number of
  CPUs to use to process data;
- exclude_chr (default: 'Mito|chrM|MT'): List here any chromosome you
  wish to remove from the final contact matrix file;
- keep_bam (default: FALSE): Set
  to TRUE if you wish to keep the pair of .bam
  files;
- scratch (default: tempdir()):
  Points to a temporary directory to be used for processing.

The important files generated by HiCool are the
following:
- Log file:
  <output_folder>/logs/<prefix>^mapped-<genome>^<hash>.log
- .mcool file:
  <output_folder>/matrices/<prefix>^mapped-<genome>^<hash>.mcool
- .pairs file:
  <output_folder>/pairs/<prefix>^mapped-<genome>^<hash>.pairs
- Diagnosis plots:
  <output_folder>/plots/<prefix>^mapped-<genome>^<hash>_*.pdf

The diagnosis plots illustrate how pairs were filtered during the
processing, using a strategy described in
Cournac et al., BMC Genomics 2012. The
event_distance chart represents the frequency of
++, +-, -+ and --
pairs in the library, as a function of the number of restriction sites
between each end of the pairs, and shows the inferred filtering
threshold. The event_distribution chart indicates the
proportion of each type of pairs (e.g. dangling,
uncut, abnormal, …) and the total number of
pairs retained (3D intra + 3D inter).
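The same breakdown can be checked numerically from the $stats values printed earlier. In this run the counts satisfy nPairs - (nDangling + nSelf + nDumped) = nFiltered and nFiltered - nDups = nUnique. Reading this as the general accounting used by the pipeline is an assumption based on the field names and on these numbers only.

```r
# Values copied from S4Vectors::metadata(hcf)$stats in the example run
stats <- list(
    nPairs = 73993, nDangling = 10027, nSelf = 2205, nDumped = 83,
    nFiltered = 61678, nDups = 719, nUnique = 60959
)

# Assumed accounting (it holds for this run; not taken from the HiCool docs)
removed <- with(stats, nDangling + nSelf + nDumped)
stopifnot(
    stats$nPairs - removed == stats$nFiltered,
    stats$nFiltered - stats$nDups == stats$nUnique
)

# Share of the initial pairs ending up in each category
share <- with(stats, c(
    dangling = nDangling, self = nSelf, dumped = nDumped,
    duplicates = nDups, unique = nUnique
) / nPairs)
round(100 * share, 1)
```

A large dangling or self share relative to the unique share would be a reason to inspect the diagnosis plots more closely.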
Notes:
- The .pairs file format is defined by the 4DN
  consortium;
- The .(m)cool file format is defined by cooler
  authors in the supporting
  publication.

Processing Hi-C sequencing libraries into .pairs and
.mcool files requires several dependencies, to (1) align
reads to a reference genome, (2) manage alignment files (SAM), (3)
filter pairs, (4) bin them to a specific resolution and (5)
store the binned contacts in a .mcool file.
All system dependencies are internally managed by
basilisk. HiCool maintains a
basilisk environment containing:
- python 3.9.1
- bowtie2 2.4.5
- samtools 1.7
- hicstuff 3.1.5
- cooler 0.8.11
- chromosight 1.6.3

The first time HiCool() is executed, a fresh
basilisk environment will be created and required
dependencies automatically installed. This ensures compatibility between
the different system dependencies needed to process Hi-C fastq
files.

sessionInfo()
#> R Under development (unstable) (2025-03-08 r87910)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] HiContactsData_1.9.0 ExperimentHub_2.15.0 AnnotationHub_3.15.0
#> [4] BiocFileCache_2.15.1 dbplyr_2.5.0 BiocGenerics_0.53.6
#> [7] generics_0.1.3 HiCool_1.7.1 HiCExperiment_1.7.0
#> [10] BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 rlang_1.1.5
#> [3] magrittr_2.0.3 matrixStats_1.5.0
#> [5] compiler_4.5.0 RSQLite_2.3.9
#> [7] dir.expiry_1.15.0 png_0.1-8
#> [9] systemfonts_1.2.1 vctrs_0.6.5
#> [11] stringr_1.5.1 pkgconfig_2.0.3
#> [13] crayon_1.5.3 fastmap_1.2.0
#> [15] XVector_0.47.2 rmdformats_1.0.4
#> [17] rmarkdown_2.29 sessioninfo_1.2.3
#> [19] tzdb_0.5.0 UCSC.utils_1.3.1
#> [21] strawr_0.0.92 ragg_1.3.3
#> [23] purrr_1.0.4 bit_4.6.0
#> [25] xfun_0.51 cachem_1.1.0
#> [27] GenomeInfoDb_1.43.4 jsonlite_1.9.1
#> [29] blob_1.2.4 rhdf5filters_1.19.2
#> [31] DelayedArray_0.33.6 Rhdf5lib_1.29.1
#> [33] BiocParallel_1.41.2 parallel_4.5.0
#> [35] R6_2.6.1 bslib_0.9.0
#> [37] stringi_1.8.4 reticulate_1.41.0.1
#> [39] GenomicRanges_1.59.1 jquerylib_0.1.4
#> [41] Rcpp_1.0.14 bookdown_0.42
#> [43] SummarizedExperiment_1.37.0 knitr_1.50
#> [45] IRanges_2.41.3 Matrix_1.7-3
#> [47] tidyselect_1.2.1 abind_1.4-8
#> [49] yaml_2.3.10 codetools_0.2-20
#> [51] curl_6.2.1 lattice_0.22-6
#> [53] tibble_3.2.1 withr_3.0.2
#> [55] KEGGREST_1.47.0 InteractionSet_1.35.0
#> [57] Biobase_2.67.0 basilisk.utils_1.19.1
#> [59] evaluate_1.0.3 desc_1.4.3
#> [61] Biostrings_2.75.4 pillar_1.10.1
#> [63] BiocManager_1.30.25 filelock_1.0.3
#> [65] MatrixGenerics_1.19.1 stats4_4.5.0
#> [67] plotly_4.10.4 vroom_1.6.5
#> [69] BiocVersion_3.21.1 S4Vectors_0.45.4
#> [71] ggplot2_3.5.1 munsell_0.5.1
#> [73] scales_1.3.0 glue_1.8.0
#> [75] lazyeval_0.2.2 tools_4.5.0
#> [77] BiocIO_1.17.1 data.table_1.17.0
#> [79] fs_1.6.5 rhdf5_2.51.2
#> [81] grid_4.5.0 tidyr_1.3.1
#> [83] crosstalk_1.2.1 AnnotationDbi_1.69.0
#> [85] colorspace_2.1-1 GenomeInfoDbData_1.2.14
#> [87] basilisk_1.19.1 cli_3.6.4
#> [89] rappdirs_0.3.3 textshaping_1.0.0
#> [91] S4Arrays_1.7.3 viridisLite_0.4.2
#> [93] dplyr_1.1.4 gtable_0.3.6
#> [95] sass_0.4.9 digest_0.6.37
#> [97] SparseArray_1.7.6 htmlwidgets_1.6.4
#> [99] memoise_2.0.1 htmltools_0.5.8.1
#> [101] pkgdown_2.1.1 lifecycle_1.0.4
#> [103] httr_1.4.7 mime_0.13
#> [105] bit64_4.6.0-1