HiCool

Processing sequencing Hi-C libraries with `HiCool`

The HiCool R/Bioconductor package provides an end-to-end interface to process and normalize Hi-C paired-end fastq reads into .(m)cool files.

The heavy lifting (fastq mapping, pairs parsing and pairs filtering) is performed by the underlying lightweight hicstuff python library (https://github.com/koszullab/hicstuff).
Pairs filering is done using the approach described in Cournac et al., 2012 and implemented in hicstuff.
cooler (https://github.com/open2c/cooler) library is used to parse pairs into a multi-resolution, balanced .mcool file. .(m)cool is a compact, indexed HDF5 file format specifically tailored for efficiently storing HiC-based data. The .(m)cool file format was developed by Abdennur and Mirny and published in 2019.
Internally, all these external dependencies are automatically installed and managed in R by a basilisk environment.

The main processing function offered in this package is HiCool(). To process .fastq reads into .pairs & .mcool files, one needs to provide:

The path to each fastq file (r1 and r2);
The genome reference, as a path to a .fasta sequence file, a path to a pre-computed bowtie2 index or a supported ID character (hg38, mm10, dm6, R64-1-1, WBcel235, GRCz10, Galgal4);
The restriction enzyme(s) used for Hi-C.

x <- HiCool(
    r1 = '<PATH-TO-R1.fq.gz>', 
    r2 = '<PATH-TO-R2.fq.gz>', 
    restriction = '<RE1(,RE2)>', 
    resolutions = "<resolutions of interest>", 
    genome = '<GENOME_ID>'
)

Here is a concrete example of Hi-C data processing.

Example fastq files are retrieved using the HiContactsData package.
Two restriction enzymes are used (these are the enzymes used in the Arima Kit).
The final .mcool file will have three levels of resolutions, from 1000bp to 8000bp.
The data will be mapped on R64-1-1, the yeast genome reference.
All output files will be placed in output/ directory.

library(HiCool)
hcf <- HiCool(
    r1 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R1'), 
    r2 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R2'), 
    restriction = 'DpnII,HinfI', 
    resolutions = c(4000, 8000, 16000), 
    genome = 'R64-1-1', 
    output = './HiCool/'
)
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
#> HiCool :: Recovering bowtie2 genome index from AWS iGenomes...
#> HiCool :: Initiating processing of fastq files [tmp folder: /tmp/RtmpMpAbUV/WL4DIE]...
#> HiCool :: Mapping fastq files...
#> HiCool :: Removing unwanted chromosomes...
#> HiCool :: Parsing pairs into .cool file...
#> HiCool :: Generating multi-resolution .mcool file...
#> HiCool :: Balancing .mcool file...
#> HiCool :: Tidying up everything for you...
#> HiCool :: .fastq to .mcool processing done!
#> HiCool :: Check ./HiCool/folder to find the generated files
#> HiCool :: Generating HiCool report. This might take a while.
#> HiCool :: Report generated and available @ /__w/HiCool/HiCool/vignettes/HiCool/847264387f3_7833^mapped-R64-1-1^WL4DIE.html
#> HiCool :: All processing successfully achieved. Congrats!

hcf
#> CoolFile object
#> .mcool file: ./HiCool//matrices/847264387f3_7833^mapped-R64-1-1^WL4DIE.mcool 
#> resolution: 4000 
#> pairs file: ./HiCool//pairs/847264387f3_7833^mapped-R64-1-1^WL4DIE.pairs 
#> metadata(3): log args stats
S4Vectors::metadata(hcf)
#> $log
#> [1] "./HiCool//logs/847264387f3_7833^mapped-R64-1-1^WL4DIE.log"
#> 
#> $args
#> $args$r1
#> [1] "/github/home/.cache/R/ExperimentHub/847264387f3_7833"
#> 
#> $args$r2
#> [1] "/github/home/.cache/R/ExperimentHub/84723a7b0539_7834"
#> 
#> $args$genome
#> [1] "/tmp/RtmpMpAbUV/R64-1-1"
#> 
#> $args$resolutions
#> [1] "4000"
#> 
#> $args$resolutions
#> [1] "8000"
#> 
#> $args$resolutions
#> [1] "16000"
#> 
#> $args$restriction
#> [1] "DpnII,HinfI"
#> 
#> $args$iterative
#> [1] TRUE
#> 
#> $args$balancing_args
#> [1] " --min-nnz 10 --mad-max 5 "
#> 
#> $args$threads
#> [1] 1
#> 
#> $args$output
#> [1] "./HiCool/"
#> 
#> $args$exclude_chr
#> [1] "Mito|chrM|MT"
#> 
#> $args$keep_bam
#> [1] FALSE
#> 
#> $args$scratch
#> [1] "/tmp/RtmpMpAbUV"
#> 
#> $args$wd
#> [1] "/__w/HiCool/HiCool/vignettes"
#> 
#> 
#> $stats
#> $stats$nFragments
#> [1] 1e+05
#> 
#> $stats$nPairs
#> [1] 73993
#> 
#> $stats$nDangling
#> [1] 10027
#> 
#> $stats$nSelf
#> [1] 2205
#> 
#> $stats$nDumped
#> [1] 83
#> 
#> $stats$nFiltered
#> [1] 61678
#> 
#> $stats$nDups
#> [1] 719
#> 
#> $stats$nUnique
#> [1] 60959
#> 
#> $stats$threshold_uncut
#> [1] 7
#> 
#> $stats$threshold_self
#> [1] 7

Optional parameters

Extra optional arguments can be passed to the hicstuff workhorse library:

iterative (default: TRUE): By default, hicstuff first truncates your set of reads to 20bp and attempts to align the truncated reads, then moves on to aligning 40bp-truncated reads for those which could not be mapped, etc. This procedure is longer than a traditional mapping but allows for more pairs to be rescued. Set to FALSE if you want to perform standard alignment of fastq files without iterative alignment;
balancing_args (default: " --min-nnz 10 --mad-max 5 "): Specify here any balancing argument to be used by cooler when normalizing the binned contact matrices. Full list of options available at cooler documentation website;
threads (default: 1L): Number of CPUs to use to process data;
exclude_chr (default: 'Mito|chrM|MT'): List here any chromosome you wish to remove from the final contact matrix file;
keep_bam (default: FALSE): Set to TRUE if you wish to keep the pair of .bam files;
scratch (default: tempdir()): Points to a temporary directory to be used for processing.

Output files

The important files generated by HiCool are the following:

A log file: <output_folder>/logs/<prefix>^mapped-<genome>^<hash>.log
A multi-resolution, balanced contact matrix file: <output_folder>/matrices/<prefix>^mapped-<genome>^<hash>.mcool
A .pairs file: <output_folder>/pairs/<prefix>^mapped-<genome>^<hash>.pairs
Several diagnosis plots: <output_folder>/plots/<prefix>^mapped-<genome>^<hash>_*.pdf.

The diagnosis plots illustrate how pairs were filtered during the processing, using a strategy described in Cournac et al., BMC Genomics 2012. The event_distance chart represents the frequency of ++, +-, -+ and -- pairs in the library, as a function of the number of restriction sites between each end of the pairs, and shows the inferred filtering threshold. The event_distribution chart indicates the proportion of each type of pairs (e.g. dangling, uncut, abnormal, …) and the total number of pairs retained (3D intra + 3D inter).

Notes:

.pairs file format is defined by the 4DN consortium;
.(m)cool file format is defined by cooler authors in the supporting publication.

System dependencies

Processing Hi-C sequencing libraries into .pairs and .mcool files requires several dependencies, to (1) align reads to a reference genome, (2) manage alignment files (SAM), (3) filter pairs, (4) bin them to a specific resolution and (5)

All system dependencies are internally managed by basilisk. HiCool maintains a basilisk environment containing:

python 3.9.1
bowtie2 2.4.5
samtools 1.7
hicstuff 3.1.5
cooler 0.8.11
chromosight 1.6.3

The first time HiCool() is executed, a fresh basilisk environment will be created and required dependencies automatically installed. This ensures compatibility between the different system dependencies needed to process Hi-C fastq files.

Session info

sessionInfo()
#> R Under development (unstable) (2025-03-08 r87910)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] HiContactsData_1.9.0 ExperimentHub_2.15.0 AnnotationHub_3.15.0
#>  [4] BiocFileCache_2.15.1 dbplyr_2.5.0         BiocGenerics_0.53.6 
#>  [7] generics_0.1.3       HiCool_1.7.1         HiCExperiment_1.7.0 
#> [10] BiocStyle_2.35.0    
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.2.3                   rlang_1.1.5                
#>   [3] magrittr_2.0.3              matrixStats_1.5.0          
#>   [5] compiler_4.5.0              RSQLite_2.3.9              
#>   [7] dir.expiry_1.15.0           png_0.1-8                  
#>   [9] systemfonts_1.2.1           vctrs_0.6.5                
#>  [11] stringr_1.5.1               pkgconfig_2.0.3            
#>  [13] crayon_1.5.3                fastmap_1.2.0              
#>  [15] XVector_0.47.2              rmdformats_1.0.4           
#>  [17] rmarkdown_2.29              sessioninfo_1.2.3          
#>  [19] tzdb_0.5.0                  UCSC.utils_1.3.1           
#>  [21] strawr_0.0.92               ragg_1.3.3                 
#>  [23] purrr_1.0.4                 bit_4.6.0                  
#>  [25] xfun_0.51                   cachem_1.1.0               
#>  [27] GenomeInfoDb_1.43.4         jsonlite_1.9.1             
#>  [29] blob_1.2.4                  rhdf5filters_1.19.2        
#>  [31] DelayedArray_0.33.6         Rhdf5lib_1.29.1            
#>  [33] BiocParallel_1.41.2         parallel_4.5.0             
#>  [35] R6_2.6.1                    bslib_0.9.0                
#>  [37] stringi_1.8.4               reticulate_1.41.0.1        
#>  [39] GenomicRanges_1.59.1        jquerylib_0.1.4            
#>  [41] Rcpp_1.0.14                 bookdown_0.42              
#>  [43] SummarizedExperiment_1.37.0 knitr_1.50                 
#>  [45] IRanges_2.41.3              Matrix_1.7-3               
#>  [47] tidyselect_1.2.1            abind_1.4-8                
#>  [49] yaml_2.3.10                 codetools_0.2-20           
#>  [51] curl_6.2.1                  lattice_0.22-6             
#>  [53] tibble_3.2.1                withr_3.0.2                
#>  [55] KEGGREST_1.47.0             InteractionSet_1.35.0      
#>  [57] Biobase_2.67.0              basilisk.utils_1.19.1      
#>  [59] evaluate_1.0.3              desc_1.4.3                 
#>  [61] Biostrings_2.75.4           pillar_1.10.1              
#>  [63] BiocManager_1.30.25         filelock_1.0.3             
#>  [65] MatrixGenerics_1.19.1       stats4_4.5.0               
#>  [67] plotly_4.10.4               vroom_1.6.5                
#>  [69] BiocVersion_3.21.1          S4Vectors_0.45.4           
#>  [71] ggplot2_3.5.1               munsell_0.5.1              
#>  [73] scales_1.3.0                glue_1.8.0                 
#>  [75] lazyeval_0.2.2              tools_4.5.0                
#>  [77] BiocIO_1.17.1               data.table_1.17.0          
#>  [79] fs_1.6.5                    rhdf5_2.51.2               
#>  [81] grid_4.5.0                  tidyr_1.3.1                
#>  [83] crosstalk_1.2.1             AnnotationDbi_1.69.0       
#>  [85] colorspace_2.1-1            GenomeInfoDbData_1.2.14    
#>  [87] basilisk_1.19.1             cli_3.6.4                  
#>  [89] rappdirs_0.3.3              textshaping_1.0.0          
#>  [91] S4Arrays_1.7.3              viridisLite_0.4.2          
#>  [93] dplyr_1.1.4                 gtable_0.3.6               
#>  [95] sass_0.4.9                  digest_0.6.37              
#>  [97] SparseArray_1.7.6           htmlwidgets_1.6.4          
#>  [99] memoise_2.0.1               htmltools_0.5.8.1          
#> [101] pkgdown_2.1.1               lifecycle_1.0.4            
#> [103] httr_1.4.7                  mime_0.13                  
#> [105] bit64_4.6.0-1

Jacques Serizay

2025-03-18

Processing sequencing Hi-C libraries with `HiCool`

Optional parameters

Output files

System dependencies

Session info

HiCool

Jacques Serizay

2025-03-18

Processing sequencing Hi-C libraries with HiCool

Optional parameters

Output files

System dependencies

Session info

Processing sequencing Hi-C libraries with `HiCool`