HiCool
The HiCool
R/Bioconductor package provides an
end-to-end interface to process and normalize Hi-C
paired-end fastq reads into .(m)cool
files.
hicstuff
python
library (https://github.com/koszullab/hicstuff).hicstuff
.cooler
(https://github.com/open2c/cooler)
library is used to parse pairs into a multi-resolution, balanced
.mcool
file. .(m)cool
is a compact, indexed
HDF5 file format specifically tailored for efficiently storing HiC-based
data. The .(m)cool
file format was developed by Abdennur
and Mirny and published in
2019.basilisk
environment.The main processing function offered in this package is
HiCool()
. To process .fastq
reads into
.pairs
& .mcool
files, one needs to
provide:
r1
and
r2
);.fasta
sequence
file, a path to a pre-computed bowtie2
index or a supported
ID character (hg38
, mm10
, dm6
,
R64-1-1
, WBcel235
, GRCz10
,
Galgal4
);
x <- HiCool(
r1 = '<PATH-TO-R1.fq.gz>',
r2 = '<PATH-TO-R2.fq.gz>',
restriction = '<RE1(,RE2)>',
resolutions = "<resolutions of interest>",
genome = '<GENOME_ID>'
)
Here is a concrete example of Hi-C data processing.
HiContactsData
package..mcool
file will have three levels of
resolutions, from 1000bp to 8000bp.R64-1-1
, the yeast genome
reference.output/
directory.
library(HiCool)
hcf <- HiCool(
r1 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R1'),
r2 = HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R2'),
restriction = 'DpnII,HinfI',
resolutions = c(4000, 8000, 16000),
genome = 'R64-1-1',
output = './HiCool/'
)
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
#> HiCool :: Recovering bowtie2 genome index from AWS iGenomes...
#> HiCool :: Initiating processing of fastq files [tmp folder: /tmp/Rtmpn3QOpK/LCWLV7]...
#> HiCool :: Mapping fastq files...
#> HiCool :: Removing unwanted chromosomes...
#> HiCool :: Parsing pairs into .cool file...
#> HiCool :: Generating multi-resolution .mcool file...
#> HiCool :: Balancing .mcool file...
#> HiCool :: Tidying up everything for you...
#> HiCool :: .fastq to .mcool processing done!
#> HiCool :: Check ./HiCool/folder to find the generated files
#> HiCool :: Generating HiCool report. This might take a while.
#> HiCool :: Report generated and available @ /__w/HiCool/HiCool/vignettes/HiCool/1fe971273e9d_7833^mapped-R64-1-1^LCWLV7.html
#> HiCool :: All processing successfully achieved. Congrats!
hcf
#> CoolFile object
#> .mcool file: ./HiCool//matrices/1fe971273e9d_7833^mapped-R64-1-1^LCWLV7.mcool
#> resolution: 4000
#> pairs file: ./HiCool//pairs/1fe971273e9d_7833^mapped-R64-1-1^LCWLV7.pairs
#> metadata(3): log args stats
S4Vectors::metadata(hcf)
#> $log
#> [1] "./HiCool//logs/1fe971273e9d_7833^mapped-R64-1-1^LCWLV7.log"
#>
#> $args
#> $args$r1
#> [1] "/github/home/.cache/R/ExperimentHub/1fe971273e9d_7833"
#>
#> $args$r2
#> [1] "/github/home/.cache/R/ExperimentHub/1fe923fbe457_7834"
#>
#> $args$genome
#> [1] "/tmp/Rtmpn3QOpK/R64-1-1"
#>
#> $args$resolutions
#> [1] "4000"
#>
#> $args$resolutions
#> [1] "8000"
#>
#> $args$resolutions
#> [1] "16000"
#>
#> $args$restriction
#> [1] "DpnII,HinfI"
#>
#> $args$iterative
#> [1] TRUE
#>
#> $args$balancing_args
#> [1] " --min-nnz 10 --mad-max 5 "
#>
#> $args$threads
#> [1] 1
#>
#> $args$output
#> [1] "./HiCool/"
#>
#> $args$exclude_chr
#> [1] "Mito|chrM|MT"
#>
#> $args$keep_bam
#> [1] FALSE
#>
#> $args$scratch
#> [1] "/tmp/Rtmpn3QOpK"
#>
#> $args$wd
#> [1] "/__w/HiCool/HiCool/vignettes"
#>
#>
#> $stats
#> $stats$nFragments
#> [1] 1e+05
#>
#> $stats$nPairs
#> [1] 73993
#>
#> $stats$nDangling
#> [1] 10027
#>
#> $stats$nSelf
#> [1] 2205
#>
#> $stats$nDumped
#> [1] 83
#>
#> $stats$nFiltered
#> [1] 61678
#>
#> $stats$nDups
#> [1] 719
#>
#> $stats$nUnique
#> [1] 60959
#>
#> $stats$threshold_uncut
#> [1] 7
#>
#> $stats$threshold_self
#> [1] 7
Extra optional arguments can be passed to the hicstuff
workhorse library:
iterative
TRUE
): By
default, hicstuff
first truncates your set of reads to 20bp
and attempts to align the truncated reads, then moves on to aligning
40bp-truncated reads for those which could not be mapped, etc. This
procedure is longer than a traditional mapping but allows for more pairs
to be rescued. Set to FALSE
if you want to perform standard
alignment of fastq files without iterative alignment;balancing_args
" --min-nnz 10 --mad-max 5 "
): Specify here any balancing
argument to be used by cooler
when normalizing the binned
contact matrices. Full list of options available at cooler
documentation website;threads
1L
): Number of
CPUs to use to process data;exclude_chr
'Mito|chrM|MT'
): List here any chromosome you wish to
remove from the final contact matrix file;keep_bam
FALSE
): Set
to TRUE
if you wish to keep the pair of .bam
files;scratch
tempdir()
):
Points to a temporary directory to be used for processing.The important files generated by HiCool
are the
following:
<output_folder>/logs/<prefix>^mapped-<genome>^<hash>.log
<output_folder>/matrices/<prefix>^mapped-<genome>^<hash>.mcool
.pairs
file:
<output_folder>/pairs/<prefix>^mapped-<genome>^<hash>.pairs
<output_folder>/plots/<prefix>^mapped-<genome>^<hash>_*.pdf
.The diagnosis plots illustrate how pairs were filtered during the
processing, using a strategy described in
Cournac et al., BMC Genomics 2012
. The
event_distance
chart represents the frequency of
++
, +-
, -+
and --
pairs in the library, as a function of the number of restriction sites
between each end of the pairs, and shows the inferred filtering
threshold. The event_distribution
chart indicates the
proportion of each type of pairs (e.g. dangling
,
uncut
, abnormal
, …) and the total number of
pairs retained (3D intra
+ 3D inter
).
Notes:
.pairs
file format is defined by the 4DN
consortium;.(m)cool
file format is defined by cooler
authors in the supporting
publication.Processing Hi-C sequencing libraries into .pairs
and
.mcool
files requires several dependencies, to (1) align
reads to a reference genome, (2) manage alignment files (SAM), (3)
filter pairs, (4) bin them to a specific resolution and (5)
All system dependencies are internally managed by
basilisk
. HiCool
maintains a
basilisk
environment containing:
python 3.9.1
bowtie2 2.4.5
samtools 1.7
hicstuff 3.1.5
cooler 0.8.11
chromosight 1.6.3
The first time HiCool()
is executed, a fresh
basilisk
environment will be created and required
dependencies automatically installed. This ensures compatibility between
the different system dependencies needed to process Hi-C fastq
files.
sessionInfo()
#> R version 4.3.0 (2023-04-21)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics utils methods base
#>
#> other attached packages:
#> [1] HiContactsData_1.3.0 ExperimentHub_2.7.1 AnnotationHub_3.9.1
#> [4] BiocFileCache_2.9.0 dbplyr_2.3.2 BiocGenerics_0.47.0
#> [7] HiCool_1.1.0 HiCExperiment_1.1.0 BiocStyle_2.29.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.1.3 bitops_1.0-7
#> [3] rlang_1.1.1 magrittr_2.0.3
#> [5] matrixStats_0.63.0 compiler_4.3.0
#> [7] RSQLite_2.3.1 dir.expiry_1.9.0
#> [9] png_0.1-8 systemfonts_1.0.4
#> [11] vctrs_0.6.2 stringr_1.5.0
#> [13] pkgconfig_2.0.3 crayon_1.5.2
#> [15] fastmap_1.1.1 ellipsis_0.3.2
#> [17] XVector_0.41.1 rmdformats_1.0.4
#> [19] utf8_1.2.3 promises_1.2.0.1
#> [21] rmarkdown_2.21 grDevices_4.3.0
#> [23] sessioninfo_1.2.2 tzdb_0.4.0
#> [25] strawr_0.0.91 ragg_1.2.5
#> [27] purrr_1.0.1 bit_4.0.5
#> [29] xfun_0.39 zlibbioc_1.47.0
#> [31] cachem_1.0.8 GenomeInfoDb_1.37.1
#> [33] jsonlite_1.8.4 blob_1.2.4
#> [35] later_1.3.1 rhdf5filters_1.13.2
#> [37] DelayedArray_0.27.3 interactiveDisplayBase_1.39.0
#> [39] Rhdf5lib_1.23.0 BiocParallel_1.35.1
#> [41] parallel_4.3.0 R6_2.5.1
#> [43] bslib_0.4.2 stringi_1.7.12
#> [45] reticulate_1.28 GenomicRanges_1.53.1
#> [47] jquerylib_0.1.4 Rcpp_1.0.10
#> [49] bookdown_0.34 SummarizedExperiment_1.31.1
#> [51] knitr_1.42 IRanges_2.35.1
#> [53] httpuv_1.6.11 Matrix_1.5-4.1
#> [55] tidyselect_1.2.0 yaml_2.3.7
#> [57] codetools_0.2-19 curl_5.0.0
#> [59] lattice_0.21-8 tibble_3.2.1
#> [61] withr_2.5.0 KEGGREST_1.41.0
#> [63] shiny_1.7.4 InteractionSet_1.29.0
#> [65] Biobase_2.61.0 basilisk.utils_1.13.1
#> [67] evaluate_0.21 desc_1.4.2
#> [69] Biostrings_2.69.1 pillar_1.9.0
#> [71] BiocManager_1.30.20 filelock_1.0.2
#> [73] MatrixGenerics_1.13.0 stats4_4.3.0
#> [75] plotly_4.10.1 generics_0.1.3
#> [77] vroom_1.6.3 rprojroot_2.0.3
#> [79] RCurl_1.98-1.12 BiocVersion_3.18.0
#> [81] S4Vectors_0.39.1 ggplot2_3.4.2
#> [83] munsell_0.5.0 scales_1.2.1
#> [85] xtable_1.8-4 glue_1.6.2
#> [87] lazyeval_0.2.2 tools_4.3.0
#> [89] datasets_4.3.0 BiocIO_1.11.0
#> [91] data.table_1.14.8 fs_1.6.2
#> [93] rhdf5_2.45.0 grid_4.3.0
#> [95] tidyr_1.3.0 crosstalk_1.2.0
#> [97] AnnotationDbi_1.63.1 colorspace_2.1-0
#> [99] GenomeInfoDbData_1.2.10 basilisk_1.13.0
#> [101] cli_3.6.1 rappdirs_0.3.3
#> [103] textshaping_0.3.6 fansi_1.0.4
#> [105] S4Arrays_1.1.4 viridisLite_0.4.2
#> [107] dplyr_1.1.2 gtable_0.3.3
#> [109] sass_0.4.6 digest_0.6.31
#> [111] SparseArray_1.1.5 htmlwidgets_1.6.2
#> [113] memoise_2.0.1 htmltools_0.5.5
#> [115] pkgdown_2.0.7 lifecycle_1.0.3
#> [117] here_1.0.1 httr_1.4.6
#> [119] mime_0.12 bit64_4.0.5