Processing Hi-C paired-end fastq files in R

HiCool::HiCool() automatically processes paired-end HiC sequencing files by performing the following steps:

Automatically setting up an appropriate conda environment using basilisk;
Mapping the reads to the provided genome reference using hicstuff and filtering of irrelevant pairs;
Filtering the resulting pairs file to remove unwanted chromosomes (e.g. chrM);
Binning the filtered pairs into a cool file at a chosen resolution;
Generating a multi-resolution mcool file;
Normalizing matrices at each resolution by iterative corretion using cooler.

The filtering strategy used by hicstuff is described in Cournac et al., BMC Genomics 2012.

HiCool(
  r1,
  r2,
  genome,
  restriction = "DpnII,HinfI",
  resolutions = NULL,
  iterative = TRUE,
  balancing_args = " --min-nnz 10 --mad-max 5 ",
  threads = 1L,
  exclude_chr = "Mito|chrM|MT",
  output = "HiCool",
  keep_bam = FALSE,
  build_report = TRUE,
  scratch = tempdir()
)

importHiCoolFolder(output, hash, resolution = NULL)

getHiCoolArgs(log)

getHicStats(log)

Arguments

r1: Path to fastq file (R1 read)
r2: Path to fastq file (R2 read)
genome: Genome used to map the reads on, provided either as a fasta file (in which case the bowtie2 index will be automatically generated), or as a prefix to a bowtie2 index (e.g. mm10 for mm10.*.bt2 files). Genome can also be a unique ID for the following references: hg38, mm10, dm6, R64-1-1, GRZc10, WBcel235, Galgal4.
restriction: Restriction enzyme(s) used in HiC (Default: "DpnII,HinfI")
resolutions: Resolutions used to bin the final mcool file (Default: 5 levels of resolution automatically inferred according to genome size)
iterative: Should the read mapping be performed iteratively? (Default: TRUE)
balancing_args: Balancing arguments for cooler. See cooler documentation here for a list of all available balancing arguments. These defaults match those used by the 4DN consortium.
threads: Number of CPUs used for parallelization. (Default: 1)
exclude_chr: Chromosomes excluded from the final .mcool file. This will not affect the pairs file. (Default: "Mito|chrM|MT")
output: Output folder used by HiCool.
keep_bam: Should the bam files be kept? (Default: FALSE)
build_report: Should an automated report be computed? (Default: TRUE)
scratch: Path to temporary directory where processing will take place. (Default: tempdir())
hash: Unique 6-letter ID used to identify files from a specific HiCool processing run.
resolution: Resolution used to import the mcool file
log: Path to log file generated by hicstuff/hicool

Value

A CoolFile object with prefilled pairsFile and metadata slots.

HiCool utils

importHiCoolFolder(folder, hash) automatically finds the different processed files associated with a specific HiCool::HiCool() processing hash ID.
getHiCoolArgs() parses the log file generated by HiCool::HiCool() during processing to recover which arguments were used.
getHicStats() parses the log file generated by HiCool::HiCool() during processing to recover pre-computed stats about pair numbers, filtering thresholds, etc.

Examples

r1 <- HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R1')
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
r2 <- HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R2')
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
hcf <- HiCool(r1, r2, genome = 'R64-1-1', output = './HiCool/')
#> HiCool :: Recovering bowtie2 genome index from AWS iGenomes...
#> HiCool :: Initiating processing of fastq files [tmp folder: /tmp/RtmpiWs7nh/WL4DIE]...
#> HiCool :: Mapping fastq files...
#> HiCool :: Best-suited minimum resolution automatically inferred: 1000
#> HiCool :: Removing unwanted chromosomes...
#> HiCool :: Parsing pairs into .cool file...
#> HiCool :: Generating multi-resolution .mcool file...
#> HiCool :: Balancing .mcool file...
#> HiCool :: Tidying up everything for you...
#> HiCool :: .fastq to .mcool processing done!
#> HiCool :: Check ./HiCool/folder to find the generated files
#> HiCool :: Generating HiCool report. This might take a while.
#> HiCool :: Report generated and available @ /__w/HiCool/HiCool/docs/reference/HiCool/847264387f3_7833^mapped-R64-1-1^WL4DIE.html
#> HiCool :: All processing successfully achieved. Congrats!
hcf
#> CoolFile object
#> .mcool file: ./HiCool//matrices/847264387f3_7833^mapped-R64-1-1^WL4DIE.mcool 
#> resolution: 1000 
#> pairs file: ./HiCool//pairs/847264387f3_7833^mapped-R64-1-1^WL4DIE.pairs 
#> metadata(3): log args stats
getHiCoolArgs(S4Vectors::metadata(hcf)$log)
#> $r1
#> [1] "/github/home/.cache/R/ExperimentHub/847264387f3_7833"
#> 
#> $r2
#> [1] "/github/home/.cache/R/ExperimentHub/84723a7b0539_7834"
#> 
#> $genome
#> [1] "/tmp/RtmpiWs7nh/R64-1-1"
#> 
#> $resolutions
#> [1] NA
#> 
#> $restriction
#> [1] "DpnII,HinfI"
#> 
#> $iterative
#> [1] TRUE
#> 
#> $balancing_args
#> [1] " --min-nnz 10 --mad-max 5 "
#> 
#> $threads
#> [1] 1
#> 
#> $output
#> [1] "./HiCool/"
#> 
#> $exclude_chr
#> [1] "Mito|chrM|MT"
#> 
#> $keep_bam
#> [1] FALSE
#> 
#> $scratch
#> [1] "/tmp/RtmpiWs7nh"
#> 
#> $wd
#> [1] "/__w/HiCool/HiCool/docs/reference"
#> 
getHicStats(S4Vectors::metadata(hcf)$log)
#> $nFragments
#> [1] 1e+05
#> 
#> $nPairs
#> [1] 73993
#> 
#> $nDangling
#> [1] 10027
#> 
#> $nSelf
#> [1] 2205
#> 
#> $nDumped
#> [1] 83
#> 
#> $nFiltered
#> [1] 61678
#> 
#> $nDups
#> [1] 719
#> 
#> $nUnique
#> [1] 60959
#> 
#> $threshold_uncut
#> [1] 7
#> 
#> $threshold_self
#> [1] 7
#> 
readLines(S4Vectors::metadata(hcf)$log)
#>  [1] "HiCool working directory ::: /__w/HiCool/HiCool/docs/reference"                                  
#>  [2] "HiCool argument ::: r1: /github/home/.cache/R/ExperimentHub/847264387f3_7833"                    
#>  [3] "HiCool argument ::: r2: /github/home/.cache/R/ExperimentHub/84723a7b0539_7834"                   
#>  [4] "HiCool argument ::: genome: /tmp/RtmpiWs7nh/R64-1-1"                                             
#>  [5] "HiCool argument ::: resolutions: "                                                               
#>  [6] "HiCool argument ::: restriction: DpnII,HinfI"                                                    
#>  [7] "HiCool argument ::: iterative: TRUE"                                                             
#>  [8] "HiCool argument ::: balancing_args:  --min-nnz 10 --mad-max 5 "                                  
#>  [9] "HiCool argument ::: threads: 1"                                                                  
#> [10] "HiCool argument ::: output: ./HiCool/"                                                           
#> [11] "HiCool argument ::: exclude_chr: Mito|chrM|MT"                                                   
#> [12] "HiCool argument ::: keep_bam: FALSE"                                                             
#> [13] "HiCool argument ::: scratch: /tmp/RtmpiWs7nh"                                                    
#> [14] "----------------"                                                                                
#> [15] "## hicstuff: v3.1.5 log file"                                                                    
#> [16] "## date: 2025-03-18 15:55:44"                                                                    
#> [17] "## enzyme: DpnII,HinfI"                                                                          
#> [18] "## input1: /github/home/.cache/R/ExperimentHub/847264387f3_7833 "                                
#> [19] "## input2: /github/home/.cache/R/ExperimentHub/84723a7b0539_7834"                                
#> [20] "## ref: /tmp/RtmpiWs7nh/R64-1-1"                                                                 
#> [21] "---"                                                                                             
#> [22] "2025-03-18,15:55:47 :: INFO :: 100000 reads to parse"                                            
#> [23] "2025-03-18,15:55:47 :: INFO :: Truncating unaligned reads to 20bp and mapping."                  
#> [24] "2025-03-18,15:55:49 :: INFO :: 19256 reads left to map."                                         
#> [25] "2025-03-18,15:55:49 :: INFO :: Trying to map unaligned reads at full length (35bp)."             
#> [26] "2025-03-18,15:55:50 :: INFO :: 18387 reads left to map."                                         
#> [27] "2025-03-18,15:55:50 :: INFO :: 81613 reads aligned / 100000 total reads."                        
#> [28] "2025-03-18,15:55:50 :: INFO :: 100000 reads to parse"                                            
#> [29] "2025-03-18,15:55:50 :: INFO :: Truncating unaligned reads to 20bp and mapping."                  
#> [30] "2025-03-18,15:55:52 :: INFO :: 20617 reads left to map."                                         
#> [31] "2025-03-18,15:55:52 :: INFO :: Trying to map unaligned reads at full length (35bp)."             
#> [32] "2025-03-18,15:55:53 :: INFO :: 19373 reads left to map."                                         
#> [33] "2025-03-18,15:55:53 :: INFO :: 80627 reads aligned / 100000 total reads."                        
#> [34] "2025-03-18,15:55:57 :: INFO :: 81% reads (single ends) mapped with Q >= 10.0 (162240/200000)"    
#> [35] "2025-03-18,15:55:59 :: INFO :: Filtering with thresholds: uncuts=7 loops=7"                      
#> [36] "2025-03-18,15:56:00 :: INFO :: Proportion of inter contacts: 22.03% (intra: 48089, inter: 13589)"
#> [37] "2025-03-18,15:56:00 :: INFO :: 12315 pairs discarded: Loops: 2205, Uncuts: 10027, Weirds: 83"    
#> [38] "2025-03-18,15:56:00 :: INFO :: 61678 pairs kept (83.36%)"                                        
#> [39] "2025-03-18,15:56:02 :: INFO :: 1% PCR duplicates have been filtered out (719 / 61678 pairs) "    
#> [40] "2025-03-18,15:56:04 :: INFO :: Contact map generated after 0h 0m 19s"