HiCool::HiCool()
automatically processes paired-end HiC sequencing files
by performing the following steps:
Automatically setting up an appropriate conda environment using basilisk;
Mapping the reads to the provided genome reference using hicstuff
and filtering of irrelevant pairs;
Filtering the resulting pairs file to remove unwanted chromosomes (e.g. chrM);
Binning the filtered pairs into a cool file at a chosen resolution;
Generating a multi-resolution mcool file;
Normalizing matrices at each resolution by iterative corretion using cooler.
The filtering strategy used by hicstuff
is described in Cournac et al., BMC Genomics 2012.
HiCool(
r1,
r2,
genome,
restriction = "DpnII,HinfI",
resolutions = NULL,
iterative = TRUE,
balancing_args = " --min-nnz 10 --mad-max 5 ",
threads = 1L,
exclude_chr = "Mito|chrM|MT",
output = "HiCool",
keep_bam = FALSE,
build_report = TRUE,
scratch = tempdir()
)
importHiCoolFolder(output, hash, resolution = NULL)
getHiCoolArgs(log)
getHicStats(log)
Path to fastq file (R1 read)
Path to fastq file (R2 read)
Genome used to map the reads on, provided either
as a fasta file (in which case the bowtie2 index will be automatically
generated), or as a prefix to a bowtie2 index (e.g. mm10
for
mm10.*.bt2
files). Genome can also be a unique ID for the following
references: hg38
, mm10
, dm6
, R64-1-1
, GRZc10
, WBcel235
,
Galgal4
.
Restriction enzyme(s) used in HiC (Default: "DpnII,HinfI")
Resolutions used to bin the final mcool file (Default: 5 levels of resolution automatically inferred according to genome size)
Should the read mapping be performed iteratively? (Default: TRUE)
Balancing arguments for cooler.
See cooler
documentation here
for a list of all available balancing arguments.
These defaults match those used by the 4DN consortium.
Number of CPUs used for parallelization. (Default: 1)
Chromosomes excluded from the final .mcool file. This will not affect the pairs file. (Default: "Mito|chrM|MT")
Output folder used by HiCool.
Should the bam files be kept? (Default: FALSE)
Should an automated report be computed? (Default: TRUE)
Path to temporary directory where processing will take place.
(Default: tempdir()
)
Unique 6-letter ID used to identify files from a specific HiCool processing run.
Resolution used to import the mcool file
Path to log file generated by hicstuff/hicool
A CoolFile
object with prefilled pairsFile
and metadata
slots.
importHiCoolFolder(folder, hash)
automatically finds the different processed files
associated with a specific HiCool::HiCool() processing hash ID.
getHiCoolArgs() parses the log file generated by HiCool::HiCool() during processing to recover which arguments were used.
getHicStats() parses the log file generated by HiCool::HiCool() during processing to recover pre-computed stats about pair numbers, filtering thresholds, etc.
r1 <- HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R1')
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
r2 <- HiContactsData::HiContactsData(sample = 'yeast_wt', format = 'fastq_R2')
#> see ?HiContactsData and browseVignettes('HiContactsData') for documentation
#> loading from cache
hcf <- HiCool(r1, r2, genome = 'R64-1-1', output = './HiCool/')
#> HiCool :: Recovering bowtie2 genome index from AWS iGenomes...
#> HiCool :: Initiating processing of fastq files [tmp folder: /tmp/RtmpLG33tZ/DF8U6P]...
#> HiCool :: Mapping fastq files...
#> HiCool :: Best-suited minimum resolution automatically inferred: 1000
#> HiCool :: Removing unwanted chromosomes...
#> HiCool :: Parsing pairs into .cool file...
#> HiCool :: Generating multi-resolution .mcool file...
#> HiCool :: Balancing .mcool file...
#> HiCool :: Tidying up everything for you...
#> HiCool :: .fastq to .mcool processing done!
#> HiCool :: Check ./HiCool/folder to find the generated files
#> HiCool :: Generating HiCool report. This might take a while.
#> HiCool :: Report generated and available @ /__w/HiCool/HiCool/docs/reference/HiCool/1fe971273e9d_7833^mapped-R64-1-1^DF8U6P.html
#> HiCool :: All processing successfully achieved. Congrats!
hcf
#> CoolFile object
#> .mcool file: ./HiCool//matrices/1fe971273e9d_7833^mapped-R64-1-1^DF8U6P.mcool
#> resolution: 1000
#> pairs file: ./HiCool//pairs/1fe971273e9d_7833^mapped-R64-1-1^DF8U6P.pairs
#> metadata(3): log args stats
getHiCoolArgs(S4Vectors::metadata(hcf)$log)
#> $r1
#> [1] "/github/home/.cache/R/ExperimentHub/1fe971273e9d_7833"
#>
#> $r2
#> [1] "/github/home/.cache/R/ExperimentHub/1fe923fbe457_7834"
#>
#> $genome
#> [1] "/tmp/RtmpLG33tZ/R64-1-1"
#>
#> $resolutions
#> [1] NA
#>
#> $restriction
#> [1] "DpnII,HinfI"
#>
#> $iterative
#> [1] TRUE
#>
#> $balancing_args
#> [1] " --min-nnz 10 --mad-max 5 "
#>
#> $threads
#> [1] 1
#>
#> $output
#> [1] "./HiCool/"
#>
#> $exclude_chr
#> [1] "Mito|chrM|MT"
#>
#> $keep_bam
#> [1] FALSE
#>
#> $scratch
#> [1] "/tmp/RtmpLG33tZ"
#>
#> $wd
#> [1] "/__w/HiCool/HiCool/docs/reference"
#>
getHicStats(S4Vectors::metadata(hcf)$log)
#> $nFragments
#> [1] 1e+05
#>
#> $nPairs
#> [1] 73993
#>
#> $nDangling
#> [1] 10027
#>
#> $nSelf
#> [1] 2205
#>
#> $nDumped
#> [1] 83
#>
#> $nFiltered
#> [1] 61678
#>
#> $nDups
#> [1] 719
#>
#> $nUnique
#> [1] 60959
#>
#> $threshold_uncut
#> [1] 7
#>
#> $threshold_self
#> [1] 7
#>
readLines(S4Vectors::metadata(hcf)$log)
#> [1] "HiCool working directory ::: /__w/HiCool/HiCool/docs/reference"
#> [2] "HiCool argument ::: r1: /github/home/.cache/R/ExperimentHub/1fe971273e9d_7833"
#> [3] "HiCool argument ::: r2: /github/home/.cache/R/ExperimentHub/1fe923fbe457_7834"
#> [4] "HiCool argument ::: genome: /tmp/RtmpLG33tZ/R64-1-1"
#> [5] "HiCool argument ::: resolutions: "
#> [6] "HiCool argument ::: restriction: DpnII,HinfI"
#> [7] "HiCool argument ::: iterative: TRUE"
#> [8] "HiCool argument ::: balancing_args: --min-nnz 10 --mad-max 5 "
#> [9] "HiCool argument ::: threads: 1"
#> [10] "HiCool argument ::: output: ./HiCool/"
#> [11] "HiCool argument ::: exclude_chr: Mito|chrM|MT"
#> [12] "HiCool argument ::: keep_bam: FALSE"
#> [13] "HiCool argument ::: scratch: /tmp/RtmpLG33tZ"
#> [14] "----------------"
#> [15] "## hicstuff: v3.1.5 log file"
#> [16] "## date: 2023-05-22 12:14:05"
#> [17] "## enzyme: DpnII,HinfI"
#> [18] "## input1: /github/home/.cache/R/ExperimentHub/1fe971273e9d_7833 "
#> [19] "## input2: /github/home/.cache/R/ExperimentHub/1fe923fbe457_7834"
#> [20] "## ref: /tmp/RtmpLG33tZ/R64-1-1"
#> [21] "---"
#> [22] "2023-05-22,12:14:08 :: INFO :: 100000 reads to parse"
#> [23] "2023-05-22,12:14:08 :: INFO :: Truncating unaligned reads to 20bp and mapping."
#> [24] "2023-05-22,12:14:11 :: INFO :: 19256 reads left to map."
#> [25] "2023-05-22,12:14:11 :: INFO :: Trying to map unaligned reads at full length (35bp)."
#> [26] "2023-05-22,12:14:12 :: INFO :: 18387 reads left to map."
#> [27] "2023-05-22,12:14:12 :: INFO :: 81613 reads aligned / 100000 total reads."
#> [28] "2023-05-22,12:14:13 :: INFO :: 100000 reads to parse"
#> [29] "2023-05-22,12:14:13 :: INFO :: Truncating unaligned reads to 20bp and mapping."
#> [30] "2023-05-22,12:14:16 :: INFO :: 20617 reads left to map."
#> [31] "2023-05-22,12:14:16 :: INFO :: Trying to map unaligned reads at full length (35bp)."
#> [32] "2023-05-22,12:14:16 :: INFO :: 19373 reads left to map."
#> [33] "2023-05-22,12:14:17 :: INFO :: 80627 reads aligned / 100000 total reads."
#> [34] "2023-05-22,12:14:21 :: INFO :: 81% reads (single ends) mapped with Q >= 10.0 (162240/200000)"
#> [35] "2023-05-22,12:14:24 :: INFO :: Filtering with thresholds: uncuts=7 loops=7"
#> [36] "2023-05-22,12:14:25 :: INFO :: Proportion of inter contacts: 22.03% (intra: 48089, inter: 13589)"
#> [37] "2023-05-22,12:14:25 :: INFO :: 12315 pairs discarded: Loops: 2205, Uncuts: 10027, Weirds: 83"
#> [38] "2023-05-22,12:14:25 :: INFO :: 61678 pairs kept (83.36%)"
#> [39] "2023-05-22,12:14:27 :: INFO :: 1% PCR duplicates have been filtered out (719 / 61678 pairs) "
#> [40] "2023-05-22,12:14:29 :: INFO :: Contact map generated after 0h 0m 23s"