16  Lab 7: Hi-C data processing

Aims
  • Manually process Hi-C reads into pairs file
  • Filter and deduplicate the pairs file
  • Generate a Hi-C contact map
Datasets

We will process a data from the Koszul lab, generated in 2024 and published in Science.

16.1 Map reads with bwa

fastq files have already been downloaded from SRA, using the following command:

bash
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR221/072/SRR22130072/SRR22130072_1.fastq.gz -o Share/reads/HiC_R1.fq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR221/072/SRR22130072/SRR22130072_2.fastq.gz -o Share/reads/HiC_R2.fq.gz

bwa-mem2 allows faster mapping than bowtie2. We will use it for Hi-C mapping.

  • Check bwa-mem2 index help to understand how to use it.
  • Build a yeast genome index for bwa-mem2.
bash
bwa-mem2
bwa-mem2 index -p genomes/R64-1-1 Share/genome/R64-1-1.fa
  • Check bwa-mem2 mem help to understand the different options available.
  • Map the reads to the yeast genome with bwa-mem2 mem , and the -SP option.
  • Check the content of the HiC.sam file with samtools stats.
  • How many reads were mapped? How many pairs (and total reads) were there originally?
bash
bwa-mem2 mem
bwa-mem2 mem -SP5M -t 2 \
    genomes/R64-1-1 \
    Share/reads/HiC_R1.fq.gz \
    Share/reads/HiC_R2.fq.gz > HiC.sam
samtools stats HiC.sam | grep ^SN

16.2 Parse sam file into pairs

  • Check pairtools parse documentation
  • Use pairtools parse to convert the sam file into a pairs file. Use the --drop-seq and --drop-sam options to lighten the output file. Also use --output-stats test.stats to generate a summary of the parsing.
  • Check the content of the pairs file with head.
  • How many pairs are there in the pairs file?
  • Check the content of the test.stats file. What is the average mapping quality of the pairs?
bash
pairtools parse \
    -c ~/Share/genome/R64-1-1.chrom.sizes \
    -o HiC.pairs \
    --drop-seq --drop-sam \
    --output-stats test.stats \
    --nproc-in 2 --nproc-out 2 \
    Share/mapping/HiC.sam
head -n 50 HiC.pairs
wc -l HiC.pairs
cat test.stats

16.3 Filter pairs

The following filtering steps should be done to the pairs file:

  • Sort the pairs file

  • Remove duplicates

  • Select only pairs with both ends mapped (ie pair_type=="UU")

  • Process to pairs filtering with pairtools subcommands.

bash
pairtools sort --nproc 2 -o HiC.sorted.pairs Share/pairs/HiC.pairs

pairtools dedup \
    --max-mismatch 3 \
    --mark-dups \
    --output HiC.sorted.dedup.pairs \
    HiC.sorted.pairs

pairtools select 'pair_type=="UU"' HiC.sorted.dedup.pairs -o HiC_filtered.pairs

16.4 Generate contact maps

  • Check cooler cload help to understand how to bin pairs into a contact map.
  • Use cooler cload to generate a contact map from the pairs file. Bin the data at 1kb resolution.
bash
cooler cload pairs \
    -c1 2 -p1 3 -c2 4 -p2 5 \
    ~/Share/genome/R64-1-1.chrom.sizes:1000 \
    Share/pairs/HiC_filtered.pairs \
    HiC.cool
  • Check the cool file structure with cooler tree.
  • Check the content of the HiC.cool metadata with cooler info.
  • How many pairs are in the HiC.cool file? How many non-zero values are there? Comment.
bash
cooler tree ~/Share/maps/HiC.cool
cooler info ~/Share/maps/HiC.cool

16.5 Generate a multi-resolution contact map

  • Check cooler zoomify help to understand how to generate a multi-resolution contact map.
  • Use cooler zoomify to generate a multi-resolution contact map. Don’t forget to include normalization in the output maps.
  • Check the content of the HiC.cool file with cooler info and cooler tree.
bash
cooler zoomify -r 1000N --balance -o HiC.mcool ~/Share/maps/HiC.cool
cooler info HiC.mcool::/resolutions/1000
cooler info HiC.mcool::/resolutions/5000
cooler tree HiC.mcool
Back to top