4  Lab 2: From .bcl to count matrix

Notes

The estimated time for this lab is around 1h15.

Aims
  • Learn how to demultiplex sequencing data using cellranger mkfastq.
  • Learn how to generate gene count matrices using cellranger count.

4.1 Demultiplexing sequencing data with cellranger mkfastq

Navigate to your terminal in RStudio on AWS.

Go to the cellranger mkfastq page and read the Overview.

For this workshop, we are going to look at a toy dataset provided by 10x Genomics. The dataset is cellranger-tiny-bcl-1.2.0, provided in the ~/Share/data_wrangling/ folder. With the cellranger-tiny-bcl-1.2.0 dataset, we have a samplesheet file cellranger-tiny-bcl-simple-1.2.0.csv that contains the sample information.

Question

Go to the Terminal tab in your RStudio and take a look at the 10x samplesheet .csv file

bash
cat ~/Share/data_wrangling/cellranger-tiny-bcl-simple-1.2.0.csv
Lane,Sample,Index
1,test_sample,SI-GA-E3
Question

Next, explore the contents of the sequencing directory:

bash
ls -l ~/Share/data_wrangling/cellranger-tiny-bcl-1.2.0
total 32
drwxr-xr-x  6 jacques  staff   192 Oct 29 10:54 Config
drwxr-xr-x  3 jacques  staff    96 Oct 29 10:54 Data
drwxr-xr-x  9 jacques  staff   288 Oct 29 10:54 InterOp
-rw-r--r--  1 jacques  staff    48 Jun  7  2023 RTAComplete.txt
-rw-r--r--  1 jacques  staff   673 Jun  7  2023 RunInfo.xml
-rw-r--r--  1 jacques  staff  5366 Jun  7  2023 runParameters.xml

Now we can demultiplex our bcl files by running the following command in the terminal. To do that, we will use the cellranger mkfastq command.

bash
cellranger mkfastq --help
/opt/cellranger/bin

cellranger mkfastq (cellranger-5.0.1)
Copyright (c) 2020 10x Genomics, Inc.  All rights reserved.
-------------------------------------------------------------------------------

Run Illumina demultiplexer on sample sheets that contain 10x-specific sample 
index sets, and generate 10x-specific quality metrics after the demultiplex.  
Any bcl2fastq argument will work (except a few that are set by the pipeline 
to ensure proper trimming and sample indexing). The FASTQ output generated 
will be the same as when running bcl2fastq directly.

These bcl2fastq arguments are overridden by this pipeline:
    --fastq-cluster-count
    --minimum-trimmed-read-length
    --mask-short-adapter-reads

Usage:
    cellranger mkfastq --run=PATH [options]
    cellranger mkfastq -h | --help | --version

Required:
    --run=PATH          Path of Illumina BCL run folder.

Optional:
# Sample Sheet
    --id=NAME           Name of the folder created by mkfastq. If not supplied,
                            will default to the name of the flowcell referred to
                            by the --run argument.
    --csv=PATH
    --samplesheet=PATH
    --sample-sheet=PATH
                        Path to the sample sheet. The sample sheet can either be
                            a simple CSV with lane, sample and index columns, or
                            an Illumina Experiment Manager-compatible sample
                            sheet. Sample sheet indexes can refer to 10x sample
                            index set names (e.g., SI-GA-A12).
    --simple-csv=PATH   Deprecated. Same meaning as --csv.
    --qc                Deprecated: will be removed in a future version.
                        Calculate both sequencing and 10x-specific metrics,
                            including per-sample barcode matching rate. Will not
                            be performed unless this flag is specified.
    --force-single-index
                        If 10x-supplied i7/i5 paired indices are specified,
                            but the flowcell was run with only one sample
                            index, allow the demultiplex to proceed using
                                the i7 half of the sample index pair.
    --filter-single-index
                        Only demultiplex samples identified
                            by an i7-only sample index, ignoring dual-indexed
                            samples.  Dual-indexed samples will not be
                            demultiplexed.
    --filter-dual-index
                        Only demultiplex samples identified
                          by i7/i5 dual-indices (e.g., SI-TT-A6), ignoring single-
                          index samples.  Single-index samples will not be 
                          demultiplexed.
    --rc-i2-override=BOOL
                        Indicates if the bases in the I2 read are emitted as 
                          reverse complement by the sequencing workflow.
                          Set to 'true' for Workflow A / NovaSeq Reagent Kit v1.5
                          or greater. Set to 'false' for Wokflow B / older NovaSeq
                          Reagent Kits. NOTE: this parameter is autodetected based 
                          and should only be passed in special circumstances.
    
# bcl2fastq Pass-Through
    --lanes=NUMS        Comma-delimited series of lanes to demultiplex. Shortcut
                            for the --tiles argument.
    --use-bases-mask=MASK
                        Same as bcl2fastq; override the read lengths as
                            specified in RunInfo.xml. See Illumina bcl2fastq
                            documentation for more information.
    --delete-undetermined
                        Delete the Undetermined FASTQ files left by bcl2fastq
                            Useful if your sample sheet is only expected to
                            match a subset of the flowcell.
    --output-dir=PATH   Same as in bcl2fastq. Folder where FASTQs, reports and
                            stats will be generated.
    --project=NAME      Custom project name, to override the samplesheet or to
                            use in conjunction with the --csv argument.

# Martian Runtime
    --jobmode=MODE      Job manager to use. Valid options: local (default), sge,
                            lsf, or a .template file
    --localcores=NUM    Set max cores the pipeline may request at one time. Only
                            applies to local jobs.
    --localmem=NUM      Set max GB the pipeline may request at one time. Only
                            applies to local jobs.
    --localvmem=NUM     Set max virtual address space in GB for the pipeline.
                            Only applies to local jobs.
    --mempercore=NUM    Reserve enough threads for each job to ensure enough
                        memory will be available, assuming each core on your
                        cluster has at least this much memory available. Only
                            applies in cluster jobmodes.
    --maxjobs=NUM       Set max jobs submitted to cluster at one time. Only
                            applies in cluster jobmodes.
    --jobinterval=NUM   Set delay between submitting jobs to cluster, in ms.
                            Only applies in cluster jobmodes.
    --overrides=PATH    The path to a JSON file that specifies stage-level
                            overrides for cores and memory. Finer-grained
                            than --localcores, --mempercore and --localmem.
                            Consult the 10x support website for an example
                            override file.

    --uiport=PORT       Serve web UI at http://localhost:PORT
    --disable-ui        Do not serve the UI.
    --noexit            Keep web UI running after pipestance completes or fails.
    --nopreflight       Skip preflight checks.

    -h --help           Show this message.
    --version           Show version.

If you demultiplexed with 'cellranger mkfastq' or directly with Illumina
bcl2fastq, then set --fastqs to the project folder containing FASTQ files. In
addition, set --sample to the name prefixed to the FASTQ files comprising
your sample. For example, if your FASTQs are named:
    subject1_S1_L001_R1_001.fastq.gz
then set --sample=subject1
Question

What is the purpose of --id, --run, and --csv arguments ?

Choose these arguments wisely and execute the command to demultiplex the sequencing data.

bash
cellranger mkfastq --id tiny-bcl --run ~/Share/data_wrangling/cellranger-tiny-bcl-1.2.0 --csv ~/Share/data_wrangling/cellranger-tiny-bcl-simple-1.2.0.csv
/opt/cellranger/bin
cellranger mkfastq (cellranger-5.0.1)
Copyright (c) 2020 10x Genomics, Inc.  All rights reserved.
-------------------------------------------------------------------------------

Martian Runtime - v4.0.2
Serving UI at http://alcide:36825?auth=0r06HTBMZxdlUuIo9SDR6HlYjSyn7ohyqq9S6lsPlYA

Running preflight checks (please wait)...
Checking run folder...
Checking RunInfo.xml...
Checking system environment...
Emitting run information...
Checking read specification...
Checking samplesheet specs...
2024-10-29 11:42:56 [runtime] (ready)           ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET
2024-10-29 11:42:56 [runtime] (run:local)       ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.PREPARE_SAMPLESHEET.fork0.chnk0.main
...
2024-10-29 11:43:00 [runtime] (split_complete)  ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.MERGE_FASTQS_BY_LANE_SAMPLE
2024-10-29 11:43:00 [runtime] (run:local)       ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.MERGE_FASTQS_BY_LANE_SAMPLE.fork0.chnk0.main
2024-10-29 11:43:00 [runtime] (chunks_complete) ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.MERGE_FASTQS_BY_LANE_SAMPLE
2024-10-29 11:43:00 [runtime] (run:local)       ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.MERGE_FASTQS_BY_LANE_SAMPLE.fork0.join
2024-10-29 11:43:00 [runtime] (join_complete)   ID.tiny-bcl.MAKE_FASTQS_CS.MAKE_FASTQS.MERGE_FASTQS_BY_LANE_SAMPLE

Outputs:
- Run QC metrics:        null
- FASTQ output folder:   /home/rsg/tiny-bcl/outs/fastq_path
- Interop output folder: /home/rsg/tiny-bcl/outs/interop_path
- Input samplesheet:     /home/rsg/tiny-bcl/outs/input_samplesheet.csv

Waiting 6 seconds for UI to do final refresh.
Pipestance completed successfully!

2024-10-29 11:43:06 Shutting down.
Saving pipestance info to "tiny-bcl/tiny-bcl.mri.tgz"

The output folders can be viewed by running the ls command:

bash
ls -l ~/tiny-bcl/outs/fastq_path/H35KCBCXY/test_sample
total 96296
-rw-r--r--  1 jacques  staff   3930346 Oct 30 23:46 test_sample_S1_L001_I1_001.fastq.gz
-rw-r--r--  1 jacques  staff  11093755 Oct 30 23:46 test_sample_S1_L001_R1_001.fastq.gz
-rw-r--r--  1 jacques  staff  34273874 Oct 30 23:46 test_sample_S1_L001_R2_001.fastq.gz
Question

Look at the first reads listed in index read (I1), read 1 (R1), and read (R2) files.

What is the purpose of the index read? What does it contain?

bash
zcat tiny-bcl/outs/fastq_path/H35KCBCXY/test_sample/test_sample_S1_L001_I1_001.fastq.gz | head -n 12
@D00547:905:H35KCBCXY:1:1101:1284:1973 1:N:0:AGGTATTG
AGGTATTG
+
GGGGGIII
@D00547:905:H35KCBCXY:1:1101:1729:1981 1:N:0:AGGTATTG
AGGTATTG
+
GAGGGIIG
@D00547:905:H35KCBCXY:1:1101:2005:1944 1:N:0:AGGTATTG
AGGTATTG
+
AA<G.A.G

Open the html file tiny-bcl/outs/fastq_path/Reports/html/index.html by navigating to the file in RStudio, using the Files Tab. When you click on the file, select the option to View in Web Browser. Take some time to explore the demultiplexed outputs.

4.2 Generating gene count matrices with cellranger count

Go to the cellranger count algorithm overview and read the section on Alignment (Read Trimming, Genome Alignment, MAPQ adjustment, Transcriptome Alignment, UMI Counting).

In bash, the cellranger count command is used to generate gene count matrices from the demultiplexed reads.

bash
cellranger count --help
cellranger-count 
Count gene expression (targeted or whole-transcriptome) and/or feature barcode reads from a single sample and GEM well

USAGE:
    cellranger count [FLAGS] [OPTIONS] --id <ID> --transcriptome <PATH>

FLAGS:
        --no-target-umi-filter    Turn off the target UMI filtering subpipeline
        --no-bam                  Do not generate a bam file
        --nosecondary             Disable secondary analysis, e.g. clustering. Optional
        --include-introns         Include intronic reads in count
        --no-libraries            Proceed with processing using a --feature-ref but no Feature Barcode libraries specified with the 'libraries' flag
        --dry                     Do not execute the pipeline. Generate a pipeline invocation (.mro) file and stop
        --disable-ui              Do not serve the web UI
        --noexit                  Keep web UI running after pipestance completes or fails
        --nopreflight             Skip preflight checks
    -h, --help                    Prints help information

OPTIONS:
        --id <ID>                 A unique run id and output folder name [a-zA-Z0-9_-]+
        --description <TEXT>      Sample description to embed in output files
        --transcriptome <PATH>    Path of folder containing 10x-compatible transcriptome reference
        --fastqs <PATH>...        Path to input FASTQ data
        --project <TEXT>          Name of the project folder within a mkfastq or bcl2fastq-generated folder to pick FASTQs from
        --sample <PREFIX>...      Prefix of the filenames of FASTQs to select
        --lanes <NUMS>...         Only use FASTQs from selected lanes
        --libraries <CSV>         CSV file declaring input library data sources
        --feature-ref <CSV>       Feature reference CSV file, declaring Feature Barcode constructs and associated barcodes
        --target-panel <CSV>      The target panel CSV file declaring the target panel used, if any
        --expect-cells <NUM>      Expected number of recovered cells
        --force-cells <NUM>       Force pipeline to use this number of cells, bypassing cell detection
        --r1-length <NUM>         Hard trim the input Read 1 to this length before analysis
        --r2-length <NUM>         Hard trim the input Read 2 to this length before analysis
        --chemistry <CHEM>        Assay configuration. NOTE: by default the assay configuration is detected automatically, which is the recommened mode. You usually will not
                                  need to specify a chemistry. Options are: 'auto' for autodetection, 'threeprime' for Single Cell 3', 'fiveprime' for  Single Cell 5',
                                  'SC3Pv1' or 'SC3Pv2' or 'SC3Pv3' for Single Cell 3' v1/v2/v3, 'SC5P-PE' or 'SC5P-R2' for Single Cell 5', paired-end/R2-only, 'SC-FB' for
                                  Single Cell Antibody-only 3' v2 or 5' [default: auto]
        --jobmode <MODE>          Job manager to use. Valid options: local (default), sge, lsf, slurm or a .template file. Search for help on "Cluster Mode" at
                                  support.10xgenomics.com for more details on configuring the pipeline to use a compute cluster [default: local]
        --localcores <NUM>        Set max cores the pipeline may request at one time. Only applies to local jobs
        --localmem <NUM>          Set max GB the pipeline may request at one time. Only applies to local jobs
        --localvmem <NUM>         Set max virtual address space in GB for the pipeline. Only applies to local jobs
        --mempercore <NUM>        Reserve enough threads for each job to ensure enough memory will be available, assuming each core on your cluster has at least this much
                                  memory available. Only applies in cluster jobmodes
        --maxjobs <NUM>           Set max jobs submitted to cluster at one time. Only applies in cluster jobmodes
        --jobinterval <NUM>       Set delay between submitting jobs to cluster, in ms. Only applies in cluster jobmodes
        --overrides <PATH>        The path to a JSON file that specifies stage-level overrides for cores and memory. Finer-grained than --localcores, --mempercore and
                                  --localmem. Consult the 10x support website for an example override file
        --uiport <PORT>           Serve web UI at http://localhost:PORT

Now, we will generate a gene count matrix for the test_sample reads using cellranger count and the transcriptome reference provided in ~/Share/refdata-gex-mm10-2020-A/

Question

What is the purpose of the --id, --transcriptome, --fastqs, --sample, and --create-bam arguments?

bash
cellranger count --id counts --transcriptome ~/Share/refdata-gex-mm10-2020-A/ --fastqs tiny-bcl/outs/fastq_path/H35KCBCXY/test_sample --sample test_sample --create-bam true
Martian Runtime - v4.0.2
Serving UI at http://alcide:35163?auth=IERAjMung3pm8corwsthsKjEue66fJwPDK0-5l-CLCo

Running preflight checks (please wait)...

Checking sample info...
Checking FASTQ folder...
Checking reference...
Checking reference_path (~/Share/refdata-gex-mm10-2020-A) on alcide...
Checking chemistry...
Checking optional arguments...

mrc: v4.0.2

mrp: v4.0.2

Anaconda: 
numpy: 1.15.4

scipy: 1.1.0

pysam: 0.16.0.1

h5py: 2.8.0

pandas: 0.24.2

STAR: 2.7.2a

samtools: samtools 1.10
Using htslib 1.10.2
Copyright (C) 2019 Genome Research Ltd.

2024-10-29 17:18:47 [runtime] (ready)           ID.counts.SC_RNA_COUNTER_CS.SC_MULTI_CORE.MAKE_FULL_CONFIG._MAKE_VDJ_CONFIG
2024-10-29 17:18:47 [runtime] (run:local)       ID.counts.SC_RNA_COUNTER_CS.SC_MULTI_CORE.MAKE_FULL_CONFIG._MAKE_VDJ_CONFIG.fork0.chnk0.main
2024-10-29 17:18:47 [runtime] (ready)           ID.counts.SC_RNA_COUNTER_CS.FULL_COUNT_INPUTS.WRITE_GENE_INDEX

...
...
...

2024-10-29 17:25:30 [runtime] (run:local)       ID.counts.SC_RNA_COUNTER_CS.SC_MULTI_CORE.MULTI_REPORTER.CLOUPE_PREPROCESS.fork0.chnk0.main
2024-10-29 17:25:32 [runtime] (chunks_complete) ID.counts.SC_RNA_COUNTER_CS.SC_MULTI_CORE.MULTI_REPORTER.CLOUPE_PREPROCESS
2024-10-29 17:25:32 [runtime] (run:local)       ID.counts.SC_RNA_COUNTER_CS.SC_MULTI_CORE.MULTI_REPORTER.CLOUPE_PREPROCESS.fork0.join
2024-10-29 17:25:33 [runtime] (join_complete)   ID.counts.SC_RNA_COUNTER_CS.SC_MULTI_CORE.MULTI_REPORTER.CLOUPE_PREPROCESS

Outputs:
- Run summary HTML:                         /home/rsg/counts/outs/web_summary.html
- Run summary CSV:                          /home/rsg/counts/outs/metrics_summary.csv
- BAM:                                      /home/rsg/counts/outs/possorted_genome_bam.bam
- BAM index:                                /home/rsg/counts/outs/possorted_genome_bam.bam.bai
- Filtered feature-barcode matrices MEX:    /home/rsg/counts/outs/filtered_feature_bc_matrix
- Filtered feature-barcode matrices HDF5:   /home/rsg/counts/outs/filtered_feature_bc_matrix.h5
- Unfiltered feature-barcode matrices MEX:  /home/rsg/counts/outs/raw_feature_bc_matrix
- Unfiltered feature-barcode matrices HDF5: /home/rsg/counts/outs/raw_feature_bc_matrix.h5
- Secondary analysis output CSV:            /home/rsg/counts/outs/analysis
- Per-molecule read information:            /home/rsg/counts/outs/molecule_info.h5
- CRISPR-specific analysis:                 null
- Loupe Browser file:                       /home/rsg/counts/outs/cloupe.cloupe
- Feature Reference:                        null
- Target Panel File:                        null

Waiting 6 seconds for UI to do final refresh.
Pipestance completed successfully!

2024-10-29 17:25:39 Shutting down.

While the count command is running, read about the format of the feature-barcode matrices.

Once the count command is finished running, the pipeline outputs can be viewed with ls:

bash
ls -l ~/counts/outs
total 119792
drwxr-xr-x  7 jacques  staff       224 Oct 30 23:51 analysis
-rw-r--r--  1 jacques  staff   2522877 Oct 30 23:51 cloupe.cloupe
drwxr-xr-x  5 jacques  staff       160 Oct 30 23:51 filtered_feature_bc_matrix
-rw-r--r--  1 jacques  staff    489402 Oct 30 23:51 filtered_feature_bc_matrix.h5
-rw-r--r--  1 jacques  staff       627 Oct 30 23:51 metrics_summary.csv
-rw-r--r--  1 jacques  staff   2885171 Oct 30 23:51 molecule_info.h5
-rw-r--r--  1 jacques  staff  47264234 Oct 30 23:51 possorted_genome_bam.bam
-rw-r--r--  1 jacques  staff   1744960 Oct 30 23:51 possorted_genome_bam.bam.bai
drwxr-xr-x  5 jacques  staff       160 Oct 30 23:51 raw_feature_bc_matrix
-rw-r--r--  1 jacques  staff   3557585 Oct 30 23:51 raw_feature_bc_matrix.h5
-rw-r--r--  1 jacques  staff   2848576 Oct 30 23:51 web_summary.html

Can you locate the feature-barcode matrices? What is the difference between the raw_feature_bc_matrix and filtered_feature_bc_matrix data types?

Now open the html file counts/outs/web_summary.html by navigating to the file in RStudio, using the Files Tab. When you click on the file, select the option to View in Web Browser. Take some time to explore the gene expression matrix outputs.