bash
cat ~/Share/data_wrangling/cellranger-tiny-bcl-simple-1.2.0.csv
Lane,Sample,Index
1,test_sample,SI-GA-E3
The estimated time for this lab is around 1h15.
cellranger mkfastq
.cellranger count
.cellranger mkfastq
Navigate to your terminal in RStudio
on AWS.
Go to the cellranger mkfastq page and read the Overview.
For this workshop, we are going to look at a toy dataset provided by 10x Genomics. The dataset is cellranger-tiny-bcl-1.2.0
, provided in the ~/Share/data_wrangling/
folder. With the cellranger-tiny-bcl-1.2.0
dataset, we have a samplesheet file cellranger-tiny-bcl-simple-1.2.0.csv
that contains the sample information.
Now we can demultiplex our bcl files by running the following command in the terminal. To do that, we will use the cellranger mkfastq
command.
bash
cellranger mkfastq --help
/opt/cellranger/bin
cellranger mkfastq (cellranger-5.0.1)
Copyright (c) 2020 10x Genomics, Inc. All rights reserved.
-------------------------------------------------------------------------------
Run Illumina demultiplexer on sample sheets that contain 10x-specific sample
index sets, and generate 10x-specific quality metrics after the demultiplex.
Any bcl2fastq argument will work (except a few that are set by the pipeline
to ensure proper trimming and sample indexing). The FASTQ output generated
will be the same as when running bcl2fastq directly.
These bcl2fastq arguments are overridden by this pipeline:
--fastq-cluster-count
--minimum-trimmed-read-length
--mask-short-adapter-reads
Usage:
cellranger mkfastq --run=PATH [options]
cellranger mkfastq -h | --help | --version
Required:
--run=PATH Path of Illumina BCL run folder.
Optional:
# Sample Sheet
--id=NAME Name of the folder created by mkfastq. If not supplied,
will default to the name of the flowcell referred to
by the --run argument.
--csv=PATH
--samplesheet=PATH
--sample-sheet=PATH
Path to the sample sheet. The sample sheet can either be
a simple CSV with lane, sample and index columns, or
an Illumina Experiment Manager-compatible sample
sheet. Sample sheet indexes can refer to 10x sample
index set names (e.g., SI-GA-A12).
--simple-csv=PATH Deprecated. Same meaning as --csv.
--qc Deprecated: will be removed in a future version.
Calculate both sequencing and 10x-specific metrics,
including per-sample barcode matching rate. Will not
be performed unless this flag is specified.
--force-single-index
If 10x-supplied i7/i5 paired indices are specified,
but the flowcell was run with only one sample
index, allow the demultiplex to proceed using
the i7 half of the sample index pair.
--filter-single-index
Only demultiplex samples identified
by an i7-only sample index, ignoring dual-indexed
samples. Dual-indexed samples will not be
demultiplexed.
--filter-dual-index
Only demultiplex samples identified
by i7/i5 dual-indices (e.g., SI-TT-A6), ignoring single-
index samples. Single-index samples will not be
demultiplexed.
--rc-i2-override=BOOL
Indicates if the bases in the I2 read are emitted as
reverse complement by the sequencing workflow.
Set to 'true' for Workflow A / NovaSeq Reagent Kit v1.5
or greater. Set to 'false' for Wokflow B / older NovaSeq
Reagent Kits. NOTE: this parameter is autodetected based
and should only be passed in special circumstances.
# bcl2fastq Pass-Through
--lanes=NUMS Comma-delimited series of lanes to demultiplex. Shortcut
for the --tiles argument.
--use-bases-mask=MASK
Same as bcl2fastq; override the read lengths as
specified in RunInfo.xml. See Illumina bcl2fastq
documentation for more information.
--delete-undetermined
Delete the Undetermined FASTQ files left by bcl2fastq
Useful if your sample sheet is only expected to
match a subset of the flowcell.
--output-dir=PATH Same as in bcl2fastq. Folder where FASTQs, reports and
stats will be generated.
--project=NAME Custom project name, to override the samplesheet or to
use in conjunction with the --csv argument.
# Martian Runtime
--jobmode=MODE Job manager to use. Valid options: local (default), sge,
lsf, or a .template file
--localcores=NUM Set max cores the pipeline may request at one time. Only
applies to local jobs.
--localmem=NUM Set max GB the pipeline may request at one time. Only
applies to local jobs.
--localvmem=NUM Set max virtual address space in GB for the pipeline.
Only applies to local jobs.
--mempercore=NUM Reserve enough threads for each job to ensure enough
memory will be available, assuming each core on your
cluster has at least this much memory available. Only
applies in cluster jobmodes.
--maxjobs=NUM Set max jobs submitted to cluster at one time. Only
applies in cluster jobmodes.
--jobinterval=NUM Set delay between submitting jobs to cluster, in ms.
Only applies in cluster jobmodes.
--overrides=PATH The path to a JSON file that specifies stage-level
overrides for cores and memory. Finer-grained
than --localcores, --mempercore and --localmem.
Consult the 10x support website for an example
override file.
--uiport=PORT Serve web UI at http://localhost:PORT
--disable-ui Do not serve the UI.
--noexit Keep web UI running after pipestance completes or fails.
--nopreflight Skip preflight checks.
-h --help Show this message.
--version Show version.
If you demultiplexed with 'cellranger mkfastq' or directly with Illumina
bcl2fastq, then set --fastqs to the project folder containing FASTQ files. In
addition, set --sample to the name prefixed to the FASTQ files comprising
your sample. For example, if your FASTQs are named:
subject1_S1_L001_R1_001.fastq.gz
then set --sample=subject1
The output folders can be viewed by running the ls
command:
bash
ls -l ~/tiny-bcl/outs/fastq_path/H35KCBCXY/test_sample
total 96296
-rw-r--r-- 1 jacques staff 3930346 Oct 30 23:46 test_sample_S1_L001_I1_001.fastq.gz
-rw-r--r-- 1 jacques staff 11093755 Oct 30 23:46 test_sample_S1_L001_R1_001.fastq.gz
-rw-r--r-- 1 jacques staff 34273874 Oct 30 23:46 test_sample_S1_L001_R2_001.fastq.gz
Open the html file tiny-bcl/outs/fastq_path/Reports/html/index.html
by navigating to the file in RStudio, using the Files Tab. When you click on the file, select the option to View in Web Browser. Take some time to explore the demultiplexed outputs.
cellranger count
Go to the cellranger count algorithm overview and read the section on Alignment (Read Trimming, Genome Alignment, MAPQ adjustment, Transcriptome Alignment, UMI Counting).
In bash, the cellranger count
command is used to generate gene count matrices from the demultiplexed reads.
bash
cellranger count --help
cellranger-count
Count gene expression (targeted or whole-transcriptome) and/or feature barcode reads from a single sample and GEM well
USAGE:
cellranger count [FLAGS] [OPTIONS] --id <ID> --transcriptome <PATH>
FLAGS:
--no-target-umi-filter Turn off the target UMI filtering subpipeline
--no-bam Do not generate a bam file
--nosecondary Disable secondary analysis, e.g. clustering. Optional
--include-introns Include intronic reads in count
--no-libraries Proceed with processing using a --feature-ref but no Feature Barcode libraries specified with the 'libraries' flag
--dry Do not execute the pipeline. Generate a pipeline invocation (.mro) file and stop
--disable-ui Do not serve the web UI
--noexit Keep web UI running after pipestance completes or fails
--nopreflight Skip preflight checks
-h, --help Prints help information
OPTIONS:
--id <ID> A unique run id and output folder name [a-zA-Z0-9_-]+
--description <TEXT> Sample description to embed in output files
--transcriptome <PATH> Path of folder containing 10x-compatible transcriptome reference
--fastqs <PATH>... Path to input FASTQ data
--project <TEXT> Name of the project folder within a mkfastq or bcl2fastq-generated folder to pick FASTQs from
--sample <PREFIX>... Prefix of the filenames of FASTQs to select
--lanes <NUMS>... Only use FASTQs from selected lanes
--libraries <CSV> CSV file declaring input library data sources
--feature-ref <CSV> Feature reference CSV file, declaring Feature Barcode constructs and associated barcodes
--target-panel <CSV> The target panel CSV file declaring the target panel used, if any
--expect-cells <NUM> Expected number of recovered cells
--force-cells <NUM> Force pipeline to use this number of cells, bypassing cell detection
--r1-length <NUM> Hard trim the input Read 1 to this length before analysis
--r2-length <NUM> Hard trim the input Read 2 to this length before analysis
--chemistry <CHEM> Assay configuration. NOTE: by default the assay configuration is detected automatically, which is the recommened mode. You usually will not
need to specify a chemistry. Options are: 'auto' for autodetection, 'threeprime' for Single Cell 3', 'fiveprime' for Single Cell 5',
'SC3Pv1' or 'SC3Pv2' or 'SC3Pv3' for Single Cell 3' v1/v2/v3, 'SC5P-PE' or 'SC5P-R2' for Single Cell 5', paired-end/R2-only, 'SC-FB' for
Single Cell Antibody-only 3' v2 or 5' [default: auto]
--jobmode <MODE> Job manager to use. Valid options: local (default), sge, lsf, slurm or a .template file. Search for help on "Cluster Mode" at
support.10xgenomics.com for more details on configuring the pipeline to use a compute cluster [default: local]
--localcores <NUM> Set max cores the pipeline may request at one time. Only applies to local jobs
--localmem <NUM> Set max GB the pipeline may request at one time. Only applies to local jobs
--localvmem <NUM> Set max virtual address space in GB for the pipeline. Only applies to local jobs
--mempercore <NUM> Reserve enough threads for each job to ensure enough memory will be available, assuming each core on your cluster has at least this much
memory available. Only applies in cluster jobmodes
--maxjobs <NUM> Set max jobs submitted to cluster at one time. Only applies in cluster jobmodes
--jobinterval <NUM> Set delay between submitting jobs to cluster, in ms. Only applies in cluster jobmodes
--overrides <PATH> The path to a JSON file that specifies stage-level overrides for cores and memory. Finer-grained than --localcores, --mempercore and
--localmem. Consult the 10x support website for an example override file
--uiport <PORT> Serve web UI at http://localhost:PORT
Now, we will generate a gene count matrix for the test_sample
reads using cellranger count
and the transcriptome reference provided in ~/Share/refdata-gex-mm10-2020-A/
While the count command is running, read about the format of the feature-barcode matrices.
Once the count command is finished running, the pipeline outputs can be viewed with ls
:
bash
ls -l ~/counts/outs
total 119792
drwxr-xr-x 7 jacques staff 224 Oct 30 23:51 analysis
-rw-r--r-- 1 jacques staff 2522877 Oct 30 23:51 cloupe.cloupe
drwxr-xr-x 5 jacques staff 160 Oct 30 23:51 filtered_feature_bc_matrix
-rw-r--r-- 1 jacques staff 489402 Oct 30 23:51 filtered_feature_bc_matrix.h5
-rw-r--r-- 1 jacques staff 627 Oct 30 23:51 metrics_summary.csv
-rw-r--r-- 1 jacques staff 2885171 Oct 30 23:51 molecule_info.h5
-rw-r--r-- 1 jacques staff 47264234 Oct 30 23:51 possorted_genome_bam.bam
-rw-r--r-- 1 jacques staff 1744960 Oct 30 23:51 possorted_genome_bam.bam.bai
drwxr-xr-x 5 jacques staff 160 Oct 30 23:51 raw_feature_bc_matrix
-rw-r--r-- 1 jacques staff 3557585 Oct 30 23:51 raw_feature_bc_matrix.h5
-rw-r--r-- 1 jacques staff 2848576 Oct 30 23:51 web_summary.html
Can you locate the feature-barcode matrices? What is the difference between the raw_feature_bc_matrix
and filtered_feature_bc_matrix
data types?
Now open the html file counts/outs/web_summary.html
by navigating to the file in RStudio, using the Files Tab. When you click on the file, select the option to View in Web Browser. Take some time to explore the gene expression matrix outputs.