6 Lab 3: Introduction to R/Bioconductor
6.1 Installing packages in R
“Hey, I’ve heard so many good things about this piece of software, it’s called ‘slingshot’? Heard of it? I really want to try it out on my dataset!!”
Or, in other words: “how do I install this or that brand new cutting-edge fancy package?”
R works with packages, available from different sources:
-
CRAN, theRdeveloper team and official package provider: CRAN (which can probably win the title of “the worst webpage ever designed that survived until 2023”). -
Bioconductor, another package provider, with a primary focus on genomic-related packages: Bioconductor. - Other unofficial sources, such as
GitHub.
Let’s start by going over package installation.
- Install
mgcv,HCADataandreveliopackages
Each of these three packages is available from a different source.
R
install.packages('mgcv')
BiocManager::install('HCAData')
remotes::install_github('danielschw188/revelio')- Check package help pages
Package help pages are available at different places, depending on their source. That being said, there is a place I like to go to easily find information related to most packages:
For instance, check out Revelio package help pages.
- What is this package designed for?
- What are its main functions? What type of input does it require?
6.2 Basic R and Bioconductor classes
While CRAN is a repository of general-purpose packages, Bioconductor is the greatest source of analytical tools, data and workflows dedicated to genomic projects in R. Read more about Bioconductor to fully understand how it builds up on top of R general features, especially with the specific classes it introduces.
The two main concepts behind Bioconductor’s success are the non-redundant classes of objects it provides and their inter-operability. Huber et al., Nat. Methods 2015 summarizes it well.
6.2.1 Important R concepts:
6.2.1.1 tibble tables:
tibbles are built on the fundamental data.frame objects. They follow “tidy” concepts, all gathered in a common tidyverse. This set of key concepts help general data investigation and data visualization through a set of associated packages such as ggplot2.
- Import a text file into
tibbles
tibbles can be created from text files (or Excel files) using the readr package (part of tidyverse)
6.2.1.2 Handling of tibbles:
tibbles can be readily “sliced” (i.e. selecting rows by number/name), “filtered” (i.e. selecting rows by condition) and columns can be “selected”. All these operations are performed using verbs (most of them provided by the dplyr package, part of tidyverse).
Columns can also be quickly added/modified using the mutate verb.
6.2.1.3 |> pipe:
Actions on tibbles can be piped as a chain with |>, just like | pipes stdout as the stdin of the next command in bash. In this case, the first argument is always the output of the previous function and is ommited. Because tidyverse functions generally return a modified version of the input, pipping works remarkably well in such context.
6.2.2 Important Bioconductor concepts:
6.2.2.1 SummarizedExperiment class:
The most fundamental class used to hold the content of large-scale quantitative analyses, such as counts of RNA-seq experiments, or high-throughput cytometry experiments or proteomics experiments.

Make sure you understand the structure of objects from this class. A dedicated workshop that I would recommend quickly going over is available here. Generally speaking, a SummarizedExperiment object contains matrix-like objects (the assays), with rows representing features (e.g. genes, transcripts, …) and each column representing a sample. Information specific to genes and samples are stored in “parallel” data frames, for example to store gene locations, tissue of expression, biotypes (for genes) or batch, generation date, or machine ID (for samples). On top of that, metadata are also stored in the object (to store description of a project, …).
An important difference with S3 list-like objects usually used in R is that most of the underlying data (organized in precisely structured "slots") is accessed using getter functions, rather than the familiar $ or [. Here are some important getters:
-
assay(),assays(): Extrant matrix-like or list of matrix-like objects of identical dimensions. Since the objects arematrix-like,dim(),dimnames(), and 2-dimensional[,[<-methods are available. -
colData(): Annotations on each column (as a DataFrame): usually, description of each sample -
rowData(): Annotations on each row (as a DataFrame): usually, description of each gene -
metadata(): List of unstructured metadata describing the overall content of the object.
Let’s dig into an example (you may need to install the airway package from Bioconductor…)
R
library(SummarizedExperiment)
library(airway)
data(airway)
airway- What are the dimensions of the dataset? What type of quantitative data is stored? Which features are assessed?
- Can you create a subset of the data corresponding to
LRGgenes in untreated samples?
6.2.2.2 GenomicRanges class (a.k.a. GRanges):
GenomicRanges are a type of IntervalRanges, they are useful to describe genomic intervals. Each entry in a GRanges object has a seqnames(), a start() and an end() coordinates, a strand(), as well as associated metadata (mcols()). They can be built from scratch using tibbles converted with makeGRangesFromDataFrame().
R
library(GenomicRanges)
gr <- read_tsv('~/Share/GSM4486714_AXH009_genes.tsv', col_names = c('ID', 'Symbol')) |>
mutate(
chr = sample(1:22, n(), replace = TRUE),
start = sample(1:1000, n(), replace = TRUE),
end = sample(10000:20000, n(), replace = TRUE),
strand = sample(c('-', '+'), n(), replace = TRUE)
) |>
makeGRangesFromDataFrame(keep.extra.columns = TRUE)
gr
mcols(gr)Just like tidyverse in R, tidy functions are provided for GRanges by the plyranges package.
- Can you find a way to easily read common input files such as
bedfiles intoGRanges?
R
library(rtracklayer)
genes2 <- import('~/Share/GRCm39_genes.bed')
genes2- How would you have proceeded without
rtracklayer? Check thestartcoordinates: what do you see? Comment on the interest of usingBioconductor.
6.3 CRAN & Bioconductor approaches to scRNAseq
6.3.1 scRNAseq in Bioconductor
For single-cell RNA-seq projects, Bioconductor has been introducting new classes and standards very rapidly in the past few years. Notably, several packages are increasingly becoming central for single-cell analysis:
SingleCellExperimentscaterscranscuttlebatchelorSingleRblusterDropletUtilsslingshottradeSeq- …
SingleCellExperiment is the fundamental class designed to contain single-cell (RNA-seq) data in Bioconductor ecosystem. It is a modified version of the SummarizedExperiment object, so most of the getters/setters are shared with this class.

Several slots can be accessed in a SingleCellExperiment object, just like the SummarizedExperiment object it’s been adapted from:
Quantitative metrics for scRNAseq studies can also be stored in assays:
- Check the
colData()output of thesceobject. What information is stored there? How can you access the different objects stored incolData?
- Are there any reduced dimensionality representation of the data stored in the
sceobject? How can we run aPCAusing normalized counts?
- Now, let’s compute a
UMAPembedding from thisPCAand compare it to thePCAembedding.
We will see more advanced ways of reducing scRNAseq data dimensionality in the coming lectures and labs.
6.3.2 scRNAseq in R
Seurat is another very popular ecosystem to investigate scRNAseq data. It is primarily developed and maintained by the Sajita Lab. It originally begun as a single package aiming at encompassing “all” (most) aspects of scRNAseq analysis. However, it rapidly evolved in a much larger project, and now operates along with other “wrappers” and extensions. It also has a very extended support from the lab group. All in all, is provides a (somewhat) simple workflow to start investigating scRNAseq data.
It is important to chose one standard that you feel comfortable with yourself. Which standard provides the most intuitive approach for you? Do you prefer an “all-in-one, plug-n-play” workflow (Seurat-style), or a modular approach (Bioconductor-style)? Which documentation is easier to read for you, a central full-featured website with extensive examples (Seurat-style), or “programmatic”-style vignettes (Bioconductor-style)?
This course will mostly rely on Bioconductor-based methods, but sometimes use Seurat-based methods.s In the absence of coordination of data structures, the next best solution is to write functions to coerce an object from a certain class to another class (i.e. Seurat to SingleCellExperiment, or vice-versa). Luckily, this is quite straightforward in R for these 2 data classes:
R
sce_seurat <- Seurat::as.Seurat(sce)
str(sce)
sce
sce_seurat
sce2 <- Seurat::as.SingleCellExperiment(sce_seurat)- Do you see any change between
sceand the corresponding, “back-converted”,sce2objects? Explain these differences.
- Try and access the underlying raw or normalized data from the
sce_seuratobject. How does it compare to data access from anSingleCellExperimentobject?
R
colnames(sce_seurat)
ncol(sce_seurat)
nrow(sce_seurat)
# cells and features access
head(Seurat::Cells(sce_seurat))
head(rownames(sce_seurat))
# cell data access
head(sce_seurat[[]])
head(sce_seurat$label)
# Counts access
Seurat::GetAssayData(object = sce_seurat, slot = "counts")[1:10, 1:10]
Seurat::GetAssayData(object = sce_seurat, slot = "data")[1:10, 1:10]
# Embeddings
head(Seurat::Embeddings(object = sce_seurat, reduction = "diffusion"))6.4 Reading scRNAseq data
- Try to load the raw
10Xsingle-cell RNA-seq data downloaded yesterday (fromLier et al.) into aSingleCellExperimentobject usingDropletUtilspackage.
Public single-cell RNA-seq data can be retrieved from within R directly, thanks to several data packages, for instance scRNAseq or HCAData.
- Check out the
He et al., Genome Biol. 2020paper. Can you find a way to load the scRNAseq data from this paper without having to leave theRconsole?
R
organs <- scRNAseq::HeOrganAtlasData(ensembl = TRUE)
organsThe interest of this approach is that one can recover a full-fledged SingleCellExperiment (often) provided by the authors of the corresponding study. This means that lots of information, such as batch ID, clustering, cell annotation, etc., may be readily available.
- Check the data available for cells/features in the dataset from
He et al..
6.5 Bonus
To compare the two different approaches, try preparing both a SingleCellExperiment or a Seurat object from scratch, using the matrix files generated in the previous lab. Read the documentation of the two related packages to understand how to do this.
This will be extensively covered in the next lab for everybody.