The estimated time for this lab is around 1h15.
Most single-cell RNA-seq analysis takes place in either Python or R. Here, we focus on how to leverage R to investigate scRNAseq data. RStudio is an IDE (Integrated Development Environment), in other words a nice graphical interface to run R-related commands.
For this workshop, we have installed R and RStudio on AWS. We can directly use RStudio (actually, RStudio-server since it is installed on an AWS remote server). Simply open a browser and copy-paste the following address:
R
writeLines("http://{{< meta IP >}}:8787")

An RStudio login page will appear; to log in, use your user ID as both the ID and the password.
Notice that when you log in to RStudio, there are multiple panels. Familiarize yourself with the different panels.
The interactive R console is generally found in the bottom left corner of RStudio (though it may sometimes be in another corner). Everything else (history panel, environment panel, directory explorer panel, editor panel) is an extra feature provided by RStudio.
R:
Within the R console, you can safely use R-dedicated commands. Do you know the most common ones? The semantics are a bit different from the terminal commands you may be used to…
R
getwd() # equivalent of `pwd` in terminal
dir.create('~/data/') # equivalent of `mkdir ~/data/` in terminal
setwd("~/data/") # equivalent of `cd ~/data/` in terminal
list.files("~/data/") # equivalent of `ls` in terminal
download.file("...") # equivalent of `wget ...` in terminal

A general issue with bioinformatic analyses stems from the fact that nobody works in the same environment.
To ensure that we are all working in the same environment, we rely on AWS (Amazon Web Services) EC2 (Elastic Compute Cloud) instances. EC2 instances are “virtual” computers to which you can connect remotely from a local computer.
The instance is common for everybody. We are thus all sharing the same “computer”; this means:
The easiest way for us to launch bash commands from a terminal in AWS is through RStudio. You can open a terminal directly from within RStudio as follows: go to Tools > Terminal > New Terminal. This should open a new tab in the bottom left corner (next to the R console).
R console versus terminal:
From here onwards, be sure you completely understand the difference between “R console” and “terminal (or shell)”. They are entirely different things, and both can be accessed within RStudio. Understanding the difference between the two is crucial to avoid confusion for the rest of the course.
The same bash commands are available in the AWS terminal, regardless of whether you access the terminal from RStudio or through ssh.
One can list files, download files, check help pages, …, just like in R.
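Print the current working directory:
bash
pwd

Check the command history:
bash
history

Save the history to a history.txt file:
bash
history > history.txt

Create a data directory:
bash
mkdir data

Move into the data directory:
bash
cd data

Move the history.txt file into the data directory:
bash
mv ../history.txt ./

Check the manual of the wget command (hit q to exit):
bash
man wget

Check the help of the cellranger command and subcommands:
bash
cellranger --help
cellranger count --help

Save the help output to cellranger-help.txt:
bash
cellranger count --help > cellranger-help.txt

Download a file with wget:
bash
wget https://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-1.2.0.tar.gz

List the content of the ~/Share/ folder:
bash
ls -l ~/Share/

Download the git repository for this course from GitHub: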
bash
git clone https://github.com/js2264/scRNAseq_Physalia_2025.git

This downloads the repository for this course to your home folder on the AWS machine.
To get it on your local computer (to save the lectures and exercises), go to the GitHub repo page, click on the green Code button, then Download ZIP. Beware, the download may take a significant time depending on your internet connection (several hundred MB).
“This is a course about single-cell RNA-seq analysis, after all, so where is my data?”
Ok, “your” data is (most likely) yet to be sequenced! Or maybe you’re interested in digging into already existing databases! I mean, who isn’t interested in this mind-blowing achievement from 10X Genomics??
Human Cell Atlas is probably a good place to start digging, if you are interested in mammal-related studies. For instance, let’s say I am interested in epididymis differentiation. Boom: here is an entry from the HCA focusing on epididymis: link to HCA data portal.
Here is the link to the actual paper studying epididymis:
An atlas of human proximal epididymis reveals cell-specific functions and distinct roles for CFTR.
Find and check out the corresponding GEO entries for this study. What type of sequencing data is available?
Here is the link to the GEO page: link.
Can you find links to download the raw data from this paper?
There are several ways to find this information, e.g. with the ffq command-line tool, or with the web-based sra-explorer page. You will generally need the corresponding GEO ID or SRA project ID (e.g. SRPxxxxxx…).
Try to install the ffq tool from the Pachter lab.
conda packages
conda-based environments allow easy installs of packages such as ffq. Your (base) conda environment should be active by default, and you will only have to type:
bash
pip install ffq

Check the ffq help and try fetching metadata for the GSE ID GSE148963.
bash
ffq --help
ffq -t GSE GSE148963 > GSE148963_search.txt
head -n 30 GSE148963_search.txt

Can you find the links to raw data associated with the GSE148963 GEO ID?
You can use a grep command: grep returns the lines which match a given pattern (e.g. a link…)!
bash
grep 'ftp://' GSE148963_search.txt

And with a bit of sed magic…
bash
grep 'ftp://' GSE148963_search.txt | sed 's,.*ftp:,ftp:,' | sed 's,".*,,' | grep '\.fastq' > GSE148963_fastqlist.txt
## `ffq` looks through the GEO repository to find metadata associated with the `GSE148963` entry (including download links)
## `grep 'ftp://'` recovers the text lines that contain download links
## the `sed` commands clean up the text lines
## the final `grep` keeps only the links pointing to fastq files
## the `wget` command downloads locally the links listed in the generated file (GSE148963_fastqlist.txt)
# wget -i GSE148963_fastqlist.txt ## Do not run, it would take too long...

A subset of the reads has been downloaded and put in the ~/Share/ folder. Have a look at it!
bash
zcat ~/Share/SRR11575369_1.fq.gz | head -n 12

Try to understand the structure of the fq.gz file. What is the meaning of each line?
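If the layout of a fastq file is not obvious from the real data, a toy example may help. The sketch below builds a tiny gzipped file with two made-up reads (the read names, sequences and qualities are invented for illustration, not taken from the course data), then uses `awk` to show that each read spans four lines: identifier, sequence, separator, and quality string. It uses `gzip -dc`, which does the same job as `zcat`.

```shell
# Build a tiny two-read FASTQ file (made-up reads, for illustration only)
tmpdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@read2' 'TTGCAATG' '+' 'FFFFFFFF' \
  | gzip > "$tmpdir/toy.fq.gz"

# Each read spans 4 lines:
#   1) identifier (starts with @), 2) sequence, 3) separator (+), 4) qualities
# Print the identifier and the length of the sequence for each read
lengths=$(gzip -dc "$tmpdir/toy.fq.gz" \
  | awk 'NR % 4 == 1 { id = $0 } NR % 4 == 2 { print id, length($0) }')
echo "$lengths"

# Number of reads = number of lines / 4
n_lines=$(gzip -dc "$tmpdir/toy.fq.gz" | wc -l)
n_reads=$((n_lines / 4))
echo "reads: $n_reads"
```

The same two commands can be pointed at `~/Share/SRR11575369_1.fq.gz` to check by hand what a dedicated tool reports.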
How many reads are there in the fq.gz file? And how long are they? Can you get a summary of what is in this file? All of these questions can be quickly answered using the seqkit tool:
bash
seqkit --help
seqkit stats --help
seqkit stats ~/Share/SRR11575369_1.fq.gz

Many times, researchers will provide a filtered count matrix when they publish scRNAseq experiments (along with the mandatory raw fastq data, of course). It is much lighter than fastq reads, and lets you go ahead with downstream analyses a lot quicker. So how do you get these matrices?

The Human Cell Atlas Consortium provides many processed datasets. For instance, in our case, the study "A single cell atlas of the human proximal epididymis reveals cell-type specific functions and distinct roles for CFTR" (Leir et al., 2020) is available at the following link.
GEO also provides access to both raw and processed files.
Find GEO-hosted processed files for the Leir et al study.
Check the GEO DataSets search page.
Search for Leir 2020 CFTR. You should find the following entry.
Scrolling down to the bottom of the page, there is a box labelled “Supplementary data”. By clicking on “(custom)”, a list of extra supplementary files will appear.
Download and check the content of the count matrix, the genes and the barcodes files.
What type of information does each file contain? How is it formatted? Is it easily imported in R?
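As a point of comparison, here is a minimal sketch of how such a trio of files usually fits together, assuming they follow the common 10x Genomics-style layout: a Matrix Market `.mtx` file of (gene index, cell index, count) triplets, plus one table of genes and one of barcodes. The gene names, barcodes and counts below are made up, and the files attached to the GEO entry may be organised differently.

```shell
# Toy 10x-style files: 3 genes x 2 cells (made-up names and counts)
tmpdir=$(mktemp -d)
printf 'GENE_A\nGENE_B\nGENE_C\n' > "$tmpdir/genes.tsv"      # one gene per matrix row
printf 'CELL_1\nCELL_2\n' > "$tmpdir/barcodes.tsv"           # one cell per matrix column
cat > "$tmpdir/matrix.mtx" <<'EOF'
%%MatrixMarket matrix coordinate integer general
3 2 4
1 1 5
3 1 2
2 2 7
3 2 1
EOF
# In matrix.mtx: lines starting with % are comments, the first data line gives
# (number of rows, number of columns, number of non-zero entries),
# and every following line is (row index, column index, count).

# Sum the counts per cell, labelling each column with its barcode
totals=$(awk '
  FNR == NR { barcode[FNR] = $1; n = FNR; next }  # first file: barcodes
  /^%/ { next }                                   # skip comment lines
  !seen_dims { seen_dims = 1; next }              # skip the dimension line
  { total[$2] += $3 }                             # accumulate counts per column
  END { for (j = 1; j <= n; j++) print barcode[j], total[j] }
' "$tmpdir/barcodes.tsv" "$tmpdir/matrix.mtx")
echo "$totals"
```

Only non-zero counts are stored, which is why this sparse format stays small even for thousands of cells.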
For those of you who are already familiar with the basics, you can fast-forward through this lab and start working on scRNAseq data directly. The script in bin/prepare_Ernst.R is a template to process a publicly available scRNAseq dataset. You can start exploring it to see if you understand the different chunks of code and their importance. All the content from this template will eventually be covered in the next labs.