http://54.191.190.243:8787
3 Lab 1: Familiarizing yourself with the course AWS instance
The estimated time for this lab is around 1h15.
- Access to the workshop AWS instance.
- Familiarize yourself with the RStudio interface.
- Learn how to use the terminal within RStudio.
- Learn how to download data from the web using wget.
3.1 Connect to RStudio Server
Most single-cell RNA-seq analysis takes place in either Python or R. Here, we focus on how to leverage R to investigate scRNAseq data. RStudio is an IDE (Integrated Development Environment, in other words: a nice graphical interface to run R-related commands).
For this workshop, we have installed R and RStudio on AWS. We can use RStudio directly (actually RStudio Server, since it is installed on a remote AWS server). Simply open a browser and copy-paste the following address: http://54.191.190.243:8787
An RStudio login page will appear; to log in, use your user ID as both ID and password.
Notice that once you are logged in to RStudio, there are multiple panels. Familiarize yourself with the different panels.
The interactive R console is generally found in the bottom left corner of RStudio (though it may sometimes be in another corner). All the rest (history panel, environment panel, directory explorer panel, editor panel) are extra features provided by RStudio.
R:
Within the R console, you can safely use R-dedicated commands. Do you know the most common ones? The semantics are a bit different from the terminal commands you may be used to…
R
getwd() # equivalent of `pwd` in terminal
dir.create('~/data/') # equivalent of `mkdir ~/data/` in terminal
setwd("~/data/") # equivalent of `cd ~/data/` in terminal
list.files("~/data/") # equivalent of `ls` in terminal
download.file("...") # equivalent of `wget ...` in terminal
3.2 Use an AWS terminal within RStudio
A general issue with bioinformatic analyses stems from the fact that nobody works in the same environment:
- Are you working on Mac? Linux? Windows?
- Do you have a lot of computational power? Perhaps a GPU card?
- Are you connected to the Internet? With a fast connection? Are you working behind a proxy?
To ensure that we are all working in the same environment, we rely on AWS (Amazon Web Services) EC2 (Elastic Compute Cloud) instances. EC2 instances are “virtual” computers to which you can connect remotely from a local computer.
The instance is shared by everybody. We are thus all using the same “computer”; this means:
- Shared resources
- Same access to shared files
- Same access to system-wide software and conda environments
The easiest way for us to launch bash commands from a terminal in AWS is to do it through RStudio. You can open a terminal directly from within RStudio as follows: go to Tools > Terminal > New terminal. This should open a new tab in the bottom left corner (next to the R console).
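Once the terminal tab is open, a few standard commands give a quick picture of the shared machine (operating system, cores, disk space). This is only a minimal sketch: the variable names are illustrative, and `nproc` is assumed to be available, which is the case on most Linux systems.

```shell
# Which operating system kernel is the instance running? (e.g. Linux)
os_name=$(uname -s)
echo "OS: $os_name"

# Number of CPU cores, shared by every participant
# (nproc ships with GNU coreutils; skipped if it is missing)
if command -v nproc >/dev/null 2>&1; then
  nproc
fi

# Free disk space on the filesystem holding your home folder
# (falls back to the current directory if HOME is not set)
df -h "${HOME:-.}"
```

Since the instance is shared, the numbers reported here are common to everybody connected to it.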
R console versus terminal:
From here onwards, be sure you completely understand the difference between the “R console” and the “terminal” (or “shell”). They are entirely different things, and both can be accessed within RStudio. It is crucial that you understand the difference between the two, to avoid confusion for the rest of the course.
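One way to see the difference from the terminal side: bash interprets what you type there, and R code has to be handed explicitly to an R interpreter. The sketch below assumes `Rscript` is on the `PATH` (it normally comes with any R installation) and skips the R part otherwise.

```shell
# In the terminal, bash runs the command directly
shell_dir=$(pwd)
echo "bash says: $shell_dir"

# getwd() is R code, not bash: typed as-is in the terminal it would fail.
# It has to be passed to an R interpreter, here through Rscript.
if command -v Rscript >/dev/null 2>&1; then
  Rscript -e 'cat("R says:", getwd(), "\n")'
else
  echo "Rscript not found on this machine"
fi
```

Conversely, typing `pwd` in the R console raises an error, because R looks for an R object called `pwd`.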
3.3 Basic terminal commands
The same bash commands are available in the AWS terminal, regardless of whether you access the terminal from RStudio or through ssh.
One can list files, download files, check help pages, …, just like in R.
- Check your present working directory
bash
pwd
- Check history
bash
history
- Put history into a history.txt file
bash
history > history.txt
- Make a new folder called data
bash
mkdir data
- Go to the new data directory
bash
cd data
- Move the history.txt file into the data directory
bash
mv ../history.txt ./
- Check the manual page of the wget command (hit q to exit)
bash
man wget
- Check specific help for the cellranger command and subcommands
bash
cellranger --help
cellranger count --help
- Redirect the cellranger count help output into a file called cellranger-help.txt
bash
cellranger count --help > cellranger-help.txt
- Download a file from the Internet with wget
bash
wget https://cf.10xgenomics.com/supp/cell-exp/cellranger-tiny-bcl-1.2.0.tar.gz
- List all files in a folder
bash
ls -l ~/Share/
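The file-handling steps above can be chained into a single run. The sketch below works in a throwaway directory created with `mktemp -d` so that nothing in your home folder is modified; the scratch directory, the placeholder content of history.txt and the `where-am-i.txt` file name are all assumptions made for illustration.

```shell
# Scratch directory, so that the home folder is left untouched (assumption)
scratch=$(mktemp -d)
cd "$scratch"

# Stand-in for `history > history.txt`:
# `history` returns nothing outside an interactive session
printf 'pwd\nmkdir data\n' > history.txt

mkdir data              # create the folder
cd data                 # move into it
mv ../history.txt ./    # bring the file into the current directory
pwd > where-am-i.txt    # `>` redirects standard output into a file
ls -l                   # both files should now be listed
```

Note that `>` overwrites the target file if it already exists; `>>` appends to it instead.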
Download the git repository for this course from GitHub:
bash
git clone https://github.com/js2264/scRNAseq_Physalia_2024.git
This downloads the repository for this course to your home folder on the AWS machine.
To get it on your local computer (to save the lectures and exercises), go to the GitHub repo page, click on the green Code button, then Download ZIP. Beware, the download may take a significant amount of time depending on your internet connection (several hundred MB).
3.4 Single-cell RNA-seq datasets
“This is a course about single-cell RNA-seq analysis, after all, so where is my data?”
Ok, “your” data is (most likely) yet to be sequenced! Or maybe you’re interested in digging into already existing databases! I mean, who isn’t interested in this mind-blowing achievement from 10X Genomics??
Human Cell Atlas is probably a good place to start digging, if you are interested in mammal-related studies. For instance, let’s say I am interested in epididymis differentiation. Boom: here is an entry from the HCA focusing on epididymis: link to HCA data portal.
3.4.1 Raw fastq reads from GEO
Here is the link to the actual paper studying epididymis:
An atlas of human proximal epididymis reveals cell-specific functions and distinct roles for CFTR.
There are several ways to find this information, e.g. with the ffq command-line tool or the web-based sra-explorer page (here). You will generally need the corresponding GEO ID or SRA project ID (e.g. SRPxxxxxx…).
3.4.2 Processed count matrices
Researchers often provide a filtered count matrix when they publish scRNAseq experiments (along with the mandatory raw fastq data, of course). It is much lighter than fastq reads, and lets you move on to downstream analyses a lot quicker. So how do you get these matrices? The Human Cell Atlas Consortium provides many processed datasets. For instance, in our case, the Leir et al study is available at the following link. GEO also hosts processed files.
- Find GEO-hosted processed files for the Leir et al study.
You can download some of the processed files available in GEO from the following webpage. Scrolling down to the bottom of the page, there is a box labelled “Supplementary data”. By clicking on “(custom)”, a list of extra supplementary files will appear.
- Download and check the content of the count matrix, the genes and the barcodes files.
- What type of information does each file contain? How is it formatted? Is it easily imported into R?
- How many cells were sequenced? How many genes were counted?
- Is it easy to interpret the count matrix? Why is it in such format?
- Comment on the file sizes between processed count matrix files and raw reads.
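As a starting point for these questions, the sketch below shows how such files can be inspected from the terminal. It builds a tiny mock dataset rather than using the real files: the file names, barcodes, gene names and counts are all made up for illustration and are NOT the Leir et al data. It assumes the count matrix is stored in the Matrix Market (`.mtx`) sparse format, in which the first non-comment line gives the number of rows, columns and non-zero entries.

```shell
# Tiny mock dataset (hypothetical content, for illustration only)
demo=$(mktemp -d)
printf 'CELL_BARCODE_1\nCELL_BARCODE_2\nCELL_BARCODE_3\n' > "$demo/barcodes.tsv"
printf 'ID_FAKE_1\tGENE_A\nID_FAKE_2\tGENE_B\n' > "$demo/genes.tsv"
printf '%%%%MatrixMarket matrix coordinate integer general\n2 3 3\n1 1 4\n2 1 1\n2 3 7\n' > "$demo/matrix.mtx"

# One barcode per line, so the line count is the number of cells
n_cells=$(wc -l < "$demo/barcodes.tsv" | tr -d ' ')
# One gene per line, so the line count is the number of genes
n_genes=$(wc -l < "$demo/genes.tsv" | tr -d ' ')
# First non-comment line of the .mtx file: rows (genes), columns (cells), non-zero entries
dims=$(grep -v '^%' "$demo/matrix.mtx" | head -n 1)
echo "cells: $n_cells, genes: $n_genes, mtx dimensions: $dims"

# Each remaining line is "gene_index cell_index count"; zero counts are not stored
head -n 5 "$demo/matrix.mtx"

# Human-readable file sizes
ls -lh "$demo"
```

Files downloaded from GEO are usually gzip-compressed; in that case `zcat file.gz | head` lets you peek at the content without decompressing the file on disk.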
3.5 Bonus
For those of you who are already familiar with the basics, you can fast-forward through this lab and start working on scRNAseq data directly. The script in bin/prepare_Ernst.R is a template to process a publicly available scRNAseq dataset. You can start exploring it to see if you understand the different chunks of code and their importance. All the content from this template will eventually be covered in the next labs.