An introduction to singlecell RNAseq¶
Written by Sina Booeshaghi* and Lior Pachter*. Based on material taught in Caltech course Bi/BE/CS183 by Lior Pachter and Matt Thomson, with contributions from Sina Booeshaghi, Lambda Lu, Jialong Jiang, Eduardo Beltrame, Jase Gehring, Ingileif Hallgrímsdóttir and Valentine Svensson.¶
*Division of Biology and Biological Engineering, California Institute of Technology¶
The rapid development of singlecell genomics methods starting in 2009 has created unprecedented opportunity for highly resolved measurements of cellular states. Among such methods, singlecell RNAseq (scRNAseq) is having a profound impact on biology. Here we introduce some of the key concepts of singlecell RNAseq technologies, with a focus on droplet based methods.
To learn how to preprocess and analyze singlecell RNAseq explore the following Google Colab notebooks that explain how to go from reads to results:
 Preprocessing and quality control [Python, R]
 Getting started with analysis [Python, R]
 Building and annotating an atlas [Python, R]
The kallistobus.tools tutorials site has a extensive list of tutorials and vignettes on singlecell RNAseq.
Setup¶
This notebook is a "living document". It downloads data and performs computations. As such it requires the installation of some python packages, which are installed with the commands below. In addition to running on Google Colab, the notebook can be downloaded and run locally on any machine which has python3 installed.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 

Motivation¶
The goal of singlecell transcriptomics is to measure the transcriptional states of large numbers of cells simultaneously. The input to a scRNAseq method is a collection of cells, possibly from intact tissue, or in dissociated form. Formally, the desired output is a transcripts x cells or genes x cells matrix that describes, for each cell, the abundance of its constituent transcripts or genes. More generally, singlecell genomics methods seek to measure not just transcriptional state, but other modalities in cells, e.g. protein abundances, epigenetic states, cellular morphology, etc.
The ideal singlecell technology should thus:
 Be universal in terms of cell size, type and state.
 Perform in situ measurements.
 Have no minimum input requirements.
 Assay every cell, i.e. have a 100% capture rate.
 Detect every transcript in every cell, i.e. have 100% sensitivity.
 Identify individual transcripts by their fulllength sequence.
 Assign transcripts correctly to cells, e.g. no doublets.
 Be compatible with additional multimodal measurements.
 Be cost effective per cell.
 Be easy to use.
 Be open source so that it is transparent, and results from it reproducible.
There is no method satisfying all of these requirements, however progress has been rapid. The development of singlecell RNAseq technologies and their adoption by biologists, has been remarkable. Svensson et al. 2019 describes a database of articles which present singlecell RNAseq experiments, and the graph below, rendered from the current version of the database, makes clear the exponential growth in singlecell transcriptomics:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 

There are many different scRNAseq technologies in use and under development, but broadly they fall into a few categories  wellbased methods (e.g. Fluidigm SMARTer C1, Smartseq2)  dropletbased methods (e.g. Dropseq, InDrops, 10X Genomics Chromium)  spatial transcriptomics approaches (e.g. MERFISH, SEQFISH)
At the time of initial writing of this document (2019), dropletbased approaches have become popular due to their relative lowcost, easy of use, and scalability. This is evident in a breakdown of articles by technology used:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 

<ggplot: (9223363288449985206)>
We therefore restrict this exposition to dropletbased technologies.
Dropletbased methods¶
Droplet based singlecell RNAseq methods were popularized by a pair of papers published concurrently in 2015:  Macosko et al., Highly parallel genomewide expression profiling of individual cells using nanoliter droplets, 2015. DOI:10.1016/j.cell.2015.05.002  describes Dropseq.  Klein et al., Droplet barcoding for singlecell transcriptomics applied to embryonic stem cells, 2015. DOI:10.1016/j.cell.2015.04.044  descibes inDrops.
Both of the methods makes use of developments in microfluidics published in:  Song, Chen, Ismagilov, Reactions in droplets in microfluidic channels, 2006. DOI:10.1002/anie.200601554  Guo, Rotem, Heyman and Weitz, Droplet microfluidics for highthroughput biological assays, 2012. DOI:10.1039/C2LC21147E
Overview¶
An overview of how a droplet based scRNAseq method works is illustrated in a figure from the Dropseq Macosko et al. 2015 paper:
A microfluidic device is used to generate an emulsion, which consists of aqueous droplets in oil. The droplets are used to encapsulate cells, beads and reagents. In other words, each droplet is a "mini laboratory" in which the RNA from a singlecell can be captured and prepared for identification. Thus, the consistuent parts are as follows:
 an emulsion (white circles containing beads and cells on the right handside of the figure).
 dissociated cells (depicted as oddshaped colored objects in the figure).
 beads (flowing in from the left hand side of the figure).
Emulsions¶
The foundation of droplet based singlecell RNAseq methods are monodispersed emulsions. Monodispersed refers to the requirements that droplets are of (near) uniform size. Monodispersed emulsions can be generated with a microfluidic device, as shown below. The droplets are being "pinched off" at the junction, and one can see a polystyrene bead being captured in one droplet, while others are empty.
The movie is from the McCarolll Dropseq tutorial courtesy of Patrick Stumpf, Matthew RoseZerilli, Rosanna Smith, Martin Fischlechner & Jonathan West at the Centre for Hybrid Biodevices & Cancer Sciences Unit at the University of Southampton.
Beads¶
The figure above, reproduce from Klein et al. 2015, shows the procedure used to make hydrogel beads for inDrops. Every bead contains the same barcode sequence, while the barcode sequences on two different beads are distinct.
The barcode and UMI structure for a variety of technologies is viewable in a compilation by Xi Chen.
Single cell suspensions¶
In order to assay the transcriptomes of individual cells with dropletbased singlecell RNAseq technologies, it is necessary to first dissociate tissue. Procedures for tissue dissociation are varied, and highly dependent on the organism, type of tissue, and many other factors. Protocols may be be enzymatic, but can also utilize mechanical dissociators. The talk below provides an introduction to tissue handling and dissociation.
1 2 3 

Statistics of beads & cells in droplets¶
The binomial distribution¶
An understanding of dropletbased singlecell RNAseq requires consideration of the statistics describing the capture of cells and beads in droplets. Suppose that in an experiment multiple droplets have been formed, and focus on one of the droplets. Assume that the probability that any single one of $n$ cells were captured inside it is $p$. We can calculate the probability that $k$ cells have been captured in the droplet as follows:
$$ \mathbb{P}({\mbox Droplet\ contains\ k\ cells}) = \binom{n}{k}p^k(1p)^{nk}.$$
The expected number of cells in the droplet is
$$\lambda := \sum_{k=0}^n k \binom{n}{k}p^k(1p)^{nk} = n \cdot p.$$
We plot this distribution on number of cells in a droplet below. It is called the Binomial distribution and has two parameters: $n$ and $p$.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 

With $n=10$ and $p=0.02$, it's quite probable that the droplet is empty, and while possible that it contains one cell, unlikely that it has 2 or more. This is a good regime for a singlecell experiment; we will see that it is problematic if two cells are captured in a single droplet. Empty droplets are not problematic in the sense that they will not lead to data, and can therefore be ignored.
The Poisson distribution¶
The Binomial distribution can be difficult to work with in practice. Suppose, for example, that $n=1000$ and $p=0.002$. Suppose that we are interested in the probability of seeing 431 cells in a droplet. This probability is given by
$$\binom{1000}{421}0.02^{421}(10.02)^{1000431},$$
which is evidently a difficult number to calculate exactly.
A practical alternative to the binmomial is the Poisson distribution. The Poisson distribution has one parameter, and its support is the nonnegative integers. A random variable $X$ is Poisson distributed if $$\mathbb{P}(X=k)\quad = \quad \frac{e^{\lambda}\lambda^k}{k!}.$$
The Poisson limit theorem states that if $p_n$ is a sequence of real numbers in $[0,1]$ with the sequence $np_n$ converging to to a finite limit $\lambda$, then $${\mbox lim}_{n \rightarrow \infty} \binom{n}{k}p_n^{k}(1p_n)^{nk} = e^{\lambda}\frac{\lambda^k}{k!}.$$
Thus, the Poisson distribution serves as a useful, tractable distribution to work with in lieu of the Binomial distribution for large $n$ and small $p$.
The histogram below can be used to explore the Poisson and its relationship to the binomial
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 

We therefore posit that
$$ \mathbb{P}({\mbox Droplet\ contains\ k\ cells}) = \frac{e^{\lambda}\lambda^k}{k!}$$ and
$$ \mathbb{P}({\mbox Droplet\ contains\ j\ beads}) = \frac{e^{\mu}\mu^j}{j!}.$$
Droplet tuning¶
Cell capture and bead overload¶
The cell capture rate is the probability that a droplet has at least one bead, and is given by $1e^{\mu}$.
The bead overload rate is the rate at which captured single cells are associated with two or more different barcodes, which will happen when multiple beads are loaded into a droplet with one cell. The probability this happens is $$\frac{1e^{\mu}\mu e^{\mu}}{1e^{\mu}}.$$
This leads to a tradeoff, as shown below.
1 2 3 4 5 6 7 8 9 10 11 12 13 

SubPoisson loading¶
In order to circumvent the limit posed by a Poisson process for beads in droplets, the inDrops method uses tightly packed hydrogel beads that can be injected into droplets without loss. This approach, which leads to "subPoisson loading" is also used by 10X Genomics, and allows for increased capture rate.
The difference is shown in two videos from the Abate lab linked to below. The first video, shows beads loaded being loaded in droplets with Poisson statistics:
1 2 

The next video shows subPoisson loading with hydrogel beads. In this case the flow rate has been set so that exactly two beads are situated in each droplet.
1 2 

The following shows the types of beads used for different dropletbased scRNAseq methods, and associated properties:
Property  Dropseq  inDrops  10x genomics 

Bead material  Polystyrene  Hydrogel  Hydrogel 
Loading dynamics  Poisson  subPoisson  subPoisson 
Dissolvable  No  No  Yes 
Barcode release  No  UV release  Chemical release 
Customizable  Demonstrated  Not shown  Feasible 
Licensing  Open source  Open Source  Proprietary 
Availability  Beads are sold  Commercial  Commercial 
Barcode collisions¶
Barcode collisions arise when two cells are separately encapsulated with beads that happen to contain identical barcodes.
For $n$ assayed cells with $m$ barcodes, the barcode collision rate is the expected proportion of assayed cells that did not receive a unique barcode, i.e.
$$1\frac{\mathbb{E}[\mbox{cells with a unique barcode}]}{\mbox{number of cells}}$$
$$= 1(1\frac{1}{m})^{n1} \approx 1\left(\frac{1}{e}\right)^\frac{n}{m}.$$
Avoiding barcode collisions requires high barcode diversity, i.e. a small ratio of $\frac{n}{m}$.
1 2 3 4 5 6 7 8 9 10 11 12 13 

Barcode diversity and length¶
A 1% barcode collision rate requires a barcode diversity of ~1%, i.e. the number of barcodes should be 100 times the number of cells. The number of barcodes from a sequence of length $L$ is $4^L$. Therefore, to assay $n$ cells, the barcode sequence must be of length at least $log_4n+3\frac{1}{3}$. This is a minimum and does not account for the need to be robust to sequencing errors.
Technical doublets¶
Technical doublets arise when two or more cells are captured in a droplet with a single bead. The technical doublet rate is therefore the probability of capturing two or more cells in a droplet given that at least one cell has been captured in a droplet:
$\frac{1e^{\lambda}\lambda e^{\lambda}}{1e^{\lambda}}$.
Note that "overloading" a dropletbased singlecell experiment by loading more cells while keeping flow rates constant will increase the number of technical doublets due to an effective increase in $\lambda$ and also the number of synthetic doublets due to an increase in barcode diversity.
The barnyard plot¶
Technical doublet rates can be measured by experiments in which a mixture of cells from two different species are assayed together. For example, if mouse and human cells are pooled prior to singlecell RNAseq, the resultant reads ought to be assignable to either human or mouse. If a droplet contained a "mixed" doublet, i.e. two cells one of which is from human and the other from mouse, it will generate reads some of which can be aligned to mouse, and some to human.
An example from a 10X Genomics dataset (5k 1:1 mixture of fresh frozen human (HEK293T) and mouse (NIH3T3) cells) is shown in the plot below, which is called a Barnyard plot in Macosko et al. 2015.
1 2 3 4 

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 

THe plot shows that there are only 7 doublets out of 5,000 cells in this experiment. This is an unusually small number and atypical of most experiments, where doublet rates are between 5%15% (see DePasquale et al. 2018); perhaps the 5k human mouse PBMC dataset data is articularly "clean" as it is an advertisement distributed by 10X Genomics.
Bloom's correction¶
The 7 doublets identifiable by eye in the plot above are all mixed doublets, i.e. they contain one human and one mouse cell. However doublets may consist of two mouse cells, or two human cells. If the number of droplets containing at least one human cells is $n_1$, the number containing at least one mouse cell is $n_2$, and the number of mixed doublets is $n_{1,2}$, then an estimate for the actual doublet rate can be obtained from the calculation below (Bloom 2018):
Given $n_1, n_2$ and $n_{1,2}$ as described above (note that $n_1$ is the number of cells on the x axis + the number of mixed doublets and $n_2$ is the number of cells on the y axis + the number of mixed doublets), then in expectation
$$\frac{n_1}{N} \cdot \frac{n_2}{N} = \frac{n_{1,2}}{N}, $$
where $N$ is the total number of droplets. From this we see that
$$ \hat{N} = \frac{n_1 \cdot n_2}{n_{1,2}}.$$
This is the maximum likelihood LincolnPetersen estimator for population size from mark and recapture.
Let $\mu_1$ nad $\mu_2$ be the Poisson rates for the respective types of cells, i.e. the average number of cells of each type per droplet. Then
$$ \hat{\mu}_1 = \mbox{ln } \left( \frac{Nn_1}{N} \right)$$ and $$ \hat{\mu}_2 = \mbox{ln } \left( \frac{Nn_2}{N} \right).$$
From this the doublet rate $D$ can be estimated as
$$\hat{D} = 1  \frac{(\mu_1+\mu_2)e^{\mu_1+\mu_2}}{1e^{\mu_1\mu_2}}.$$
Biological doublets¶
Biological doublets arise when two cells form a discrete unit that does not break apart during disruption to form a singlecell suspension. Note that biological doublets cannot be detected in barnyard plots.
One approach to avoiding biological doublets is to perform singlenuclei RNAseq. See, e.g. Habib et al., 2017. However, biological doublets are not necessarily just a technical artifact to be avoided. Halpern et al., 2018 utilizes biological doublets of hepatocytes and liver endothelial cells to assign tissue coordinates to liver endothelial cells via imputation from their hepatocyte partners.
Unique Molecular Identifiers¶
The number of distinct UMIs on a bead in a droplet is at most $4^L$ where $L$ is the number of UMI bases. For example, for 10X Genomics v2 technology $L=10$ and for 10X Genomics v3 technology $L=12$. Melsted, Booeshaghi et al. 2019 show how to estimate the number of the actual distinct UMIs on each bead for which data is obtained in a scRNAseq experiment.
Summary¶
Selection of a singlecell RNAseq method requires choosing among many tradeoffs that reflect the underlying technologies. The table below, from From Zhang et al. 2019. DOI:10.1016/j.molcel.2018.10.020, summarizes the three most popular dropletbased singlecell RNAseq assays:
The generation of singlecell RNAseq data is just the first step in understanding the transcriptomes cells. To interpret the data reads must be aligned or pseudoaligned, UMIs counted, and large cell x gene matrices examined. The growth in singlecell RNAseq analysis tools for these tasks has been breathtaking. The graph below, plotted from realtime data downloaded from the scRNAseq tools database, shows the number of tools published since 2016.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 

In fact, the rate of growth of singlecell RNAseq tools is similar to that of singlecell RNAseq studies:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 

Next step: to learn how to analyze singlecell RNAseq data, visit the kallistobus.tools site tutorials site and explore the "Introduction 1: preprocessing and quality control" notebook in Python or R.
Feedback: please report any issues, or submit pull requests for improvements, in the Github repository where this notebook is located.