Analysis of single-cell RNA-seq data: building and annotating an atlas¶
This Python notebook pre-processes the pbmc_1k v3 dataset from 10X Genomics with kallisto and bustools using kb, and then performs an analysis of the cell types and their marker genes.
If you use the methods in this notebook for your analysis please cite the following publications which describe the tools used in the notebook:
Melsted, P., Booeshaghi, A.S. et al. Modular and efficient pre-processing of single-cell RNA-seq. bioRxiv (2019). doi:10.1101/673285
Wolf, F. A., Angere, P. and Theis, F.J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biology (2018). doi:10.1186/s13059-017-1382-0
An R notebook implementing the same analysis is available here. See the kallistobus.tools tutorials site for additional notebooks demonstrating other analyses.
%%time# These packages are pre-installed on Google Colab, but are included here to simplify running this notebook locally!pipinstallmatplotlib!pipinstallscikit-learn!pipinstallnumpy!pipinstallscipy!pip3installleidenalg!pipinstalllouvain!pipinstallscanpy
%%time# Download the data from the 10x website!wgethttp://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar# unpack the downloaded files!tar-xvfpbmc_1k_v3_fastqs.tar
[2020-02-07 22:18:20,494] INFO Downloading files for human from https://caltech.box.com/shared/static/v1nm7lpnqz5syh8dyzdk2zs8bglncfib.gz to tmp/v1nm7lpnqz5syh8dyzdk2zs8bglncfib.gz[2020-02-07 22:20:56,952] INFO Extracting files from tmp/v1nm7lpnqz5syh8dyzdk2zs8bglncfib.gz
[2020-02-07 22:21:43,406] INFO Generating BUS file from[2020-02-07 22:21:43,407] INFO pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz[2020-02-07 22:21:43,407] INFO pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz[2020-02-07 22:21:43,407] INFO pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz[2020-02-07 22:21:43,407] INFO pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz[2020-02-07 22:35:44,867] INFO Sorting BUS file output/output.bus to tmp/output.s.bus[2020-02-07 22:36:18,698] INFO Whitelist not provided[2020-02-07 22:36:18,698] INFO Copying pre-packaged 10XV3 whitelist to output[2020-02-07 22:36:19,918] INFO Inspecting BUS file tmp/output.s.bus[2020-02-07 22:36:37,665] INFO Correcting BUS records in tmp/output.s.bus to tmp/output.s.c.bus with whitelist output/10xv3_whitelist.txt[2020-02-07 22:40:43,336] INFO Sorting BUS file tmp/output.s.c.bus to output/output.unfiltered.bus[2020-02-07 22:40:55,939] INFO Generating count matrix output/counts_unfiltered/cells_x_genes from BUS file output/output.unfiltered.bus[2020-02-07 22:41:04,570] INFO Converting matrix output/counts_unfiltered/cells_x_genes.mtx to h5ad output/counts_unfiltered/adata.h5ad[2020-02-07 22:41:12,718] INFO Filtering with bustools[2020-02-07 22:41:12,719] INFO Generating whitelist output/filter_barcodes.txt from BUS file output/output.unfiltered.bus[2020-02-07 22:41:12,904] INFO Capturing records from BUS file output/output.unfiltered.bus to tmp/output.filtered.bus with capture list output/filter_barcodes.txt[2020-02-07 22:41:15,006] INFO Sorting BUS file tmp/output.filtered.bus to output/output.filtered.bus[2020-02-07 22:41:23,062] INFO Generating count matrix output/counts_filtered/cells_x_genes from BUS file output/output.filtered.bus[2020-02-07 22:41:31,389] INFO Converting matrix output/counts_filtered/cells_x_genes.mtx to h5ad output/counts_filtered/adata.h5adCPU times: user 6.04 s, sys: 838 ms, total: 6.88 sWall time: 19min 59s
# load the unfiltered matrixresults_file='pbmc1k.h5ad'# the file that will store the analysis resultsadata=anndata.read_h5ad("output/counts_unfiltered/adata.h5ad")adata.var["gene_id"]=adata.var.index.valuest2g=pd.read_csv("t2g.txt",header=None,names=["tid","gene_id","gene_name"],sep="\t")t2g.index=t2g.gene_idt2g=t2g.loc[~t2g.index.duplicated(keep='first')]adata.var["gene_name"]=adata.var.gene_id.map(t2g["gene_name"])adata.var.index=adata.var["gene_name"]adata.var_names_make_unique()# this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`
# Create a plot showing genes detected as a function of UMI counts.fig,ax=plt.subplots(figsize=(7,7))x=np.asarray(adata.X.sum(axis=1))[:,0]y=np.asarray(np.sum(adata.X>0,axis=1))[:,0]ax.scatter(x,y,color="green",alpha=0.25)ax.set_xlabel("UMI Counts")ax.set_ylabel("Genes Detected")ax.set_xscale('log')ax.set_yscale('log')ax.set_xlim(1)ax.set_ylim(1)plt.show()
This plot is very misleading, as even the small alpha can't accurately show how many points are stacked at one location (This takes about a minute to run since there are a lot of points)
fig,ax=plt.subplots(figsize=(7,7))#histogram definitionbins=[1500,1500]# number of bins# histogram the datahh,locx,locy=np.histogram2d(x,y,bins=bins)# Sort the points by density, so that the densest points are plotted lastz=np.array([hh[np.argmax(a<=locx[1:]),np.argmax(b<=locy[1:])]fora,binzip(x,y)])idx=z.argsort()x2,y2,z2=x[idx],y[idx],z[idx]s=ax.scatter(x2,y2,c=z2,cmap='Greens')fig.colorbar(s,ax=ax)ax.set_xscale('log')ax.set_yscale('log')ax.set_xlabel("UMI Counts")ax.set_ylabel("Genes Detected")ax.set_xlim(1,10**5)ax.set_ylim(1,10**4)plt.show()
In this plot cells are ordered by the number of UMI counts associated to them (shown on the x-axis), and the fraction of droplets with at least that number of cells is shown on the y-axis:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
#@title Threshold cells according to knee plot { run: "auto", vertical-output: true }expected_num_cells=1178#@param {type:"integer"}knee=np.sort((np.array(adata.X.sum(axis=1))).flatten())[::-1]fig,ax=plt.subplots(figsize=(10,7))ax.loglog(knee,range(len(knee)),linewidth=5,color="g")ax.axvline(x=knee[expected_num_cells],linewidth=3,color="k")ax.axhline(y=expected_num_cells,linewidth=3,color="k")ax.set_xlabel("UMI Counts")ax.set_ylabel("Set of Barcodes")plt.grid(True,which="both")plt.show()
It is useful to examine mitochondrial genes, which are important for quality control. (Lun, McCarthy & Marioni, 2017) write that
High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.
Note you can also use the function pp.calculate_qc_metrics to compute the fraction of mitochondrial genes and additional measures.
Show those genes that yield the highest fraction of counts in each single cells, across all cells.
1
sc.pl.highest_expr_genes(adata,n_top=20)
normalizing counts per cellWARNING: Some cells have total count of genes equal to zero finished (0:00:00)
Begin by filtering cells according to various criteria. First, a filter for genes and cells based on minimum thresholds:
1
2
3
4
5
# Removes cells with less than 1070 umi countsadata=adata[np.asarray(adata.X.sum(axis=1)).reshape(-1)>1070]# Removes genes with 0 umi countsadata=adata[:,np.asarray(adata.X.sum(axis=0)).reshape(-1)>0]
1
adata
View of AnnData object with n_obs × n_vars = 1180 × 31837 var: 'gene_id', 'gene_name'
mito_genes=adata.var_names.str.startswith('MT-')# for each cell compute fraction of counts in mito genes vs. all genes# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)adata.obs['percent_mito']=np.sum(adata[:,mito_genes].X,axis=1).A1/np.sum(adata.X,axis=1).A1# add the total counts per cell as observations-annotation to adataadata.obs['n_counts']=adata.X.sum(axis=1).A1
# Create a mask to filter out cells with more than 6500 genes, less than 200 genes or less than 0.2 mitochondrial umi countsmask=np.logical_or((adata.obs.n_genes<6500).values,(adata.obs.n_genes>200).values,(adata.obs.percent_mito<0.2).values)
1
2
#filteradata=adata[mask,:]
1
adata
View of AnnData object with n_obs × n_vars = 1175 × 25953 obs: 'n_genes', 'percent_mito', 'n_counts' var: 'gene_id', 'gene_name', 'n_cells'
Total-count normalize (library-size correct) the data matrix $\mathbf{X}$ to 10,000 reads per cell, so that counts become comparable among cells.
1
2
# normalize counts in each cell to be equalsc.pp.normalize_total(adata,target_sum=10**4)
normalizing counts per cell finished (0:00:02)
Log the counts
1
2
# Replace raw counts with their logarithmsc.pp.log1p(adata)
/usr/local/lib/python3.6/dist-packages/scanpy/preprocessing/_simple.py:298: UserWarning: Revieved a view of an AnnData. Making a copy. view_to_actual(data)
Lets now look at the highest expressed genes after filtering, normalization, and log
1
sc.pl.highest_expr_genes(adata,n_top=20)
normalizing counts per cell finished (0:00:00)
Set the .raw attribute of AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.
The unnormalized data is stored in .raw.
1
adata.raw=adata
**Note**
The result of the following highly-variable-genes detection is stored as an annotation in `.var.highly_variable` and auto-detected by PCA and hence, `sc.pp.neighbors` and subsequent manifold/graph tools.
# flavor="cell_ranger" is consistent with Seurat and flavor="suerat" is not consistent with Seuratsc.pp.highly_variable_genes(adata,min_mean=0.01,max_mean=8,min_disp=1,n_top_genes=2000,flavor="cell_ranger",n_bins=20)
Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and denoises the data.
1
2
# We perform PCA on just the highly variable genessc.tl.pca(adata,svd_solver='arpack',use_highly_variable=True)
computing PCA with n_comps = 50 on highly variable genes finished (0:00:00)
1
sc.pl.pca(adata,color='CST3')
Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells, e.g. used in the clustering function sc.tl.leiden() or tSNE sc.tl.tsne(). In our experience, often, a rough estimate of the number of PCs does fine.
Next we compute the neighborhood graph of cells using the PCA representation of the data matrix. You might simply use default values here. In order to be consistent with Seurat's results, we use the following values.
1
sc.pp.neighbors(adata,n_neighbors=20,n_pcs=10)
computing neighbors using 'X_pca' with n_pcs = 10 finished (0:00:00)
UMAP (UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction) is a manifold learning technique that can also be used to visualize cells. It was published in:
McInnes, Leland, John Healy, and James Melville. "Umap: Uniform manifold approximation and projection for dimension reduction." arXiv preprint arXiv:1802.03426 (2018).
We run that to visualize the results:
1
2
3
tl.paga(adata)
pl.paga(adata, plot=False) # remove `plot=False` if you want to see the coarse-grained graph
tl.umap(adata, init_pos='paga')
There are many algorithms for clustering cells, and while they have been compared in detail in various benchmarks (see e.g., Duo et al. 2018), there is no univerally agreed upon method. Here we demonstrate clustering using Louvain clustering, which is a popular method for clustering single-cell RNA-seq data. The method was published in
Blondel, Vincent D; Guillaume, Jean-Loup; Lambiotte, Renaud; Lefebvre, Etienne (9 October 2008). "Fast unfolding of communities in large networks". Journal of Statistical Mechanics: Theory and Experiment. 2008 (10): P10008.
Note that Louvain clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.
A key aspect of annotating a cell atlas is identifying "marker genes". These are genes specific to individual clusters that "mark" them, and are important both for assigning functions to cell clusters, and for designing downstream experiments to probe activity of clusters.
A gene marker analysis begins with ranking genes in each cluster according to how different they are relative to other clusters. Typically the t-test is used for this purpose.
With the exceptions of IL7R, which is only found by the t-test and FCER1A, which is only found by the other two appraoches, all marker genes are recovered in all approaches.
Louvain Group
Markers
Cell Type
0
IL7R
CD4 T cells
1
CD14, LYZ
CD14+ Monocytes
2
MS4A1
B cells
3
GNLY, NKG7
NK cells
4
FCGR3A, MS4A7
FCGR3A+ Monocytes
5
CD8A
CD8 T cells
6
MS4A1
B cells
Let us also define a list of marker genes for later reference.
adata.write(results_file,compression='gzip')# `compression='gzip'` saves disk space, but slows down writing and subsequent reading
Get a rough overview of the file using h5ls, which has many options - for more details see here. The file format might still be subject to further optimization in the future. All reading functions will remain backwards-compatible, though.
If you want to export to "csv", you have the following options:
1
2
3
4
5
6
7
8
9
10
11
# Export single fields of the annotation of observations# adata.obs[['n_counts', 'louvain_groups']].to_csv(# './write/pbmc3k_corrected_louvain_groups.csv')# Export single columns of the multidimensional annotation# adata.obsm.to_df()[['X_pca1', 'X_pca2']].to_csv(# './write/pbmc3k_corrected_X_pca.csv')# Or export everything except the data using `.write_csvs`.# Set `skip_data=False` if you also want to export the data.# adata.write_csvs(results_file[:-5], )
1
2
# Running time of the notebookprint("{:.2f} minutes".format((time.time()-start_time)/60))