Introduction to single-cell RNA-seq I: pre-processing and quality control
This R notebook demonstrates the use of the kallisto and bustools programs for pre-processing single-cell RNA-seq data (also available as a Python notebook). It streams in 1 million C. elegans reads, pseudoaligns them, and produces a cells x genes count matrix in about a minute. The notebook then performs some basic QC. It expands on a notebook prepared by Sina Booeshaghi for the Genome Informatics 2019 meeting, where he ran it in under 60 seconds during a one-minute "lightning talk".
# The quantification of single-cell RNA-seq with kallisto requires an index.
# Indices are species-specific and can be generated or downloaded directly with `kb`.
# Here we download a pre-made index for C. elegans (the idx.idx file) along with an auxiliary file (t2g.txt)
# that describes the relationship between transcripts and genes.
download.file("https://caltech.box.com/shared/static/82yv415pkbdixhzi55qac1htiaph9ng4.idx", destfile = "idx.idx")
download.file("https://caltech.box.com/shared/static/cflxji16171skf3syzm8scoxkcvbl97x.txt", destfile = "t2g.txt")
In this notebook we pseudoalign 1 million C. elegans reads and count UMIs to produce a cells x genes matrix. The reads are located at XXX and, instead of being downloaded, are streamed directly to the Google Colab notebook for quantification.
See this blog post for more details on how the streaming works.
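To make the idea of "counting UMIs to produce a cells x genes matrix" concrete, here is a minimal base R sketch on a handful of made-up records. It is not what kallisto and bustools do internally (they also handle barcode error correction and reads compatible with several genes, which this sketch ignores); the barcodes, UMIs and gene names are hypothetical and chosen only for illustration.

```r
# Toy records, one row per pseudoaligned read (all values are made up).
reads <- data.frame(
  barcode = c("AAAC", "AAAC", "AAAC", "AAAC", "GGTT", "GGTT", "GGTT"),
  umi     = c("TTGA", "TTGA", "CCAT", "GACT", "TTGA", "AGGC", "AGGC"),
  gene    = c("geneA", "geneA", "geneA", "geneB", "geneA", "geneB", "geneB"),
  stringsAsFactors = FALSE
)

# Reads that share barcode, UMI and gene are treated as copies of the same
# molecule, so duplicate rows are collapsed to a single row.
molecules <- unique(reads)

# Cross-tabulate the remaining molecules: rows are cells, columns are genes.
counts <- table(cell = molecules$barcode, gene = molecules$gene)
print(counts)
```

Each entry of `counts` is the number of distinct UMIs seen for that cell and gene, which is why duplicated reads of the same molecule do not inflate the count.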
The data consists of a subset of reads from GSE126954 described in the paper:
The "knee plot" is sometimes shown with the UMI counts on the y-axis instead of the x-axis, i.e. flipped and rotated 90 degrees. Make the flipped and rotated plot. Is there a reason to prefer one orientation over the other?
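As a starting point for this exercise, the sketch below draws both orientations side by side in base R. It uses simulated per-barcode UMI totals rather than the notebook's count matrix, so the variable names (`tot_counts`, `barcode_rank`) and the simulation parameters are assumptions made only for illustration; with real data, `tot_counts` would be the per-barcode sums of the count matrix.

```r
set.seed(1)

# Simulated total UMI counts per barcode (not real data): a few hundred
# barcodes with many UMIs and a large number of near-empty droplets.
tot_counts <- c(rnbinom(300, mu = 2000, size = 2),
                rnbinom(5000, mu = 5, size = 1))

# Drop zero-count barcodes (log axes cannot show them) and rank the rest
# from the largest to the smallest total.
tot_counts <- sort(tot_counts[tot_counts > 0], decreasing = TRUE)
barcode_rank <- seq_along(tot_counts)

par(mfrow = c(1, 2))

# UMI counts on the x-axis, barcode rank on the y-axis.
plot(tot_counts, barcode_rank, log = "xy", type = "l",
     xlab = "Total UMI count", ylab = "Barcode rank")

# Flipped orientation: barcode rank on the x-axis, UMI counts on the y-axis.
plot(barcode_rank, tot_counts, log = "xy", type = "l",
     xlab = "Barcode rank", ylab = "Total UMI count")
```

Both panels show the same curve; they differ only in which variable a threshold line is read from, which is worth keeping in mind when answering the question above.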
This notebook has demonstrated the pre-processing required for single-cell RNA-seq analysis. kb is used to pseudoalign reads and to generate a cells x genes matrix. Following generation of a matrix, basic QC helps to assess the quality of the data.
# Running time of the notebook
Sys.time() - start_time