kallisto and bustools manuals
The kallisto | bustools single-cell RNA-seq workflow requires two programs:
kallisto manual is available at: https://pachterlab.github.io/kallisto/download
bustools manual is available at: https://bustools.github.io/manual
Overview of the kallisto | bustools workflow
Description of associated files:
A pair of read files (read 1 and read 2) with .fastq.gz extensions is required. For example, SRR8599150_S1_L001_R2_001.fastq.gz is the read 2 file used in the getting started tutorial.
A set of target sequences, typically a reference transcriptome, is needed for pseudoalignment. Species transcriptomes can be downloaded from the Ensembl database page. They are usually around 100 MB in size. For example, the mouse transcriptome downloaded from Ensembl is named
A kallisto index (usually with an .idx extension) must be constructed from the reference transcriptome. Indices can be downloaded (as long as they match the reference transcriptome) or built with the kallisto index command (for details see the kallisto manual). Standard indices can usually be built on a laptop with 8 GB of RAM in 10–30 minutes, depending on reference transcriptome size and hardware specifications. The resulting index file will be under 4 GB in size. Prebuilt indices are available for the human transcriptome as well as many model organisms from the kallisto transcriptome indices website.
- A barcode whitelist with a .txt extension must be supplied for barcode error correction. For example, a subset of barcodes from the 10x Genomics v2 chemistry whitelist may look as follows:
AAACCTGAGAAACCAT
AAACCTGAGAAACCGC
AAACCTGAGAAACCTA
AAACCTGAGAAACGAG
AAACCTGAGAAACGCC
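To illustrate what whitelist-based error correction means, here is a minimal Python sketch. It assumes a simple rule: an observed barcode is kept if it is on the whitelist, and otherwise it is replaced only when exactly one whitelisted barcode lies within a single substitution of it. That rule and the function names are illustrative assumptions, not the bustools implementation.

```python
from typing import Iterable, Optional, Set

BASES = "ACGT"


def load_whitelist(lines: Iterable[str]) -> Set[str]:
    """Collect one barcode per non-empty line."""
    return {line.strip() for line in lines if line.strip()}


def correct_barcode(barcode: str, whitelist: Set[str]) -> Optional[str]:
    """Return the matching whitelisted barcode, or None if there is no unique match.

    Assumption for this sketch: only single-base substitutions are considered.
    """
    if barcode in whitelist:
        return barcode
    # Try every barcode that differs from the observed one at exactly one position.
    candidates = set()
    for i, observed_base in enumerate(barcode):
        for base in BASES:
            if base != observed_base:
                neighbor = barcode[:i] + base + barcode[i + 1:]
                if neighbor in whitelist:
                    candidates.add(neighbor)
    # Accept the correction only when it is unambiguous.
    return candidates.pop() if len(candidates) == 1 else None


# The five example barcodes from the whitelist subset shown above.
whitelist = load_whitelist(
    """AAACCTGAGAAACCAT
AAACCTGAGAAACCGC
AAACCTGAGAAACCTA
AAACCTGAGAAACGAG
AAACCTGAGAAACGCC""".splitlines()
)

exact = correct_barcode("AAACCTGAGAAACCAT", whitelist)      # already on the whitelist
fixed = correct_barcode("TAACCTGAGAAACCAT", whitelist)      # one mismatch, unique match
ambiguous = correct_barcode("AAACCTGAGAAACCAA", whitelist)  # two whitelist neighbors
```

In this sketch the ambiguous barcode is discarded rather than guessed, because it is one substitution away from two different whitelisted barcodes.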
- A transcript-to-gene .tsv file is required. This is a tab-separated values (TSV) file mapping each transcript Ensembl ID to its gene Ensembl ID. It may also include the gene name, but that is not required. Care must be taken to match the exact names in the reference transcriptome (e.g. Ensembl ID versions may or may not have been included). The t2g.py script can produce the transcript-to-gene file; a bustools command will be available shortly. For example, to make the transcript-to-gene file for the getting started tutorial with the mouse Mus_musculus.GRCm38.96.gtf GTF file, you’d use the command:
./t2g.py --use_version < Mus_musculus.GRCm38.96.gtf > transcripts_to_genes.txt
This will create the file transcripts_to_genes.txt. Some examples of such files are provided below.
Gene names, no Ensembl version:
ENSMUST00000162897	ENSMUSG00000051951	Xkr4
ENSMUST00000159265	ENSMUSG00000051951	Xkr4
ENSMUST00000161581	ENSMUSG00000089699	Gm1992
ENSMUST00000194643	ENSMUSG00000102343	Gm37381
No gene names and no Ensembl version:
ENSMUST00000162897	ENSMUSG00000051951
ENSMUST00000159265	ENSMUSG00000051951
ENSMUST00000161581	ENSMUSG00000089699
ENSMUST00000194643	ENSMUSG00000102343
No gene names and with Ensembl version:
ENSMUST00000162897.1	ENSMUSG00000051951
ENSMUST00000159265.2	ENSMUSG00000051951
ENSMUST00000161581.1	ENSMUSG00000089699
ENSMUST00000194643.1	ENSMUSG00000102343
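Because the transcript IDs in the transcript-to-gene file must match the names in the reference transcriptome exactly, a quick consistency check can catch a version mismatch early. The Python sketch below reads a two- or three-column tab-separated mapping and reports transcriptome IDs that have no entry in it. The function names are hypothetical, and the sketch assumes the transcript name is the first whitespace-delimited token of each FASTA header, as in Ensembl cDNA files.

```python
from typing import Dict, Iterable, Set, Tuple


def parse_t2g(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map transcript ID -> (gene ID, gene name); the gene name is "" if absent."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"expected at least two tab-separated columns: {line!r}")
        gene_name = fields[2] if len(fields) > 2 else ""
        mapping[fields[0]] = (fields[1], gene_name)
    return mapping


def missing_transcripts(fasta_headers: Iterable[str],
                        t2g: Dict[str, Tuple[str, str]]) -> Set[str]:
    """Return transcript IDs found in the FASTA headers but not in the mapping."""
    # Assumption: the ID is the first token after ">" in each header line.
    ids = {h[1:].split()[0] for h in fasta_headers if h.startswith(">") and len(h) > 1}
    return ids - set(t2g)


t2g = parse_t2g([
    "ENSMUST00000162897\tENSMUSG00000051951\tXkr4\n",
    "ENSMUST00000161581\tENSMUSG00000089699\n",
])

# A versioned header does not match an unversioned mapping entry.
versioned = missing_transcripts([">ENSMUST00000162897.1 cdna"], t2g)
unversioned = missing_transcripts([">ENSMUST00000162897 cdna"], t2g)
```

Here `versioned` contains `ENSMUST00000162897.1`, which signals that the mapping should be regenerated with versions (or the transcriptome names stripped of them), while `unversioned` is empty.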
bustools: once the read 1 and read 2 fastqs, the kallisto index, and the transcript-to-gene file are ready, the count matrix can be generated with just a few commands; see the getting started tutorial. A detailed description of each file you should have at each step is provided at the Getting Started Explained page.
- Enjoy your pre-processed data! The tutorials page contains examples of how to parse BUS files and/or count matrices for downstream analysis.
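
As a rough guide to how the input files described above fit together, the following Python sketch assembles the command lines for the kallisto and bustools step without running them. The file names (transcriptome.idx, whitelist.txt, read1.fastq.gz, read2.fastq.gz), the output directory layout, and the helper name are placeholders chosen for illustration. The flags are written as they are commonly used with kallisto bus, bustools correct, bustools sort, and bustools count; check them against the manuals for your installed versions.

```python
import shlex
from typing import List


def build_commands(index: str, t2g: str, whitelist: str, read1: str, read2: str,
                   out_dir: str = "bus_output", technology: str = "10xv2",
                   threads: int = 4) -> List[str]:
    """Return shell command lines as strings; nothing is executed here."""
    # Pseudoalign the reads and write a BUS file into out_dir.
    bus = ["kallisto", "bus", "-i", index, "-o", out_dir, "-x", technology,
           "-t", str(threads), read1, read2]
    # Directories for the count matrix and for temporary sort files.
    mkdir = ["mkdir", "-p", f"{out_dir}/genecount", f"{out_dir}/tmp"]
    # Correct barcodes against the whitelist, sort, then count per gene.
    correct = ["bustools", "correct", "-w", whitelist, "-p", f"{out_dir}/output.bus"]
    sort = ["bustools", "sort", "-T", f"{out_dir}/tmp/", "-t", str(threads), "-p", "-"]
    count = ["bustools", "count", "-o", f"{out_dir}/genecount/genes", "-g", t2g,
             "-e", f"{out_dir}/matrix.ec", "-t", f"{out_dir}/transcripts.txt",
             "--genecounts", "-"]
    pipeline = " | ".join(shlex.join(stage) for stage in (correct, sort, count))
    return [shlex.join(bus), shlex.join(mkdir), pipeline]


commands = build_commands(
    index="transcriptome.idx",           # placeholder index name
    t2g="transcripts_to_genes.txt",      # file produced by t2g.py above
    whitelist="whitelist.txt",           # placeholder whitelist name
    read1="read1.fastq.gz",              # placeholder read files
    read2="read2.fastq.gz",
)
for command in commands:
    print(command)
```

Printing the commands first makes it easy to confirm that the index, whitelist, and transcript-to-gene file are the ones intended before anything is run.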