The analysis of single-cell RNA-seq data involves a series of pre-processing steps: (1) association of reads with their cells of origin, (2) collapsing of reads according to unique molecular identifiers (UMIs), and (3) counting of the collapsed reads per feature to produce a feature-cell matrix.
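The three steps can be illustrated with a toy example. This is only a minimal sketch in Python, not how kallisto or bustools implement them: the reads, the gene names, and the function count_matrix are all hypothetical, and each read is assumed to have already been assigned to exactly one feature. Real data would also require handling of barcode errors and of reads compatible with multiple features.

```python
from collections import defaultdict

# Hypothetical toy reads: (cell barcode, UMI, feature the read was assigned to).
reads = [
    ("AAAC", "GGT", "geneA"),
    ("AAAC", "GGT", "geneA"),  # duplicate: same barcode, UMI, and feature
    ("AAAC", "TTC", "geneA"),
    ("AAAC", "CCA", "geneB"),
    ("TTTG", "GGT", "geneA"),
]


def count_matrix(reads):
    """Return a nested dict feature -> barcode -> number of unique molecules."""
    # Steps 1 and 2: the barcode ties each read to its cell, and reads that
    # share barcode, UMI, and feature are collapsed into a single molecule.
    unique_molecules = set(reads)

    # Step 3: count the collapsed molecules per feature and per cell.
    counts = defaultdict(lambda: defaultdict(int))
    for barcode, _umi, feature in unique_molecules:
        counts[feature][barcode] += 1
    return {feature: dict(cells) for feature, cells in counts.items()}


matrix = count_matrix(reads)
```

Here the two identical reads from cell AAAC collapse into one molecule, so geneA receives a count of 2 in that cell rather than 3.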

We recently introduced the BUS file format for single-cell RNA-seq data to facilitate the development of modular workflows for data pre-processing. It consists of a binary representation of barcode and UMI sequences from scRNA-seq reads, along with sets of equivalence classes obtained by pseudoalignment of reads to a reference transcriptome (hence the acronym Barcode, UMI, Set). We have implemented a command in kallisto called bus that efficiently generates BUS files from data produced by any single-cell RNA-seq technology. Tools for manipulating BUS files are provided as part of the bustools package.
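To give a sense of what a binary representation of barcode and UMI sequences can look like, the sketch below packs nucleotides into an integer using two bits per base. The specific bit assignment, the helper names encode_seq and decode_seq, and the example record are assumptions made for illustration; the BUS specification remains the authoritative description of the on-disk layout.

```python
# Assumed 2-bit code per nucleotide, chosen for illustration only.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def encode_seq(seq):
    """Pack a nucleotide string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODE[base]
    return value


def decode_seq(value, length):
    """Recover the nucleotide string; the length is needed because
    leading A bases are encoded as zero bits."""
    bases = []
    for _ in range(length):
        bases.append(DECODE[value & 3])
        value >>= 2
    return "".join(reversed(bases))


# A hypothetical record in the spirit of Barcode, UMI, Set:
# encoded barcode, encoded UMI, and an equivalence class identifier.
record = (encode_seq("ACGT"), encode_seq("TTGA"), 7)
```

Storing sequences as integers rather than text keeps records compact and of fixed size, which makes operations such as sorting by barcode and UMI straightforward.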

The kallisto | bustools workflow is described in detail in

