This page provides instructions for how to use kallisto | bustools to pre-process feature barcoded single-cell RNA-seq experiments. The tutorial explains the steps using as an example the 10x Genomics pbmc_1k_protein_v3 feature barcoding dataset. A complete Jupyter notebook showing all steps and analysis can be found here.

In feature barcoding assays, cellular data are recorded as short DNA sequences using procedures adapted from single-cell RNA-seq. The kITE (“kallisto Indexing and Tag Extraction”) workflow involves generating a “Mismatch Map” containing the sequences of all feature barcodes used in the experiment, as well as all of their single-base mismatches. The Mismatch Map is used to make “mismatch” transcript-to-gene (t2g) and fasta files to be used as inputs for kallisto. kallisto is used for indexing and pseudoalignment, and then bustools is used to search the sequencing data for the sequences in the Mismatch Map. This approach effectively co-opts the kallisto | bustools infrastructure for a different application.
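
To make the Mismatch Map idea concrete, the following Python sketch enumerates a barcode together with all of its single-base mismatches. It is an illustration only and not the code used by the kite repository; the function name single_base_mismatches is hypothetical, and the alphabet is assumed to be A, C, G and T.

```python
def single_base_mismatches(barcode, alphabet="ACGT"):
    """Return the barcode followed by every sequence that differs from it at exactly one position."""
    variants = [barcode]
    for i, original in enumerate(barcode):
        for base in alphabet:
            if base != original:
                # Substitute one alternative base at position i, leave the rest untouched.
                variants.append(barcode[:i] + base + barcode[i + 1:])
    return variants


variants = single_base_mismatches("AACAAGACCCTTGAG")
print(len(variants))  # 1 original + 15 positions x 3 alternative bases = 46
```

Under these assumptions, 17 barcodes of 15 bases each give 17 × 46 = 782 sequences, which is consistent with the entry count quoted in step 2.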

Note: in these instructions, commands to type at the command line are preceded by $. For example, if you see $ cd my_folder, type cd my_folder.

0. Install software

Obtain kallisto from the kallisto installation page, and bustools from the bustools installation page.

Prepare a folder and clone the kite GitHub repository:

$ mkdir kallisto_bustools_kite/
$ cd kallisto_bustools_kite/
$ git clone https://github.com/pachterlab/kite

1. Download materials

Download the following files:

  • 10xPBMC_1k_protein_v3 dataset
  • Antibody feature barcode sequences
  • 10x Chromium v3 chemistry barcode whitelist

$ wget http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_fastqs.tar
$ tar -xvf ./pbmc_1k_protein_v3_fastqs.tar
$ wget http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_protein_v3/pbmc_1k_protein_v3_feature_ref.csv
$ wget https://github.com/BUStools/getting_started/releases/download/species_mixing/10xv3_whitelist.txt

2. Make the mismatch FASTA and t2g files

Start by preparing a CSV-formatted table of Feature Barcode names and Feature Barcode sequences (saved here as FeatureBarcodes.csv). Do not include any common or constant sequences. For this tutorial, we parsed the feature_ref.csv file provided by 10x to give a properly formatted CSV (below). Example code for this step is included in the kite GitHub repo.

Feature Barcode name Feature Barcode sequence
CD3_TotalSeqB AACAAGACCCTTGAG
CD8a_TotalSeqB TACCCGTAATAGCGT
CD14_TotalSeqB GAAAGTCAAAGCACT
CD15_TotalSeqB ACGAATCAATCTGTG
CD16_TotalSeqB GTCTTTGTCAGTGCA
CD56_TotalSeqB GTTGTCCGACAATAC
CD19_TotalSeqB TCAACGCTTGGCTAG
CD25_TotalSeqB GTGCATTCAACAGTA
CD45RA_TotalSeqB GATGAGAACAGGTTT
CD45RO_TotalSeqB TGCATGTCATCGGTG
PD-1_TotalSeqB AAGTCGTGAGGCATG
TIGIT_TotalSeqB TGAAGGCTCATTTGT
CD127_TotalSeqB ACATTGACGCAACTA
IgG2a_control_TotalSeqB CTCTATTCAGACCAG
IgG1_control_TotalSeqB ACTCACTGGAGTCTC
IgG2b_control_TotalSeqB ATCACATCGTTGCCA
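
As a rough illustration of the parsing step, the sketch below pulls the name and barcode columns out of a feature reference file and writes them as a two-column CSV. It is not the example code from the kite repo. It assumes the 10x file has columns called name and sequence; the helper names and the pattern values in the in-memory example are hypothetical, while the two barcodes are taken from the table above.

```python
import csv
import io


def feature_ref_to_rows(feature_ref_text, name_col="name", seq_col="sequence"):
    """Extract (name, barcode) pairs from feature reference CSV text."""
    reader = csv.DictReader(io.StringIO(feature_ref_text))
    return [(row[name_col].strip(), row[seq_col].strip().upper()) for row in reader]


def rows_to_csv(rows, header=("Feature Barcode name", "Feature Barcode sequence")):
    """Format the pairs as two-column CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Tiny in-memory example; the column layout is an assumption about feature_ref.csv.
example = (
    "id,name,read,pattern,sequence,feature_type\n"
    "CD3,CD3_TotalSeqB,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,AACAAGACCCTTGAG,Antibody Capture\n"
    "CD8a,CD8a_TotalSeqB,R2,5PNNNNNNNNNN(BC)NNNNNNNNN,TACCCGTAATAGCGT,Antibody Capture\n"
)
print(rows_to_csv(feature_ref_to_rows(example)), end="")
```

For a real run, read the text of pbmc_1k_protein_v3_feature_ref.csv instead of the in-memory example and write the result to FeatureBarcodes.csv. Check the column names in your own file first.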


With the FeatureBarcodes.csv file ready, run featuremap.py, which creates a mismatch FASTA file and a mismatch t2g file for the experiment. Use the optional --header flag if the input CSV has a header in the first row. In this case the mismatch files have 782 entries generated from the 17 Feature Barcode sequences: each 15-base barcode contributes itself plus 45 single-base mismatches, and 17 × 46 = 782.

$ ./kite/featuremap/featuremap.py FeatureBarcodes.csv --header
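
One way to sanity-check the generated t2g file is to count how many entries map to each feature. The sketch below assumes the file is tab-separated with the entry name in the first column and the feature name in the second, as is usual for kallisto t2g files; the function name and the entry names in the example are made up, so compare them against your own FeaturesMismatch.t2g before relying on this.

```python
from collections import Counter


def entries_per_feature(t2g_text):
    """Count how many mismatch entries map to each feature in t2g-formatted text."""
    counts = Counter()
    for line in t2g_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[1]] += 1  # second column: the feature this entry collapses to
    return counts


# Hypothetical three-line excerpt; the entry names in column 1 are invented for illustration.
example_t2g = (
    "CD3_entry_0\tCD3_TotalSeqB\tCD3_TotalSeqB\n"
    "CD3_entry_1\tCD3_TotalSeqB\tCD3_TotalSeqB\n"
    "CD8a_entry_0\tCD8a_TotalSeqB\tCD8a_TotalSeqB\n"
)
print(entries_per_feature(example_t2g))
```

On the real file, a 15-base barcode would be expected to show 46 entries per feature if the layout assumption holds.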

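Note: kallisto only accepts odd values for the k-mer length, so if your Feature Barcodes have an even length, add the same constant base to one end of every barcode before running featuremap.py. For example, appending an A base changes a barcode such as AACAAGACCCTTGAG → AACAAGACCCTTGAGA (shown only to illustrate the edit; the 15-base barcodes in this tutorial are already odd in length and need no padding).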
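
The padding rule can be sketched in a few lines of Python. This is a minimal illustration, assuming all barcodes share one length and that padding goes on the 3' end; the function name pad_to_odd_length and the two 14-base sequences are hypothetical.

```python
def pad_to_odd_length(barcodes, pad_base="A"):
    """Append pad_base to every barcode if the shared barcode length is even."""
    lengths = {len(seq) for seq in barcodes.values()}
    if len(lengths) != 1:
        raise ValueError(f"barcodes have mixed lengths: {sorted(lengths)}")
    (length,) = lengths
    if length % 2 == 1:
        return dict(barcodes)  # already odd, nothing to do
    return {name: seq + pad_base for name, seq in barcodes.items()}


# Hypothetical 14-base barcodes, used only to show the padding.
even = {"tagA": "ACGTACGTACGTAC", "tagB": "TTGACCATGGCAAT"}
padded = pad_to_odd_length(even)
print(padded)
```

The padded length (15 in this example) is then the value to pass as -k when building the index.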

3. Build an index

Build the kallisto index from the mismatch FASTA file, setting the k-mer length -k to the length of the Feature Barcodes (15 in this tutorial):

$ kallisto index -i FeaturesMismatch.idx -k 15 ./FeaturesMismatch.fa

4. Run kallisto

Pseudoalign the reads:

$ kallisto bus -i FeaturesMismatch.idx -o ./ -x 10xv3 -t 4 \
./pbmc_1k_protein_v3_fastqs/pbmc_1k_protein_v3_antibody_fastqs/pbmc_1k_protein_v3_antibody_S2_L001_R1_001.fastq.gz \
./pbmc_1k_protein_v3_fastqs/pbmc_1k_protein_v3_antibody_fastqs/pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz \
./pbmc_1k_protein_v3_fastqs/pbmc_1k_protein_v3_antibody_fastqs/pbmc_1k_protein_v3_antibody_S2_L002_R1_001.fastq.gz \
./pbmc_1k_protein_v3_fastqs/pbmc_1k_protein_v3_antibody_fastqs/pbmc_1k_protein_v3_antibody_S2_L002_R2_001.fastq.gz
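
With more lanes, typing every FASTQ path by hand becomes error-prone. The sketch below shows one possible way to collect R1/R2 pairs from a directory and assemble the same kallisto bus call as an argument list. The helper names are hypothetical, and the file-name pattern (_R1_001.fastq.gz paired with _R2_001.fastq.gz) is assumed from the files used in this tutorial. Nothing is executed; the command is only printed.

```python
import shlex
from pathlib import Path


def paired_fastqs(fastq_dir):
    """Return FASTQ paths ordered R1, R2, R1, R2, ... sorted by file name."""
    files = []
    for r1 in sorted(Path(fastq_dir).glob("*_R1_001.fastq.gz")):
        r2 = r1.with_name(r1.name.replace("_R1_001", "_R2_001"))
        if not r2.exists():
            raise FileNotFoundError(f"missing mate file for {r1.name}")
        files.extend([str(r1), str(r2)])
    return files


def kallisto_bus_command(index, out_dir, fastqs, technology="10xv3", threads=4):
    """Assemble the kallisto bus call as a list of arguments (nothing is run here)."""
    return ["kallisto", "bus", "-i", index, "-o", out_dir,
            "-x", technology, "-t", str(threads), *fastqs]


# Placeholder file names, used only to show the shape of the resulting command.
cmd = kallisto_bus_command("FeaturesMismatch.idx", "./",
                           ["lane1_R1_001.fastq.gz", "lane1_R2_001.fastq.gz"])
print(shlex.join(cmd))
```

In practice, paired_fastqs would be pointed at the antibody FASTQ folder and its result passed to kallisto_bus_command. The list could then be handed to subprocess.run once the printed command looks right.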

5. Run bustools

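Correct the cell barcodes against the whitelist, sort the BUS file, and then count features. For bustools count, use the mismatch t2g file, so that with --genecounts each mismatch sequence is counted toward its parent Feature Barcode.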

$ bustools correct -w ./10xv3_whitelist.txt ./output.bus -o ./output_corrected.bus
$ bustools sort -t 4 -o ./output_sorted.bus ./output_corrected.bus
$ mkdir ./featurecounts/
$ bustools count -o ./featurecounts/featurecounts --genecounts -g ./FeaturesMismatch.t2g -e ./matrix.ec -t ./transcripts.txt ./output_sorted.bus

6. Load count matrices into notebook

See the Jupyter notebook for how to process the feature count matrix.
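
For a quick look at the counts outside the notebook, the following standard-library sketch reads the bustools count output into a nested dictionary. It assumes the output prefix produces .mtx, .barcodes.txt and .genes.txt files, and that matrix rows are cell barcodes and columns are features; verify both against your own output. The function names are hypothetical, and this is not the loading code used in the notebook.

```python
from pathlib import Path


def read_lines(path):
    """Read a text file into a list of stripped, non-empty lines."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]


def load_feature_counts(prefix):
    """Load bustools count output into {cell barcode: {feature: count}}."""
    barcodes = read_lines(f"{prefix}.barcodes.txt")
    features = read_lines(f"{prefix}.genes.txt")
    # Drop Matrix Market comment lines, which start with '%'.
    entries = [ln for ln in read_lines(f"{prefix}.mtx") if not ln.startswith("%")]
    n_rows, n_cols, _ = (int(x) for x in entries[0].split())
    if (n_rows, n_cols) != (len(barcodes), len(features)):
        raise ValueError("matrix shape does not match the barcode/feature lists")
    counts = {}
    for line in entries[1:]:
        row, col, value = line.split()
        cell = barcodes[int(row) - 1]  # Matrix Market indices are 1-based
        counts.setdefault(cell, {})[features[int(col) - 1]] = float(value)
    return counts


# Example call (requires the files written in step 5):
# counts = load_feature_counts("./featurecounts/featurecounts")
```

For full-size datasets a sparse-matrix reader such as scipy.io.mmread is likely a better fit than nested dictionaries.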