Pre-processing and analysis of feature barcode single-cell RNA-seq data with KITE.¶
In this notebook, we will perform pre-processing and analysis of 10x Genomics pbmc_1k_protein_v3 feature barcoding dataset using the Kallisto Indexing and Tag Extraction (KITE) workflow, implemented with a wrapper called kb. It was developed by Kyung Hoi (Joseph) Min and A. Sina Booeshaghi.
In Feature Barcoding assays, cellular data are recorded as short DNA sequences using procedures adapted from single-cell RNA-seq. The KITE workflow generates a "Mismatch Map" containing the sequences of all Feature Barcodes used in the experiment as well as all of their single-base mismatches. The Mismatch Map is used to produce transcipt-to-gene (.t2g) and fasta (.fa) files to be used as inputs for kallisto. An index is made with kallisto index, then kallisto | bustools effectively searches the sequencing data for the sequences in the Mismatch Map.
CPU times: user 477 ms, sys: 86.2 ms, total: 563 ms
Wall time: 1min
Then, we verify the integrity of the files we downloaded to make sure they were not corrupted during the download.
1
!md5sum -c checksums.txt --ignore-missing
pbmc_1k_protein_v3_antibody_S2_L001_R1_001.fastq.gz: OK
pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz: OK
pbmc_1k_protein_v3_antibody_S2_L002_R1_001.fastq.gz: OK
pbmc_1k_protein_v3_antibody_S2_L002_R2_001.fastq.gz: OK
kb is able to generate a FASTA file containing all hamming distance < 2 variants of the feature barcodes and create a kallisto index of these sequences. But it in order to do so, we first need to prepare a TSV containing feature barcode sequences in the first column and the feature barcode names in the second.
First, we download the feature reference file provided by 10x Genomics.
[2021-03-31 23:30:15,970] INFO Generating mismatch FASTA at mismatch.fa
[2021-03-31 23:30:15,982] INFO Creating transcript-to-gene mapping at t2g.txt
[2021-03-31 23:30:15,986] INFO Indexing mismatch.fa to mismatch.idx
The following command will generate an RNA count matrix of cells (rows) by genes (columns) in H5AD format, which is a binary format used to store Anndata objects. Notice we are providing the index and transcript-to-gene mapping we generated in the previous step to the -i and -g arguments respectively. Also, these reads were generated with the 10x Genomics Chromium Single Cell v3 Chemistry, hence the -x 10xv3 argument. To view other supported technologies, run kb --list.
Note: If you would like a Loom file instead, replace the --h5ad flag with --loom. If you want to use the raw matrix output by kb instead of their H5AD or Loom converted files, omit these flags.
[2021-03-31 23:30:17,474] INFO Using index mismatch.idx to generate BUS file to . from
[2021-03-31 23:30:17,474] INFO pbmc_1k_protein_v3_antibody_S2_L001_R1_001.fastq.gz
[2021-03-31 23:30:17,474] INFO pbmc_1k_protein_v3_antibody_S2_L001_R2_001.fastq.gz
[2021-03-31 23:30:17,474] INFO pbmc_1k_protein_v3_antibody_S2_L002_R1_001.fastq.gz
[2021-03-31 23:30:17,474] INFO pbmc_1k_protein_v3_antibody_S2_L002_R2_001.fastq.gz
[2021-03-31 23:32:02,567] INFO Sorting BUS file ./output.bus to ./tmp/output.s.bus
[2021-03-31 23:32:18,183] INFO Whitelist not provided
[2021-03-31 23:32:18,183] INFO Copying pre-packaged 10XV3 whitelist to .
[2021-03-31 23:32:19,157] INFO Inspecting BUS file ./tmp/output.s.bus
[2021-03-31 23:32:31,284] INFO Correcting BUS records in ./tmp/output.s.bus to ./tmp/output.s.c.bus with whitelist ./10xv3_whitelist.txt
[2021-03-31 23:32:49,746] INFO Sorting BUS file ./tmp/output.s.c.bus to ./output.unfiltered.bus
[2021-03-31 23:33:01,416] INFO Generating count matrix ./counts_unfiltered/cells_x_features from BUS file ./output.unfiltered.bus
[2021-03-31 23:33:03,924] INFO Reading matrix ./counts_unfiltered/cells_x_features.mtx
[2021-03-31 23:33:05,698] INFO Writing matrix to h5ad ./counts_unfiltered/adata.h5ad
CPU times: user 1.3 s, sys: 180 ms, total: 1.48 s
Wall time: 2min 49s