Processing Multiple Lanes at Once

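This tutorial shows how to pre-process the mouse T cell dataset SRR8206317 from Miller & Sen et al., 2019 using the kallisto | bustools workflow. The reads for this dataset are split across several sequencing lanes, and all of the lane files are passed to a single kb count command.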

Download the data

!wget -q

Install kb and bamtofastq

We use bamtofastq to regenerate the original FASTQ files from the BAM file provided by the authors.

!pip install --quiet kb-python
!wget -q
!chmod +x bamtofastq-1.2.0

Download a pre-built mouse index

!kb ref -d mouse -i index.idx -g t2g.txt
[2021-03-31 23:49:33,545]    INFO Downloading files for mouse from to tmp/vcaz6cujop0xuapdmz0pplp3aoqc41si.gz
100% 1.89G/1.89G [01:26<00:00, 23.4MB/s]
[2021-03-31 23:51:01,788]    INFO Extracting files from tmp/vcaz6cujop0xuapdmz0pplp3aoqc41si.gz
CPU times: user 1.29 s, sys: 298 ms, total: 1.58 s
Wall time: 2min 1s

Generate the FASTQs from the BAM file

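Use the bamtofastq utility to generate the FASTQs. Setting --reads-per-fastq to a large value keeps the reads of each lane in one file per read type instead of splitting them into smaller chunks.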

!./bamtofastq-1.2.0 --reads-per-fastq=500000000 d10_Tet_possorted_genome_bam.bam ./fastqs
bamtofastq v1.2.0
Args { arg_bam: "d10_Tet_possorted_genome_bam.bam", arg_output_path: "./fastqs", flag_nthreads: 4, flag_locus: None, flag_bx_list: None, flag_reads_per_fastq: 500000000, flag_gemcode: false, flag_lr20: false, flag_cr11: false }
Writing finished.  Observed 85992089 read pairs. Wrote 85992089 read pairs
CPU times: user 4.46 s, sys: 491 ms, total: 4.95 s
Wall time: 12min 12s
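
bamtofastq writes one R1/R2 pair per lane, and kb count expects the files listed as R1 followed by R2 for each lane. The following is a minimal sketch, not part of the original workflow, that collects the lane files and puts them in that order. The helper names ordered_read_pairs and kb_count_command are hypothetical, and the file-name pattern is an assumption based on the bamtofastq_S1_L001_R1_001.fastq.gz naming used in this tutorial.

```python
import re
import shlex
from pathlib import Path

# Assumed naming scheme: ..._L<lane>_<R1|R2>_<chunk>.fastq.gz
FASTQ_NAME = re.compile(
    r"_L(?P<lane>\d{3})_(?P<read>R[12])_(?P<chunk>\d{3})\.fastq\.gz$"
)


def ordered_read_pairs(paths):
    """Return FASTQ paths ordered as R1, R2 for every lane and chunk."""
    groups = {}
    for path in map(Path, paths):
        match = FASTQ_NAME.search(path.name)
        if match is None:
            continue  # skip index reads (I1) and unrelated files
        key = (str(path.parent), match["lane"], match["chunk"])
        groups.setdefault(key, {})[match["read"]] = str(path)

    ordered = []
    for key in sorted(groups):
        reads = groups[key]
        if set(reads) != {"R1", "R2"}:
            raise ValueError(f"incomplete read pair for {key}")
        ordered += [reads["R1"], reads["R2"]]
    return ordered


def kb_count_command(fastqs, index="index.idx", t2g="t2g.txt",
                     tech="10xv2", out="output", threads=2):
    """Build a kb count command line (hypothetical helper) as one string."""
    args = ["kb", "count", "-i", index, "-g", t2g, "-x", tech,
            "-o", out, "-t", str(threads), "--h5ad"]
    return shlex.join(args + list(fastqs))


if __name__ == "__main__" and Path("fastqs").is_dir():
    # Print the command for every lane found under ./fastqs
    print(kb_count_command(ordered_read_pairs(Path("fastqs").rglob("*.fastq.gz"))))
```

Raising an error on a lane with a missing mate is a deliberate choice in this sketch: a silently dropped file would shift the R1/R2 pairing for every file listed after it.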

Generate an RNA count matrix in H5AD Format

The following command generates an RNA count matrix of cells (rows) by genes (columns) in H5AD format, a binary format used to store AnnData objects. The index and transcript-to-gene mapping downloaded in the previous step are passed to the -i and -g arguments, respectively. These reads were generated with the 10x Genomics Chromium Single Cell v2 chemistry, hence the -x 10xv2 argument. To view other supported technologies, run kb --list. The FASTQs from all four lanes are listed in a single command, as R1 followed by R2 for each lane.

Note: If you would like a Loom file instead, replace the --h5ad flag with --loom. If you want to use the raw matrix output by kb instead of its H5AD or Loom converted files, omit these flags.

!kb count -i index.idx -g t2g.txt -x 10xv2 -o output -t 2 --h5ad \
fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L001_R1_001.fastq.gz \
fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L001_R2_001.fastq.gz \
fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L002_R1_001.fastq.gz \
fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L002_R2_001.fastq.gz \
fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L003_R1_001.fastq.gz \
fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L003_R2_001.fastq.gz \
fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L004_R1_001.fastq.gz \
fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L004_R2_001.fastq.gz
[2021-04-01 00:03:48,050]    INFO Using index index.idx to generate BUS file to output from
[2021-04-01 00:03:48,051]    INFO         fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L001_R1_001.fastq.gz
[2021-04-01 00:03:48,051]    INFO         fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L001_R2_001.fastq.gz
[2021-04-01 00:03:48,051]    INFO         fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L002_R1_001.fastq.gz
[2021-04-01 00:03:48,051]    INFO         fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L002_R2_001.fastq.gz
[2021-04-01 00:03:48,051]    INFO         fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L003_R1_001.fastq.gz
[2021-04-01 00:03:48,051]    INFO         fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L003_R2_001.fastq.gz
[2021-04-01 00:03:48,051]    INFO         fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L004_R1_001.fastq.gz
[2021-04-01 00:03:48,051]    INFO         fastqs/MAR1_POOL_2_10_tet_SI-GA-C8_MissingLibrary_1_HJNKJBGX5/bamtofastq_S1_L004_R2_001.fastq.gz
[2021-04-01 00:10:48,408]    INFO Sorting BUS file output/output.bus to output/tmp/output.s.bus
[2021-04-01 00:12:03,976]    INFO Whitelist not provided
[2021-04-01 00:12:03,976]    INFO Copying pre-packaged 10XV2 whitelist to output
[2021-04-01 00:12:04,105]    INFO Inspecting BUS file output/tmp/output.s.bus
[2021-04-01 00:12:22,834]    INFO Correcting BUS records in output/tmp/output.s.bus to output/tmp/output.s.c.bus with whitelist output/10xv2_whitelist.txt
[2021-04-01 00:12:39,443]    INFO Sorting BUS file output/tmp/output.s.c.bus to output/output.unfiltered.bus
[2021-04-01 00:13:57,946]    INFO Generating count matrix output/counts_unfiltered/cells_x_genes from BUS file output/output.unfiltered.bus
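
A quick sanity check on the result is to confirm that the matrix really is cells (rows) by genes (columns). The sketch below reads only the size line of the Matrix Market file and compares it with the number of barcodes and genes. It assumes the raw output lives in output/counts_unfiltered as cells_x_genes.mtx, cells_x_genes.barcodes.txt and cells_x_genes.genes.txt; these names may differ between kb versions. The helper names read_mtx_shape and count_lines are hypothetical.

```python
from pathlib import Path


def read_mtx_shape(path):
    """Return (rows, columns, nonzero entries) from a Matrix Market header."""
    with open(path, "rt") as handle:
        banner = handle.readline()
        if not banner.startswith("%%MatrixMarket"):
            raise ValueError(f"{path} is not a Matrix Market file")
        for line in handle:
            if line.startswith("%") or not line.strip():
                continue  # skip comment and blank lines
            rows, cols, nnz = (int(field) for field in line.split())
            return rows, cols, nnz
    raise ValueError(f"{path} has no size line")


def count_lines(path):
    """Count the lines in a text file, such as a barcode or gene list."""
    with open(path, "rt") as handle:
        return sum(1 for _ in handle)


# Assumed output location; the check is skipped if the file is absent.
matrix_dir = Path("output/counts_unfiltered")
if (matrix_dir / "cells_x_genes.mtx").exists():
    n_cells, n_genes, n_entries = read_mtx_shape(matrix_dir / "cells_x_genes.mtx")
    assert n_cells == count_lines(matrix_dir / "cells_x_genes.barcodes.txt")
    assert n_genes == count_lines(matrix_dir / "cells_x_genes.genes.txt")
    print(f"{n_cells} barcodes x {n_genes} genes, {n_entries} nonzero entries")
```

Reading only the header keeps the check cheap even for large matrices, since no count values are loaded into memory.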

Load the count matrices into a notebook

See the getting started tutorial for how to load the count matrices into Scanpy for analysis.