Download the genomic (DNA) FASTA and GTF annotations for your desired organism from the database of your choice. This tutorial uses mouse reference files downloaded from Ensembl.
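As a rough sketch, the download could be scripted as follows. The base URL, the directory layout, and the helper names `ensembl_urls` and `download` are assumptions made for illustration; release 98 is inferred from the GTF filename used later in this tutorial. Check the paths against the Ensembl FTP site before relying on them.

```python
from urllib.request import urlretrieve

# Assumed base URL of the Ensembl public FTP mirror (served over HTTP).
ENSEMBL_BASE = "http://ftp.ensembl.org/pub"


def ensembl_urls(release=98, species="mus_musculus", assembly="GRCm38"):
    """Build the (FASTA, GTF) URLs for one species, assuming Ensembl's usual layout."""
    name = species.capitalize()  # "mus_musculus" -> "Mus_musculus"
    fasta = (
        f"{ENSEMBL_BASE}/release-{release}/fasta/{species}/dna/"
        f"{name}.{assembly}.dna.primary_assembly.fa.gz"
    )
    gtf = (
        f"{ENSEMBL_BASE}/release-{release}/gtf/{species}/"
        f"{name}.{assembly}.{release}.gtf.gz"
    )
    return fasta, gtf


def download(urls, fetch=urlretrieve):
    """Save each URL into the working directory under its own filename."""
    paths = []
    for url in urls:
        filename = url.rsplit("/", 1)[-1]
        fetch(url, filename)
        paths.append(filename)
    return paths


# Usage (downloads several hundred MB, so it is left commented out):
# download(ensembl_urls())
```

The `fetch` parameter exists only so the download step can be swapped out, for example with a function that calls `wget` or one that skips files already present.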
kb automatically splits the genome into cDNA and intron FASTA files. Because Google Colab has limited memory, we split the index into parts (here, with -n 4). This lowers the peak memory that kb uses at the cost of a longer kb count runtime, which is a worthwhile tradeoff in a memory-constrained environment.
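A minimal sketch of the corresponding kb ref invocation, assembled in Python, is shown here. The output filenames are the ones that appear in the log for this step. The flag names (`-f1`, `-f2`, `-c1`, `-c2`, `--workflow lamanno`, `-n`) are assumed from the kb-python release current when this tutorial was written, and newer releases may rename or drop some of them, so confirm with `kb ref --help`. The helper names `kb_ref_command` and `run_kb_ref` are hypothetical.

```python
import subprocess


def kb_ref_command(fasta, gtf, n_parts=4):
    """Return the argument list for building a split cDNA + intron index."""
    return [
        "kb", "ref",
        "-i", "index.idx",        # index output (written as index.idx.0, index.idx.1, ...)
        "-g", "t2g.txt",          # transcript-to-gene mapping
        "-f1", "cdna.fa",         # cDNA FASTA written by kb
        "-f2", "intron.fa",       # intron FASTA written by kb
        "-c1", "cdna_t2c.txt",    # cDNA transcripts-to-capture
        "-c2", "intron_t2c.txt",  # intron transcripts-to-capture
        "--workflow", "lamanno",  # assumed name of the cDNA + intron workflow
        "-n", str(n_parts),       # split the index to reduce peak memory
        fasta,
        gtf,
    ]


def run_kb_ref(fasta, gtf, n_parts=4):
    """Run kb ref; requires kb-python to be installed and on PATH."""
    subprocess.run(kb_ref_command(fasta, gtf, n_parts), check=True)


cmd = kb_ref_command(
    "Mus_musculus.GRCm38.dna.primary_assembly.fa.gz",
    "Mus_musculus.GRCm38.98.gtf.gz",
)
print(" ".join(cmd))
```

Building the argument list separately from running it makes it easy to inspect the exact command before committing to a multi-hour indexing job.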
[2020-01-16 03:31:03,222] INFO Preparing Mus_musculus.GRCm38.dna.primary_assembly.fa.gz, Mus_musculus.GRCm38.98.gtf.gz
[2020-01-16 03:31:03,222] INFO Decompressing Mus_musculus.GRCm38.dna.primary_assembly.fa.gz to tmp
[2020-01-16 03:31:30,853] INFO Sorting tmp/Mus_musculus.GRCm38.dna.primary_assembly.fa to /content/tmp/tmpl3plby1k
[2020-01-16 03:38:59,002] INFO Decompressing Mus_musculus.GRCm38.98.gtf.gz to tmp
[2020-01-16 03:39:03,235] INFO Sorting tmp/Mus_musculus.GRCm38.98.gtf to /content/tmp/tmp7ebqamug
[2020-01-16 03:40:00,940] INFO Splitting genome tmp/Mus_musculus.GRCm38.dna.primary_assembly.fa into cDNA at /content/tmp/tmp1zp3oo9w
[2020-01-16 03:40:00,940] WARNING The following chromosomes were found in the FASTA but doens't have any "transcript" features in the GTF: GL456213.1, JH584302.1, GL456392.1, JH584300.1, GL456396.1, GL456383.1, GL456389.1, GL456379.1, GL456378.1, GL456359.1, JH584301.1, GL456366.1, GL456360.1, GL456370.1, GL456368.1, GL456390.1, GL456393.1, GL456387.1, GL456382.1, GL456394.1, GL456367.1. No sequences will be generated for these chromosomes.
[2020-01-16 03:41:14,163] INFO Wrote 142446 cDNA transcripts
[2020-01-16 03:41:14,168] INFO Creating cDNA transcripts-to-capture at /content/tmp/tmpxjcopm_m
[2020-01-16 03:41:15,248] INFO Splitting genome into introns at /content/tmp/tmpmo19l2ry
[2020-01-16 03:45:44,829] INFO Wrote 647972 intron sequences
[2020-01-16 03:45:44,836] INFO Creating intron transcripts-to-capture at /content/tmp/tmprztg9wry
[2020-01-16 03:46:51,138] INFO Concatenating 1 cDNA FASTAs to cdna.fa
[2020-01-16 03:46:55,954] INFO Concatenating 1 cDNA transcripts-to-captures to cdna_t2c.txt
[2020-01-16 03:46:56,028] INFO Concatenating 1 intron FASTAs to intron.fa
[2020-01-16 03:47:46,054] INFO Concatenating 1 intron transcripts-to-captures to intron_t2c.txt
[2020-01-16 03:47:46,396] INFO Concatenating cDNA and intron FASTAs to /content/tmp/tmpn9ihglyn
[2020-01-16 03:49:08,305] INFO Creating transcript-to-gene mapping at t2g.txt
[2020-01-16 03:50:17,445] INFO Splitting /content/tmp/tmpn9ihglyn into 8 parts
[2020-01-16 03:51:11,012] INFO Indexing /content/tmp/tmphffimvvn to index.idx.0
[2020-01-16 04:09:50,394] INFO Indexing /content/tmp/tmpc8mhcmkw to index.idx.1
[2020-01-16 04:25:30,897] INFO Indexing /content/tmp/tmpcpy9f0cj to index.idx.2
[2020-01-16 04:41:43,273] INFO Indexing /content/tmp/tmp6bmc3x9s to index.idx.3
[2020-01-16 04:57:03,671] INFO Indexing /content/tmp/tmpj3nyxnh1 to index.idx.4
[2020-01-16 05:11:56,815] INFO Indexing /content/tmp/tmpwg6y882o to index.idx.5
[2020-01-16 05:27:03,548] INFO Indexing /content/tmp/tmp00l853b4 to index.idx.6
[2020-01-16 05:42:58,221] INFO Indexing /content/tmp/tmpijfn8tph to index.idx.7
CPU times: user 50.3 s, sys: 6.14 s, total: 56.4 s
Wall time: 2h 27min 11s