This tutorial describes where to find single-cell RNA-seq data and how to download it for processing with the kallisto | bustools workflow.
Several databases serve as repositories for sequencing data and metadata and are relevant for obtaining single-cell RNA-seq data. For each archive we give an example of how the data is organized and how to download it.
Biological Project Library (BioProject): The Biological Project Library organizes metadata for research projects involving genomic data types. This repository, which was started in 2016, is similar to the Gene Expression Omnibus. As an example, the data from the paper Peng et al. 2019 is organized under project accession PRJCA001063. Each single-cell RNA-seq dataset has a “BioSample accession”, e.g. SAMC047103. A further link to the Genome Sequence Archive provides access to FASTQ files.
Genome Sequence Archive (GSA): This repository contains reads for projects in FASTQ format. For example, reads for SAMC047103 from project PRJCA001063 in the BioProject repository are accessible under accession CRA001160. A specific run accession, e.g. CRR034516, provides direct access to FASTQ files.
Gene Expression Omnibus (GEO): The Gene Expression Omnibus is a repository for MIAME (Minimum Information About a Microarray Experiment) compliant data. While the MIAME standards were established at a time when gene expression data was primarily collected with microarrays, the standards also apply to sequencing data, and the GEO repository hosts project metadata for both types of research projects. As an example, the project link for the paper Wolock et al. 2019 is GSE132151. Most papers refer to their data via GEO accessions, so GEO is a useful repository for searching for data from projects.
European Nucleotide Archive (ENA): The ENA provides access to nucleotide sequences associated with genomic projects. In the case of GSE132151 mentioned above, the nucleotide sequences are at PRJNA546231. The ENA provides direct access to FASTQ files from the project page. It also links to NCBI Sequence Read Archive format data.
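As a rough sketch of how the FASTQ links for a project such as PRJNA546231 could be listed programmatically, the Python below builds a query against the ENA Portal API file report endpoint and parses a response. The helper names are hypothetical, and the endpoint and field names (`read_run`, `fastq_ftp`) reflect the ENA Portal API as we understand it, so check them against the current ENA documentation before relying on them. The code only constructs the URL; fetching it is left to the reader.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed ENA Portal API endpoint for per-run file reports.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_filereport_url(accession, fields=("run_accession", "fastq_ftp")):
    """Build a file report URL listing FASTQ locations for an accession."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text):
    """Map each run accession to its list of FASTQ links."""
    runs = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # fastq_ftp is assumed to hold ';'-separated paths without a scheme.
        paths = [p for p in row["fastq_ftp"].split(";") if p]
        runs[row["run_accession"]] = ["ftp://" + p for p in paths]
    return runs


url = ena_filereport_url("PRJNA546231")
print(url)
```

The text returned by that URL (for instance via `urllib.request.urlopen(url).read().decode()`) can be passed to `parse_filereport` to obtain one list of FASTQ links per run.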
Sequence Read Archive (SRA): The SRA is a sequence repository for genomic data. Files are stored in SRA format and must be downloaded and converted to FASTQ format before pre-processing. The conversion is done with the fasterq-dump program, available as part of SRA tools. For example, the data in Rossi et al., 2019 can be located by going from GEO to the SRA and finally to a sequence data page for one of the runs, SRX5779290, which has information about the traces (reads). The SRA tools operate directly on SRA accessions.
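The conversion step can be sketched as a small Python helper that assembles a fasterq-dump command line. This is a minimal illustration, not a prescribed invocation: the helper name, output directory, and thread count are assumptions, and `SRR_PLACEHOLDER` is a stand-in that must be replaced with a real run accession.

```python
import shlex


def fasterq_dump_command(run_accession, outdir="fastq", threads=4):
    """Build a fasterq-dump command that writes split FASTQ files for one run."""
    return [
        "fasterq-dump",
        "--split-files",            # write each read of a spot to its own file
        "--threads", str(threads),
        "--outdir", outdir,
        run_accession,
    ]


# "SRR_PLACEHOLDER" is not a real accession; substitute the run of interest.
cmd = fasterq_dump_command("SRR_PLACEHOLDER")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once SRA tools is installed. Note that fasterq-dump writes uncompressed FASTQ files, so compress them separately if disk space is a concern.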
The sra-explorer website is an effective and easy-to-use utility for searching the SRA and for downloading files. The utility finds SRA entries by keywords or accession numbers and produces links to the FASTQs and commands for downloading them.
Single-cell RNA-seq data from sequence repositories can be streamed into kb, making possible a workflow that does not require saving files to disk prior to pre-processing. For example, FASTQ files can be passed to kb as URLs, in which case the data is streamed from the remote location.
Note: Streaming is not supported on Windows.
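A streaming run can be sketched as follows. The helper assembles a `kb count` command whose FASTQ arguments are URLs rather than local paths; the two URLs and the index name `index.idx` are taken from the log output at the end of this tutorial, while the helper name, the transcript-to-gene file name `t2g.txt`, and the thread count are assumptions made for illustration. The index and transcript-to-gene files are assumed to exist locally already (see "Download a pre-built mouse index").

```python
import shlex


def kb_count_command(fastqs, index="index.idx", t2g="t2g.txt",
                     technology="10xv2", threads=2, h5ad=True):
    """Build a kb count command; FASTQ entries may be local paths or URLs."""
    cmd = ["kb", "count",
           "-i", index,          # kallisto index
           "-g", t2g,            # transcript-to-gene mapping (assumed file name)
           "-x", technology,     # single-cell technology
           "-t", str(threads)]
    if h5ad:
        cmd.append("--h5ad")     # also write the count matrix as an h5ad file
    return cmd + list(fastqs)


# Remote FASTQ files as they appear in the log output of this tutorial.
fastq_urls = [
    "https://caltech.box.com/shared/static/w9ww8et5o029s2e3usjzpbq8lpot29rh.gz",
    "https://caltech.box.com/shared/static/ql00zyvqnpy7bf8ogdoe9zfy907guzy9.gz",
]

count_cmd = kb_count_command(fastq_urls)
print(shlex.join(count_cmd))
```

The printed line can be pasted into a terminal, or the list can be run with `subprocess.run(count_cmd, check=True)`.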
Download a pre-built mouse index
The only required file that must be locally stored on disk prior to pre-processing is the index, which is why we download it here.
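The download step can be sketched in the same style. The helper below assembles a `kb ref -d` command for a pre-built index; the index name `index.idx` matches the log output in this tutorial, whereas the helper name and the transcript-to-gene file name `t2g.txt` are assumptions.

```python
import shlex


def kb_ref_download_command(species="mouse", index="index.idx", t2g="t2g.txt"):
    """Build a kb ref command that downloads a pre-built index."""
    return ["kb", "ref",
            "-d", species,   # name of the pre-built index to download
            "-i", index,     # where to write the kallisto index
            "-g", t2g]       # where to write the transcript-to-gene mapping


ref_cmd = kb_ref_download_command()
print(shlex.join(ref_cmd))
```

As the log below indicates, the mouse index archive is about 1.89 GB, so the download can take a couple of minutes.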
[2021-03-31 19:33:26,151] INFO Downloading files for mouse from https://caltech.box.com/shared/static/vcaz6cujop0xuapdmz0pplp3aoqc41si.gz to tmp/vcaz6cujop0xuapdmz0pplp3aoqc41si.gz
100% 1.89G/1.89G [01:30<00:00, 22.4MB/s]
[2021-03-31 19:34:58,426] INFO Extracting files from tmp/vcaz6cujop0xuapdmz0pplp3aoqc41si.gz
CPU times: user 1.28 s, sys: 218 ms, total: 1.5 s
Wall time: 2min 6s
[2021-03-31 19:35:32,311] INFO Piping https://caltech.box.com/shared/static/w9ww8et5o029s2e3usjzpbq8lpot29rh.gz to ./tmp/w9ww8et5o029s2e3usjzpbq8lpot29rh.gz
[2021-03-31 19:35:32,313] INFO Piping https://caltech.box.com/shared/static/ql00zyvqnpy7bf8ogdoe9zfy907guzy9.gz to ./tmp/ql00zyvqnpy7bf8ogdoe9zfy907guzy9.gz
[2021-03-31 19:35:32,314] INFO Using index index.idx to generate BUS file to . from
[2021-03-31 19:35:32,314] INFO ./tmp/w9ww8et5o029s2e3usjzpbq8lpot29rh.gz
[2021-03-31 19:35:32,314] INFO ./tmp/ql00zyvqnpy7bf8ogdoe9zfy907guzy9.gz
[2021-03-31 19:38:44,775] INFO Sorting BUS file ./output.bus to ./tmp/output.s.bus
[2021-03-31 19:38:50,622] INFO Whitelist not provided
[2021-03-31 19:38:50,622] INFO Copying pre-packaged 10XV2 whitelist to .
[2021-03-31 19:38:50,752] INFO Inspecting BUS file ./tmp/output.s.bus
[2021-03-31 19:38:53,602] INFO Correcting BUS records in ./tmp/output.s.bus to ./tmp/output.s.c.bus with whitelist ./10xv2_whitelist.txt
[2021-03-31 19:38:55,827] INFO Sorting BUS file ./tmp/output.s.c.bus to ./output.unfiltered.bus
[2021-03-31 19:39:00,448] INFO Generating count matrix ./counts_unfiltered/cells_x_genes from BUS file ./output.unfiltered.bus
[2021-03-31 19:39:06,425] INFO Reading matrix ./counts_unfiltered/cells_x_genes.mtx
[2021-03-31 19:39:09,811] INFO Writing matrix to h5ad ./counts_unfiltered/adata.h5ad
CPU times: user 1.41 s, sys: 188 ms, total: 1.6 s
Wall time: 3min 39s