This tutorial provides information on where to find single-cell RNA-seq data, and how to download it for processing with the kallisto | bustools workflow.

Note: in the instructions, commands to be typed at the command line are preceded by $. For example, if you see $ cd my_folder, then type cd my_folder.

Databases

There are five databases that are important repositories for sequencing data and metadata, and that are relevant for obtaining single-cell RNA-seq data. For each one we provide an example of how the data is organized and how to download it.

  • Biological Project Library (BioProject): The Biological Project Library organizes metadata for research projects involving genomic data types. This repository, which was started in 2016, is similar to the Gene Expression Omnibus. As an example, the data from the paper Peng et al. 2019 is organized under project accession PRJCA001063. Each single-cell RNA-seq dataset has a “BioSample accession”, e.g. SAMC047103. A further link to the Genome Sequence Archive provides access to FASTQ files.

  • Genome Sequence Archive (GSA): This repository contains reads for projects in FASTQ format. For example, reads for SAMC047103 from project PRJCA001063 in the BioProject repository are accessible under accession CRA001160. A specific run accession, e.g. CRR034516, provides direct access to FASTQ files.

  • Gene Expression Omnibus (GEO): The Gene Expression Omnibus is a repository for MIAME (Minimum Information About a Microarray Experiment) compliant data. While the MIAME standards were established at a time when gene expression data was primarily collected with microarrays, the standards also apply to sequencing data, and the GEO repository hosts project metadata for both types of research projects. As an example, the project link for the paper Wolock et al. 2019 is GSE132151. Most papers refer to their data via GEO accessions, so GEO is a useful repository for searching for data from projects.

  • European Nucleotide Archive (ENA): The ENA provides access to nucleotide sequences associated with genomic projects. In the case of GSE132151 mentioned above, the nucleotide sequences are at PRJNA546231. The ENA provides direct access to FASTQ files from the project page. It also links to NCBI Sequence Read Archive format data.

  • Sequence Read Archive (SRA): The SRA is a sequence repository for genomic data. Files are stored in SRA format; prior to pre-processing they must be downloaded and converted to FASTQ format using the fasterq-dump program, which is available as part of SRA tools. For example, the data in Rossi et al., 2019 can be located by following the link from GEO to the SRA, where the sequence data page for one of the runs, SRX5779290, has information about the traces (reads). The SRA tools operate directly on SRA accessions.
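
The SRA-to-FASTQ conversion described in the last item can be sketched in shell as follows. This is a minimal sketch, assuming the SRA tools (prefetch and fasterq-dump) are installed and on the PATH. The function name sra_to_fastq, the DRY_RUN switch, the thread count, and the output directory are illustrative choices and not part of SRA tools; SRR8599150 is used only as an example run accession.

```shell
# sra_to_fastq ACCESSION [OUTDIR]
# Download an SRA run and convert it to FASTQ files in OUTDIR (default: fastq).
# sra_to_fastq and DRY_RUN are hypothetical names used for illustration.
sra_to_fastq() {
  local acc="$1"
  local outdir="${2:-fastq}"
  local run=""
  # With DRY_RUN=1 the commands are printed instead of executed.
  if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; fi

  # Fetch the run in SRA format.
  $run prefetch "$acc"
  # Convert to FASTQ, writing one file per read (e.g. barcode/UMI and cDNA).
  $run fasterq-dump --split-files --threads 4 --outdir "$outdir" "$acc"
}

# Usage (needs SRA tools and network access):
#   sra_to_fastq SRR8599150 fastq_out
# Preview the commands without running them:
#   DRY_RUN=1 sra_to_fastq SRR8599150 fastq_out
```

fasterq-dump writes uncompressed FASTQ files. For some datasets the barcode read may be stored as a technical read, in which case the --include-technical option of fasterq-dump may be needed; whether this applies depends on how the data was submitted.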

Searching

The sra-explorer website is an effective and easy-to-use utility for searching the SRA and for downloading files. The utility finds SRA entries by keywords or accession numbers and produces links to the FASTQs and commands for downloading them.
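
For scripted lookups, a similar accession-to-FASTQ mapping can be obtained from the command line. The sketch below does not use sra-explorer; it builds a query for the ENA Portal API file report instead, assuming that endpoint and the field names run_accession and fastq_ftp are available as documented by the ENA. The function name ena_filereport_url is illustrative, and PRJNA546231 is the project accession mentioned earlier.

```shell
# ena_filereport_url ACCESSION
# Print an ENA Portal API URL that reports, as a TSV table, the runs
# belonging to ACCESSION together with the locations of their FASTQ files.
# ena_filereport_url is a hypothetical helper name.
ena_filereport_url() {
  local acc="$1"
  printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=run_accession,fastq_ftp&format=tsv\n' "$acc"
}

# Usage (needs network access):
#   curl -Ls "$(ena_filereport_url PRJNA546231)"
```

Each row of the report corresponds to one run; the FASTQ locations in it can then be passed to curl or wget.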

Streaming

Single-cell RNA-seq data from sequence repositories can be streamed into kallisto, which makes possible a workflow that does not require saving files to disk prior to pre-processing. This can be done via process substitution or using named pipes (mkfifo). For example, the following commands stream FASTQ files from remote URLs for pre-processing:

$ urlR1="https://github.com/bustools/getting_started/releases/download/getting_started/SRR8599150_S1_L001_R1_001.fastq.gz"
$ urlR2="https://github.com/bustools/getting_started/releases/download/getting_started/SRR8599150_S1_L001_R2_001.fastq.gz"
$ time kallisto bus -i Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 -t 4 -o bus_out/ <(curl -Ls "${urlR1}") <(curl -Ls "${urlR2}")

The only file that must be stored locally on disk prior to pre-processing is the transcriptome index Mus_musculus.GRCm38.cdna.all.idx. A complete tutorial on how to stream data, together with code based on mkfifo, is available here.
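
The idea behind streaming with mkfifo can be sketched as follows. This is not the code from the linked tutorial but a minimal illustration under stated assumptions: the function name run_with_fifos and the FETCH variable are hypothetical, and curl is assumed to be installed. Two named pipes are created, a download is started in the background for each, and the given command is run with the two pipe paths appended as its last arguments.

```shell
# run_with_fifos SRC1 SRC2 CMD [ARGS...]
# Create two named pipes, fill them in the background from SRC1 and SRC2,
# and run CMD ARGS... PIPE1 PIPE2 so that CMD reads the streams like files.
# FETCH is a hypothetical knob for the download command (default: curl -Ls).
run_with_fifos() {
  local src1="$1" src2="$2"
  shift 2
  local tmpdir pid1 pid2 status=0
  tmpdir=$(mktemp -d) || return 1
  mkfifo "$tmpdir/R1.fastq.gz" "$tmpdir/R2.fastq.gz" || return 1

  # Each writer blocks until the reader opens the corresponding pipe.
  ${FETCH:-curl -Ls} "$src1" > "$tmpdir/R1.fastq.gz" &
  pid1=$!
  ${FETCH:-curl -Ls} "$src2" > "$tmpdir/R2.fastq.gz" &
  pid2=$!

  # Run the consumer with the two pipes as its final arguments.
  "$@" "$tmpdir/R1.fastq.gz" "$tmpdir/R2.fastq.gz" || status=$?

  # Stop writers that are still blocked (e.g. if CMD failed early), then clean up.
  kill "$pid1" "$pid2" 2>/dev/null || true
  wait "$pid1" "$pid2" 2>/dev/null || true
  rm -rf "$tmpdir"
  return "$status"
}

# Usage with the URLs and index from the example above
# (needs kallisto, curl, and network access):
#   run_with_fifos "$urlR1" "$urlR2" \
#     kallisto bus -i Mus_musculus.GRCm38.cdna.all.idx -x 10xv2 -t 4 -o bus_out/
```

Unlike process substitution, which is a feature of shells such as bash, named pipes are ordinary filesystem entries, so this approach also works with programs or shells that only accept file paths.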