Last updated on 2024-09-26

When I first started learning bioinformatics, I had a common question: “Where can I find datasets to practice my skills?” We need realistic datasets that mirror the challenges we’ll encounter in real-life research or projects. In this guide, I’ll walk you through downloading datasets from NCBI using two powerful tools: entrez-direct and sra-tools. Additionally, NCBI offers a tool called datasets for downloading large genome datasets, which might be helpful for some users.

Before we begin, you’ll need the following:

  • Ubuntu (or WSL on Windows)

  • (Mini)conda or Mamba (for package management)

  • Basic knowledge of Linux/Unix command-line tools

  • Familiarity with regular expressions

  • Understanding of NCBI search parameters

Step 1: Install Entrez-Direct and SRA-Tools

First, install the required tools using Mamba (Mamba is my preference due to its speed):

mamba install -c bioconda entrez-direct sra-tools

Step 2: Search and Fetch Data with Entrez-Direct

Once installed, we’ll use esearch (from entrez-direct) to query the NCBI database. To learn more about available search fields, refer to the NCBI search manual.

In this example, I’ll search the bioproject database for metagenomes related to Vietnam using NCBI’s taxonomy ID system:

esearch -db bioproject -query "txid408169[Organism] AND Vietnam[Title]" \
| efetch -format native \
| awk 'BEGIN {RS="\n"; ORS="\t"} {print}' \
| sed -E '1 s/^\t//;s/\t{3}/\n/g' \
| sed -E 's/^[0-9]*\. //' \
| awk 'BEGIN {FS=OFS="\t"} {match($0, /BioProject Accession: (.*)\tID: (.*)/, array); print $1, array[1], array[2]}'

This command searches for metagenomes (taxonomy ID txid408169) in Vietnam. The result is piped through efetch, formatted as native, and processed with awk and sed to generate a table with three columns:

  • BioProject Title (Study Name)

  • BioProject Accession

  • BioProject ID (needed for retrieving reads)

Step 3: Downloading Data from SRA

Let’s say you found a BioProject related to Norovirus with the accession PRJDB5922. To retrieve sequence data from this project, we query the SRA (Sequence Read Archive):

esearch -db sra -query "PRJDB5922" \
| efetch -format runinfo \
| awk 'BEGIN {RS=","; ORS="\t"} {print}'

This will output a table with details such as the sequencing platform, source type (e.g., RNA), and more. Now, let’s download the first three datasets from this project:

esearch -db sra -query "PRJDB5922" \
| efetch -format runinfo \
| awk 'BEGIN {FS=OFS="\t"} NR > 1 {if ($1 != "") print $1}' \
| head -n 3 \
| xargs -n 1 -P 4 fastq-dump --gzip --split-files --skip-technical --split-spot

Here, we select the first three datasets using awk and head. The fastq-dump command downloads the FASTQ files with the following options:

  • --gzip: Compress output files

  • --split-files: Split paired-end reads into separate files

  • --skip-technical: Skip technical reads

  • --split-spot: Split reads based on spot layout

Additional Tips

Retrieving Datasets from Other Databases

You can also retrieve protein, nucleotide, or other datasets by specifying a different database (-db). For example:

  • To get RNA-dependent RNA polymerase (RdRp) sequences from the RefSeq database for viruses:
esearch -db nuccore -query "RdRp[GENE] AND txid10239[ORGN] AND RefSeq[FILT]" \
| efetch -format fasta
  • To retrieve beta toxin (cpb) sequences from C. perfringens:
esearch -db nuccore -query "cpb[GENE] AND txid1502[ORGN]" \
| efetch -format fasta

Checking Available Databases and Fields

  • To list all available databases, run: einfo -dbs

annotinfo, assembly, biocollections, bioproject, biosample, blastdbinfo, books, cdd, clinvar, dbvar, gap, gapplus, gds, gene, genome, geoprofiles, grasp, gtr, homologene, ipg, medgen, mesh, nlmcatalog, nuccore, nucleotide, omim, orgtrack, pcassay, pccompound, pcsubstance, pmc, popset, protein, proteinclusters, protfam, pubmed, seqannot, snp, sra, structure, taxonomy

  • To view searchable fields for a specific database (e.g., nuccore), use: einfo -db nuccore | xtract -pattern Field -element Name Description. Here’s a small snippet of what you’ll find:
Name Description
ALL All terms from all searchable fields
UID Unique number assigned to each sequence
FILT Limits the records
WORD Free text associated with record
TITL Words in definition line

Example

To download RefSeq assemblies for C. perfringens, use:

esearch -db assembly -query "txid10239[ORGN]" \
| efetch -format docsum \
| xtract -pattern DocumentSummary -element FtpPath_RefSeq \
| xargs -P 3 -I {} bash -c 'name=$(basename {}); curl -o "${name}_genomic.fna.gz" {}/"${name}_genomic.fna.gz"'

Or download the .gbff files for Flavivirus:

esearch -db assembly -query "txid11051[ORGN]" \
| efetch -format docsum \
| xtract -pattern DocumentSummary -element FtpPath_RefSeq \
| sed 's/$/\/*genomic.gbff.gz/' \
| xargs -P 3 wget -c -nd

Conclusion

For bioinformatics professionals, tools like entrez-direct and sra-tools are invaluable for efficiently downloading large datasets. Combined with other Linux command-line utilities, they can be powerful components in automation workflows.