Bioinformatics is often taught as a collection of tools.

BLAST, BWA, minimap2, STAR, Nextflow, Snakemake, Scanpy, DESeq2, Seurat.

Yet one of the most important practical skills receives surprisingly little attention:

Finding the right dataset.

This handbook focuses on where biological data comes from, how major repositories relate to each other, how to evaluate datasets before downloading them, and how to build reproducible workflows around public data.


Table of Contents

Foundations

  1. Why Public Datasets Matter
  2. The Biological Data Ecosystem
  3. Common Accession Types
  4. Essential Tools

NCBI

  1. The NCBI Ecosystem
  2. Searching NCBI Efficiently
  3. Entrez Direct Cookbook
  4. BioProject → BioSample → SRA → Assembly
  5. Downloading Sequencing Reads
  6. Downloading Assemblies
  7. Metadata Mining

Major Repositories

  1. ENA
  2. GEO
  3. GEOquery (R)
  4. ArrayExpress
  5. PRIDE
  6. MassIVE
  7. CELLxGENE
  8. Human Cell Atlas
  9. GTEx
  10. GDC and TCGA
  11. UK Biobank
  12. dbGaP
  13. EGA

Domain Guides

  1. Domain Guide: RNA-seq
  2. Domain Guide: Single Cell
  3. Domain Guide: GWAS
  4. Domain Guide: Cancer Genomics
  5. Domain Guide: Proteomics
  6. Domain Guide: Metagenomics
  7. Domain Guide: Virology
  8. Domain Guide: Aquaculture

Modern Data Access

  1. Cloud-Native Data Access

Best Practices

  1. Dataset Evaluation Checklist
  2. Reproducibility Checklist
  3. Common Pitfalls

Appendix

  1. Fifty Useful Search Ideas

Why Public Datasets Matter

Public datasets allow you to:

  • Learn bioinformatics without generating data
  • Benchmark methods
  • Reproduce publications
  • Build portfolios
  • Validate pipelines
  • Train machine learning models
  • Perform meta-analysis

Many influential bioinformatics papers are based entirely on public datasets.


The Biological Data Ecosystem

A useful mental model:

Publication
    ↓
BioProject
    ↓
BioSample
    ↓
Raw Reads (SRA)
    ↓
Assembly
    ↓
Annotation
    ↓
Downstream Analysis

Many beginners download FASTQ files immediately.

Experienced researchers often spend more time evaluating metadata than downloading sequences.


Common Accession Types

Prefix  Meaning
PRJNA   NCBI BioProject
PRJEB   ENA BioProject
SAMN    NCBI BioSample
SAMEA   ENA BioSample
SRR     NCBI SRA Run
ERR     ENA Run
DRR     DDBJ Run
GCF     RefSeq Assembly
GCA     GenBank Assembly
GSE     GEO Series
GSM     GEO Sample
PXD     ProteomeXchange Dataset (PRIDE, MassIVE)
EGAS    EGA Study
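
The prefixes in the table above are regular enough to check mechanically. The following is a minimal sketch, assuming prefix matching alone is good enough for a first sanity check; classify_accession is a hypothetical helper name, not part of any NCBI or EBI tool.

```shell
# classify_accession: hypothetical helper, not part of any NCBI or EBI tool.
# It maps an accession to a record type using only its prefix.
classify_accession() {
    case "$1" in
        PRJNA*) echo "NCBI BioProject" ;;
        PRJEB*) echo "ENA BioProject" ;;
        SAMN*)  echo "NCBI BioSample" ;;
        SAMEA*) echo "ENA BioSample" ;;
        SRR*)   echo "SRA Run" ;;
        ERR*)   echo "ENA Run" ;;
        DRR*)   echo "DDBJ Run" ;;
        GCF_*)  echo "RefSeq Assembly" ;;
        GCA_*)  echo "GenBank Assembly" ;;
        GSE*)   echo "GEO Series" ;;
        GSM*)   echo "GEO Sample" ;;
        PXD*)   echo "ProteomeXchange Dataset" ;;
        EGAS*)  echo "EGA Study" ;;
        *)      echo "unknown" ;;
    esac
}
```

Calling classify_accession on every line of an accession list is a quick way to spot entries that were pasted from the wrong repository.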

Essential Tools

mamba install -c conda-forge -c bioconda \
    entrez-direct \
    sra-tools \
    ncbi-datasets-cli \
    seqkit \
    csvtk \
    pigz \
    jq

Tool                Purpose
Entrez Direct       NCBI search and retrieval
SRA Tools           Read downloads
ncbi-datasets-cli   Assembly downloads (provides the datasets command)
seqkit              FASTA/FASTQ manipulation
csvtk               CSV/TSV tables
jq                  JSON
pigz                Parallel gzip compression

The NCBI Ecosystem

BioProject

Represents a study.

Example:

PRJNA123456

Contains:

  • Study description
  • Publications
  • Linked samples
  • Linked runs

BioSample

Represents individual samples.

Contains:

  • Host
  • Tissue
  • Country
  • Collection date
  • Isolation source

SRA

Stores raw sequencing reads.

Contains:

  • Reads (stored in SRA format, converted to FASTQ with SRA Tools)
  • Instrument metadata
  • Library preparation information

Assembly

Stores assembled genomes.

Contains:

  • FASTA
  • GFF
  • Protein sequences
  • Annotation files

Searching NCBI Efficiently

Find metagenomes:

esearch -db bioproject \
-query "metagenome"

Find Vietnamese metagenomes:

esearch -db bioproject \
-query "metagenome AND Vietnam"

Find viral assemblies:

esearch -db assembly \
-query "txid10239[ORGN]"

Count results:

esearch -db assembly \
-query "txid10239[ORGN]" \
| xtract -pattern ENTREZ_DIRECT -element Count
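
Queries like the ones above are easier to reuse when the taxon filter and the free-text terms are assembled in one place. The sketch below assumes that joining every term with AND is what you want; taxon_query is a hypothetical helper name, and it only builds the query string without contacting NCBI.

```shell
# taxon_query: hypothetical helper that builds an Entrez query string.
# First argument: NCBI taxonomy ID. Remaining arguments: extra terms joined with AND.
taxon_query() {
    local taxid="$1"
    shift
    local query="txid${taxid}[ORGN]"
    local term
    for term in "$@"; do
        query="${query} AND ${term}"
    done
    printf '%s\n' "$query"
}
```

The result can be passed straight to esearch, for example: esearch -db assembly -query "$(taxon_query 10239 Vietnam)".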

Entrez Direct Cookbook

List Available Databases

einfo -dbs

View Search Fields

einfo -db nuccore \
| xtract -pattern Field \
-element Name Description

Fetch FASTA

esearch -db nuccore \
-query "txid10239[ORGN]" \
| efetch -format fasta

Fetch GenBank

esearch -db nuccore \
-query "txid10239[ORGN]" \
| efetch -format gb

Both queries match every viral record in nuccore, which runs to millions of sequences. Narrow the query before fetching, for example by adding a species name.

BioProject → BioSample → SRA → Assembly

This relationship explains most public sequencing datasets.

BioProject
 ↓
BioSample
 ↓
SRA
 ↓
FASTQ

Or:

BioProject
 ↓
Assembly
 ↓
Genome FASTA

Once you know which level an accession belongs to, you know which database to query and what kind of file to expect.
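
The links between these levels can be read from an SRA run table. The sketch below assumes a CSV produced by a command such as esearch -db sra -query PRJNA123456 | efetch -format runinfo, with columns named Run, BioSample and BioProject. The helper name runinfo_links is hypothetical, and the comma split does not handle quoted fields that contain commas.

```shell
# runinfo_links: hypothetical helper that prints BioProject, BioSample and Run
# (tab-separated) for each row of a runinfo-style CSV read from standard input.
# Assumption: the header contains columns named Run, BioSample and BioProject.
runinfo_links() {
    awk -F ',' '
        NR == 1 {
            # Locate the three columns by name rather than by position.
            for (i = 1; i <= NF; i++) {
                if ($i == "Run") run = i
                if ($i == "BioSample") sample = i
                if ($i == "BioProject") project = i
            }
            next
        }
        # Skip blank lines and any repeated header rows.
        run && sample && project && $run != "" && $run != "Run" {
            printf "%s\t%s\t%s\n", $project, $sample, $run
        }'
}
```

Usage: runinfo_links < runinfo.csv > links.tsv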


Downloading Sequencing Reads

Recommended workflow:

prefetch SRR12345678

Convert:

fasterq-dump \
    -e 8 \
    -p \
    SRR12345678

Compress:

pigz *.fastq

For large projects, download several runs in parallel (runs.txt holds one accession per line):

xargs -P 4 -n 1 prefetch < runs.txt
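
A run list that was assembled by hand often contains blank lines, duplicates or Windows line endings, and each of those wastes a download slot. The sketch below is one way to clean the list first; clean_run_list is a hypothetical helper name, and the pattern only accepts SRR, ERR and DRR run accessions.

```shell
# clean_run_list: hypothetical helper that reads accessions from standard input,
# removes carriage returns and surrounding whitespace, keeps only SRR/ERR/DRR
# run accessions, and drops duplicates while preserving the original order.
clean_run_list() {
    tr -d '\r' | awk '
        {
            gsub(/^[ \t]+|[ \t]+$/, "")
            if ($0 ~ /^[SED]RR[0-9]+$/ && !seen[$0]++) print
        }'
}
```

Usage: clean_run_list < runs.txt | xargs -P 4 -n 1 prefetch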

Downloading Assemblies

Single assembly:

datasets download genome accession \
GCF_000005845.2

Species:

datasets download genome taxon \
"Escherichia coli"

Complete genomes only:

datasets download genome taxon \
"Escherichia coli" \
--assembly-level complete

Metadata Mining

Metadata is often more important than sequence data.

Useful fields:

  • Host
  • Tissue
  • Country
  • Collection date
  • Isolation source
  • Disease status

Questions metadata can answer:

  • Is this human or environmental?
  • Which country?
  • Which year?
  • Which host species?
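
Questions like these can usually be answered with a quick tally of one metadata column. The sketch below assumes a tab-separated metadata.tsv with a header row; count_field is a hypothetical helper name, and empty cells are reported as "missing".

```shell
# count_field: hypothetical helper that tallies the values of one named column.
# Usage: count_field COLUMN_NAME < metadata.tsv
count_field() {
    awk -F '\t' -v name="$1" '
        NR == 1 {
            # Find the requested column in the header.
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            next
        }
        col {
            v = ($col == "" ? "missing" : $col)
            count[v]++
        }
        END {
            for (v in count) printf "%d\t%s\n", count[v], v
        }' | LC_ALL=C sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

Usage: count_field country < metadata.tsv prints each country with its sample count, most frequent first.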

ENA

The European Nucleotide Archive exchanges data with SRA and DDBJ through the INSDC, so most public runs in SRA are also available from ENA.

Advantages:

  • Direct FASTQ links
  • Strong APIs
  • Easier automation

Retrieve FASTQ URLs:

curl -s \
"https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJEB12345&result=read_run&fields=run_accession,fastq_ftp"

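The file report is a tab-separated table, and a paired-end run lists several files in one cell. The sketch below turns that table into one URL per line. It assumes the fastq_ftp column holds semicolon-separated paths without a scheme, so it prepends ftp:// to each; ena_fastq_urls is a hypothetical helper name.

```shell
# ena_fastq_urls: hypothetical helper that reads an ENA file report (TSV) from
# standard input and prints one FASTQ URL per line.
# Assumption: fastq_ftp holds semicolon-separated paths without a scheme.
ena_fastq_urls() {
    awk -F '\t' '
        NR == 1 {
            # Locate the fastq_ftp column by name.
            for (i = 1; i <= NF; i++) if ($i == "fastq_ftp") col = i
            next
        }
        col && $col != "" {
            n = split($col, parts, ";")
            for (j = 1; j <= n; j++) print "ftp://" parts[j]
        }'
}
```

Usage: pipe the curl command above into ena_fastq_urls and save the result as a download list.
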
GEO

The Gene Expression Omnibus is one of the most important transcriptomics repositories.

Contains:

  • Bulk RNA-seq
  • Single-cell RNA-seq
  • Spatial transcriptomics
  • Microarrays

Common accessions:

GSE12345
GSM67890
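
Series accessions also determine where the files sit on the GEO download server. The sketch below assumes the usual layout, in which series are grouped into directories named after the accession with its last three digits replaced by "nnn"; geo_series_dir is a hypothetical helper name.

```shell
# geo_series_dir: hypothetical helper that prints the download directory for a
# GEO Series accession.
# Assumption: series are grouped as GSE12nnn/GSE12345 (last three digits masked).
geo_series_dir() {
    local acc="$1"
    local digits="${acc#GSE}"
    local stub
    if [ "${#digits}" -le 3 ]; then
        stub="GSE"
    else
        stub="GSE${digits%???}"
    fi
    printf 'https://ftp.ncbi.nlm.nih.gov/geo/series/%snnn/%s/\n' "$stub" "$acc"
}
```

Usage: geo_series_dir GSE12345 prints the directory for that series; check the URL in a browser before scripting downloads against it.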

GEOquery (R)

library(GEOquery)

gse <- getGEO("GSE12345")

Useful for reproducible transcriptomics workflows.


ArrayExpress

European counterpart to GEO, hosted by EMBL-EBI and now part of BioStudies.

Contains:

  • RNA-seq
  • Functional genomics
  • Microarray studies

PRIDE

The main proteomics repository and a founding member of ProteomeXchange.

Example accession:

PXD012345

Contains:

  • Mass spectrometry
  • Peptide identifications
  • Protein quantification

MassIVE

Large-scale proteomics archive.

Often complements PRIDE.


CELLxGENE

Single-cell repository.

Contains:

  • AnnData objects
  • Cell annotations
  • Expression matrices

Ideal for Scanpy users.


Human Cell Atlas

Large international effort to catalog human cell types.

Useful for:

  • Cell annotation
  • Reference atlases
  • Marker discovery

GTEx

Genotype-Tissue Expression Project.

Useful for:

  • Tissue-specific expression
  • eQTL studies
  • Regulatory biology

GDC and TCGA

The Genomic Data Commons hosts:

  • TCGA
  • TARGET
  • CPTAC

Install:

conda install -c bioconda gdc-client

Download:

gdc-client download -m manifest.txt
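
Before starting a download it helps to know how large it will be. The sketch below sums the size column of a manifest. It assumes the manifest is tab-separated with a header row that includes a column named size, given in bytes; manifest_summary is a hypothetical helper name.

```shell
# manifest_summary: hypothetical helper that reads a GDC manifest from standard
# input and prints the number of files and their combined size.
# Assumption: tab-separated, with a header column named "size" in bytes.
manifest_summary() {
    awk -F '\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "size") col = i
            next
        }
        col { total += $col; n++ }
        END { printf "%d files, %.2f GB\n", n, total / 1e9 }'
}
```

Usage: manifest_summary < manifest.txt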

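Common TCGA cohorts (project IDs take the form TCGA-BRCA):

  • BRCA (breast invasive carcinoma)
  • LUAD (lung adenocarcinoma)
  • GBM (glioblastoma multiforme)
  • COAD (colon adenocarcinoma)
  • SKCM (skin cutaneous melanoma)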

UK Biobank

Contains:

  • Genotypes
  • Exomes
  • Whole genomes
  • Imaging
  • Phenotypes

One of the most important resources in modern human genetics.

Access is controlled.


dbGaP

Database of Genotypes and Phenotypes.

Contains:

  • Human genetics studies
  • Clinical datasets
  • Controlled-access sequencing

EGA

European Genome-Phenome Archive.

European equivalent of dbGaP.

Most datasets require approval.


Domain Guide: RNA-seq

Recommended repositories:

  1. GEO
  2. ArrayExpress
  3. SRA

Workflow:

GEO
 ↓
SRA
 ↓
FASTQ
 ↓
Alignment
 ↓
Counts

Domain Guide: Single Cell

Recommended repositories:

  1. CELLxGENE
  2. GEO
  3. Human Cell Atlas

Preferred formats:

  • h5ad (the on-disk format for AnnData)
  • Matrix Market (.mtx)

Domain Guide: GWAS

Recommended repositories:

  1. UK Biobank
  2. dbGaP
  3. FinnGen
  4. GWAS Catalog

Domain Guide: Cancer Genomics

Recommended repositories:

  1. GDC
  2. TCGA
  3. CPTAC
  4. ICGC

Domain Guide: Proteomics

Recommended repositories:

  1. PRIDE
  2. MassIVE

Domain Guide: Metagenomics

Recommended repositories:

  1. SRA
  2. ENA
  3. MG-RAST
  4. Tara Oceans

Domain Guide: Virology

Useful searches:

Influenza A:

esearch -db nuccore \
-query "Influenza A virus[ORGN]"

Coronaviruses:

esearch -db nuccore \
-query "Coronaviridae[ORGN]"

All viruses:

esearch -db nuccore \
-query "txid10239[ORGN]"

Domain Guide: Aquaculture

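White spot syndrome virus (WSSV):

esearch -db biosample \
-query "White spot syndrome virus"

Tilapia lake virus (TiLV):

esearch -db bioproject \
-query "Tilapia lake virus"

Acute hepatopancreatic necrosis disease (AHPND), formerly early mortality syndrome (EMS):

esearch -db bioproject \
-query "AHPND OR (EMS AND shrimp)"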

Cloud-Native Data Access

Increasingly, large datasets stay in cloud storage and the analysis runs next to them.

Platforms:

  • Terra
  • AnVIL
  • UK Biobank RAP
  • DNAnexus
  • Seven Bridges

Modern workflow:

Data
 ↓
Cloud Storage
 ↓
Cloud Compute
 ↓
Results

Dataset Evaluation Checklist

Before downloading:

  • Is metadata complete?
  • Is raw data available?
  • Are controls included?
  • Is geographic metadata available?
  • Are collection dates available?
  • Is sample size sufficient?
  • Is the study peer reviewed?
  • Is the repository maintained?
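
The metadata questions in this checklist can be partly automated. The sketch below reports, for every column of a tab-separated metadata table, how many rows carry a usable value. It treats empty cells and the placeholders "missing", "not collected" and "not applicable" as absent, which is an assumption about how submitters fill in blanks; metadata_completeness is a hypothetical helper name.

```shell
# metadata_completeness: hypothetical helper that reads a TSV with a header row
# from standard input and prints "column<TAB>filled/total" for each column.
# Assumption: empty cells and common placeholder strings count as absent.
metadata_completeness() {
    awk -F '\t' '
        NR == 1 {
            ncol = NF
            for (i = 1; i <= NF; i++) name[i] = $i
            next
        }
        {
            rows++
            for (i = 1; i <= ncol; i++) {
                v = tolower($i)
                if (v != "" && v != "missing" && v != "not collected" && v != "not applicable")
                    filled[i]++
            }
        }
        END {
            for (i = 1; i <= ncol; i++)
                printf "%s\t%d/%d\n", name[i], filled[i], rows
        }'
}
```

Usage: metadata_completeness < metadata.tsv. Columns with a low ratio are the ones to check before committing to a dataset.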

Reproducibility Checklist

Every project should contain:

project/
├── accessions.txt
├── metadata.tsv
├── download.sh
├── README.md
├── environment.yml
└── checksums.txt
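
The checksums.txt file is only useful if it is both written and checked. The sketch below shows one way to do each step, assuming GNU coreutils sha256sum is available and that the downloaded files end in .fastq.gz. Both function names are hypothetical.

```shell
# record_checksums: hypothetical helper that writes SHA-256 checksums for all
# .fastq.gz files in the given directory to checksums.txt in that directory.
# Assumption: sha256sum (GNU coreutils) is installed.
record_checksums() {
    ( cd "$1" && sha256sum ./*.fastq.gz > checksums.txt )
}

# verify_checksums: hypothetical helper that re-checks the files against
# checksums.txt and returns a non-zero status if any file has changed.
verify_checksums() {
    ( cd "$1" && sha256sum -c checksums.txt )
}
```

Usage: run record_checksums project/ once after downloading, then verify_checksums project/ whenever the data is copied or reused.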

Common Pitfalls

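RefSeq vs GenBank

RefSeq (GCF_ accessions):

  • Curated by NCBI
  • Stable
  • Recommended when a RefSeq version exists

GenBank (GCA_ accessions):

  • Larger
  • Community-submitted, so annotation quality varies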

Assembly Versions

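Always record the versioned accession:

GCF_000001405.39

and not simply the assembly name:

GRCh38

GRCh38 covers a series of patch releases, whereas GCF_000001405.39 identifies exactly one of them (GRCh38.p13).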
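
A simple pattern check catches unversioned entries before they reach accessions.txt. The sketch below assumes assembly accessions follow the form GCA_ or GCF_, nine digits, a dot and a version number; is_versioned_assembly is a hypothetical helper name.

```shell
# is_versioned_assembly: hypothetical helper that succeeds only when its
# argument is an assembly accession that includes a version suffix.
# Assumption: accessions look like GCF_000001405.39 (nine digits, dot, version).
is_versioned_assembly() {
    printf '%s\n' "$1" | grep -Eq '^GC[AF]_[0-9]{9}\.[0-9]+$'
}
```

Usage: is_versioned_assembly "$acc" || echo "unversioned: $acc"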

Missing Metadata

A small dataset with good metadata is often more valuable than a massive dataset with poor metadata.

Controlled Access

Not all data is public.

Examples:

  • UK Biobank
  • dbGaP
  • EGA

Fifty Useful Search Ideas

  • Human liver RNA-seq
  • Breast cancer RNA-seq
  • Single-cell PBMC
  • Influenza genomes
  • Coronavirus genomes
  • Complete bacterial genomes
  • Antibiotic resistance genes
  • White Spot Syndrome Virus
  • TiLV
  • Shrimp microbiome
  • Salmonella assemblies
  • E. coli assemblies
  • GTEx liver
  • TCGA BRCA
  • TCGA LUAD
  • Phosphoproteomics
  • Ocean metagenomes
  • Soil metagenomes
  • Wastewater viromes
  • Ancient DNA

(The list above is a starting set of twenty. Expand and adapt it to your specific domain.)