The Bioinformatics Dataset Handbook: A Complete Reference Manual
Bioinformatics is often taught as a collection of tools: BLAST, BWA, minimap2, STAR, Nextflow, Snakemake, Scanpy, DESeq2, Seurat.
Yet one of the most important practical skills receives surprisingly little attention: finding the right dataset.
This handbook focuses on where biological data comes from, how major repositories relate to each other, how to evaluate datasets before downloading them, and how to build reproducible workflows around public data.
Table of Contents
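Foundations
- Why Public Datasets Matter
- The Biological Data Ecosystem
- Common Accession Types
- Essential Tools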
NCBI
- The NCBI Ecosystem
- Searching NCBI Efficiently
- Entrez Direct Cookbook
- BioProject → BioSample → SRA → Assembly
- Downloading Sequencing Reads
- Downloading Assemblies
- Metadata Mining
Major Repositories
- ENA
- GEO
- GEOquery (R)
- ArrayExpress
- PRIDE
- MassIVE
- CELLxGENE
- Human Cell Atlas
- GTEx
- GDC and TCGA
- UK Biobank
- dbGaP
- EGA
Domain Guides
- Domain Guide: RNA-seq
- Domain Guide: Single Cell
- Domain Guide: GWAS
- Domain Guide: Cancer Genomics
- Domain Guide: Proteomics
- Domain Guide: Metagenomics
- Domain Guide: Virology
- Domain Guide: Aquaculture
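Modern Data Access
- Cloud-Native Data Access
Best Practices
- Dataset Evaluation Checklist
- Reproducibility Checklist
- Common Pitfalls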
Appendix
Why Public Datasets Matter
Public datasets allow you to:
- Learn bioinformatics without generating data
- Benchmark methods
- Reproduce publications
- Build portfolios
- Validate pipelines
- Train machine learning models
- Perform meta-analysis
Many influential bioinformatics papers are based entirely on public datasets.
The Biological Data Ecosystem
A useful mental model:
Publication
↓
BioProject
↓
BioSample
↓
Raw Reads (SRA)
↓
Assembly
↓
Annotation
↓
Downstream Analysis
Many beginners download FASTQ files immediately.
Experienced researchers often spend more time evaluating metadata than downloading sequences.
Common Accession Types
| Prefix | Meaning |
|---|---|
| PRJNA | NCBI BioProject |
| PRJEB | ENA BioProject |
| SAMN | BioSample |
| SAMEA | ENA BioSample |
| SRR | SRA Run |
| ERR | ENA Run |
| DRR | DDBJ Run |
| GCF | RefSeq Assembly |
| GCA | GenBank Assembly |
| GSE | GEO Series |
| GSM | GEO Sample |
| PXD | PRIDE Dataset |
| EGAS | EGA Study |
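The prefix table can be turned into a small lookup, which helps when a script receives a mixed list of accessions. The following is a minimal sketch: `classify_accession` is a hypothetical helper name, and it only checks the prefixes listed above, not the full accession syntax.

```shell
# Classify an accession by the prefixes listed in the table above.
# classify_accession is a hypothetical helper, not part of any NCBI or ENA tool.
classify_accession() {
    case "$1" in
        PRJNA*) echo "NCBI BioProject" ;;
        PRJEB*) echo "ENA BioProject" ;;
        SAMN*)  echo "BioSample" ;;
        SAMEA*) echo "ENA BioSample" ;;
        SRR*)   echo "SRA Run" ;;
        ERR*)   echo "ENA Run" ;;
        DRR*)   echo "DDBJ Run" ;;
        GCF_*)  echo "RefSeq Assembly" ;;
        GCA_*)  echo "GenBank Assembly" ;;
        GSE*)   echo "GEO Series" ;;
        GSM*)   echo "GEO Sample" ;;
        PXD*)   echo "PRIDE Dataset" ;;
        EGAS*)  echo "EGA Study" ;;
        *)      echo "unknown" ;;
    esac
}
```

Calling `classify_accession GCF_000005845.2` prints the label for the RefSeq prefix; anything outside the table falls through to `unknown`.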
Essential Tools
mamba install -c conda-forge -c bioconda \
entrez-direct \
sra-tools \
ncbi-datasets-cli \
seqkit \
csvtk \
pigz \
jq
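After installation, it can help to confirm that the executables are actually on PATH. The sketch below uses a hypothetical helper, `check_tools`. The executable names in the commented example are the ones these packages are expected to provide (for instance, the NCBI Datasets package installs a `datasets` command), so adjust them if your environment differs.

```shell
# Report which of the given executables are missing from PATH.
# Returns 0 if all are present, 1 otherwise.
# check_tools is a hypothetical helper name.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (commented out so the snippet has no side effects):
# check_tools esearch efetch xtract prefetch fasterq-dump datasets seqkit csvtk pigz jq
```

Running the check at the top of a download script lets it stop early with a clear message instead of failing halfway through a batch.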
Recommended Utilities
| Tool | Purpose |
|---|---|
| Entrez Direct | NCBI search |
| SRA Tools | Read downloads |
| NCBI Datasets CLI (datasets) | Assembly downloads |
| seqkit | FASTA/FASTQ |
| csvtk | Tables |
| jq | JSON |
| pigz | Compression |
The NCBI Ecosystem
BioProject
Represents a study.
Example:
PRJNA123456
Contains:
- Study description
- Publications
- Linked samples
- Linked runs
BioSample
Represents individual samples.
Contains:
- Host
- Tissue
- Country
- Collection date
- Isolation source
SRA
Stores raw sequencing reads.
Contains:
- Reads (stored in SRA format; converted to FASTQ with SRA Tools)
- Instrument metadata
- Library preparation information
Assembly
Stores assembled genomes.
Contains:
- FASTA
- GFF
- Protein sequences
- Annotation files
Searching NCBI Efficiently
Find metagenomes:
esearch -db bioproject \
-query "metagenome"
Find Vietnamese metagenomes:
esearch -db bioproject \
-query "metagenome AND Vietnam"
Find viral assemblies:
esearch -db assembly \
-query "txid10239[ORGN]"
Count results:
esearch -db assembly \
-query "txid10239[ORGN]" \
| xtract -pattern ENTREZ_DIRECT -element Count
Entrez Direct Cookbook
List Available Databases
einfo -dbs
View Search Fields
einfo -db nuccore \
| xtract -pattern Field \
-element Name Description
Fetch FASTA
esearch -db nuccore \
-query "txid10239[ORGN]" \
| efetch -format fasta
Fetch GenBank
esearch -db nuccore \
-query "txid10239[ORGN]" \
| efetch -format gb
Both commands retrieve every matching record, and txid10239 covers all viruses, so narrow the query (for example, to a single species or a more specific taxon ID) before piping it to efetch.
BioProject → BioSample → SRA → Assembly
This relationship explains most public sequencing datasets.
BioProject
↓
BioSample
↓
SRA
↓
FASTQ
Or:
BioProject
↓
Assembly
↓
Genome FASTA
Understanding this graph eliminates a huge amount of confusion.
Downloading Sequencing Reads
Recommended workflow:
prefetch SRR12345678
Convert:
fasterq-dump \
-e 8 \
-p \
SRR12345678
Compress:
pigz *.fastq
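Before moving on, it can be worth checking that each compressed file is a structurally complete FASTQ. The sketch below counts reads by dividing the line count by four, and fails when the count is not a multiple of four, which usually indicates truncation. `fastq_read_count` is a hypothetical helper, and the check assumes four-line FASTQ records.

```shell
# Print the number of reads in a gzip-compressed FASTQ file.
# Exits non-zero if the line count is not a multiple of four,
# which suggests a truncated or malformed file.
# fastq_read_count is a hypothetical helper name.
fastq_read_count() {
    gzip -dc "$1" | awk 'END { if (NR % 4 != 0) exit 1; print NR / 4 }'
}

# Example (commented out; the file name is a placeholder):
# fastq_read_count SRR12345678_1.fastq.gz
```

For paired-end data, running the function on both files and comparing the two counts is a cheap way to detect a mate file that was cut short.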
For large projects:
cat runs.txt \
| xargs -P 4 -n 1 prefetch
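Large batches are often interrupted. One way to make the batch resumable, sketched below, is to list only the runs that have no download directory yet and hand those to prefetch. `pending_runs` is a hypothetical helper, and it assumes prefetch was run from the current directory with its default layout, where each run is stored in a directory named after its accession.

```shell
# Print the run accessions from a list file that have no download directory yet.
# Assumption: prefetch was run in the current directory, so each run lives in
# a directory named after its accession.
# pending_runs is a hypothetical helper name.
pending_runs() {
    while IFS= read -r run || [ -n "$run" ]; do
        # Skip blank lines and comment lines.
        case "$run" in
            ''|'#'*) continue ;;
        esac
        [ -d "$run" ] || echo "$run"
    done < "$1"
}

# Example (commented out because it needs network access and sra-tools):
# pending_runs runs.txt | xargs -P 4 -n 1 prefetch
```

A directory can exist for a partially downloaded run, so treat this as a first pass rather than proof of completeness.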
Downloading Assemblies
Single assembly:
datasets download genome accession \
GCF_000005845.2
Species:
datasets download genome taxon \
"Escherichia coli"
Complete genomes only:
datasets download genome taxon \
"Escherichia coli" \
--assembly-level complete
Metadata Mining
Metadata is often more important than sequence data.
Useful fields:
- Host
- Tissue
- Country
- Collection date
- Isolation source
- Disease status
Questions metadata can answer:
- Is this human or environmental?
- Which country?
- Which year?
- Which host species?
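Questions like these usually come down to filtering a sample table by one column. The sketch below assumes the metadata has been exported to a tab-separated file with a header row. `filter_metadata` and the column names in the example are hypothetical.

```shell
# Print the header plus every row of a tab-separated table where the named
# column equals the given value.
# Usage: filter_metadata FILE COLUMN VALUE
# filter_metadata is a hypothetical helper name.
filter_metadata() {
    awk -F '\t' -v col="$2" -v val="$3" '
        NR == 1 {
            # Locate the requested column by its header name.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        $idx == val
    ' "$1"
}

# Example (commented out; file and column names are placeholders):
# filter_metadata metadata.tsv country Vietnam
```

Looking the column up by name rather than by position keeps the filter working when a repository adds or reorders fields.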
ENA
The European Nucleotide Archive exchanges data with SRA and DDBJ through the INSDC, so most public runs in SRA are also available from ENA.
Advantages:
- Direct FASTQ links
- Strong APIs
- Easier automation
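To illustrate the direct FASTQ links: a filereport from the ENA Portal API is a tab-separated table, and its fastq_ftp column holds one or more paths separated by semicolons, without a URL scheme. The sketch below turns such a report into one URL per line. `ena_fastq_urls` is a hypothetical helper, and prefixing `ftp://` is an assumption that matches the field name.

```shell
# Read an ENA filereport (tab-separated, with a header row containing a
# fastq_ftp column) on stdin and print one download URL per line.
# ena_fastq_urls is a hypothetical helper name.
ena_fastq_urls() {
    awk -F '\t' '
        NR == 1 {
            # Find the fastq_ftp column by name.
            for (i = 1; i <= NF; i++) if ($i == "fastq_ftp") col = i
            next
        }
        col && $col != "" {
            # Paired-end runs list several files separated by semicolons.
            n = split($col, parts, ";")
            for (j = 1; j <= n; j++) print "ftp://" parts[j]
        }
    '
}

# Example (commented out because it needs network access):
# curl -s "<filereport URL>" | ena_fastq_urls > fastq_urls.txt
```

Writing the URLs to a file first makes the download step easy to inspect, retry, and record alongside the accession list.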
Retrieve FASTQ URLs:
curl -s \
"https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJEB12345&result=read_run&fields=run_accession,fastq_ftp"
GEO
The Gene Expression Omnibus is one of the most important transcriptomics repositories.
Contains:
- Bulk RNA-seq
- Single-cell RNA-seq
- Spatial transcriptomics
- Microarrays
Common accessions:
GSE12345
GSM67890
GEOquery (R)
library(GEOquery)
gse <- getGEO("GSE12345")
Useful for reproducible transcriptomics workflows.
ArrayExpress
European alternative to GEO.
Contains:
- RNA-seq
- Functional genomics
- Microarray studies
PRIDE
Main proteomics repository.
Example accession:
PXD012345
Contains:
- Mass spectrometry
- Peptide identifications
- Protein quantification
MassIVE
Large-scale proteomics archive.
Often complements PRIDE.
CELLxGENE
Single-cell repository.
Contains:
- AnnData objects
- Cell annotations
- Expression matrices
Ideal for Scanpy users.
Human Cell Atlas
Large international effort to catalog human cell types.
Useful for:
- Cell annotation
- Reference atlases
- Marker discovery
GTEx
Genotype-Tissue Expression Project.
Useful for:
- Tissue-specific expression
- eQTL studies
- Regulatory biology
GDC and TCGA
The Genomic Data Commons hosts:
- TCGA
- TARGET
- CPTAC
Install:
conda install -c bioconda gdc-client
Download:
gdc-client download -m manifest.txt
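Manifests can reference terabytes of data, so it is worth checking the total size before starting a download. GDC manifests are tab-separated; the sketch below assumes the usual header (id, filename, md5, size, state) and sums the size column. `manifest_total_bytes` is a hypothetical helper.

```shell
# Sum the "size" column (bytes) of a tab-separated GDC manifest.
# Exits non-zero if the manifest has no column named "size".
# manifest_total_bytes is a hypothetical helper name.
manifest_total_bytes() {
    awk -F '\t' '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "size") col = i
            next
        }
        col { total += $col }
        END {
            if (!col) exit 1
            printf "%.0f\n", total
        }
    ' "$1"
}

# Example (commented out; manifest.txt is the file passed to gdc-client):
# manifest_total_bytes manifest.txt
```

Comparing the result with free disk space before running gdc-client avoids a failed batch partway through.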
Common cohorts:
- BRCA
- LUAD
- GBM
- COAD
- SKCM
UK Biobank
Contains:
- Genotypes
- Exomes
- Whole genomes
- Imaging
- Phenotypes
One of the most important resources in modern human genetics.
Access is controlled.
dbGaP
Database of Genotypes and Phenotypes.
Contains:
- Human genetics studies
- Clinical datasets
- Controlled-access sequencing
EGA
European Genome-Phenome Archive.
European equivalent of dbGaP.
Most datasets require approval.
Domain Guide: RNA-seq
Recommended repositories:
- GEO
- ArrayExpress
- SRA
Workflow:
GEO
↓
SRA
↓
FASTQ
↓
Alignment
↓
Counts
Domain Guide: Single Cell
Recommended repositories:
- CELLxGENE
- GEO
- Human Cell Atlas
Preferred formats:
- h5ad (the on-disk format for AnnData objects)
- Matrix Market
Domain Guide: GWAS
Recommended repositories:
- UK Biobank
- dbGaP
- FinnGen
- GWAS Catalog
Domain Guide: Cancer Genomics
Recommended repositories:
- GDC
- TCGA
- CPTAC
- ICGC
Domain Guide: Proteomics
Recommended repositories:
- PRIDE
- MassIVE
Domain Guide: Metagenomics
Recommended repositories:
- SRA
- ENA
- MG-RAST
- Tara Oceans
Domain Guide: Virology
Useful searches:
esearch -db nuccore \
-query "Influenza A virus[Organism]"
esearch -db nuccore \
-query "Coronaviridae[Organism]"
esearch -db nuccore \
-query "txid10239[ORGN]"
Domain Guide: Aquaculture
White Spot Syndrome Virus:
esearch -db biosample \
-query "White spot syndrome virus"
TiLV:
esearch -db bioproject \
-query "Tilapia lake virus"
AHPND:
esearch -db bioproject \
-query "(AHPND OR EMS) AND shrimp"
Cloud-Native Data Access
Increasingly, data stays in the cloud.
Platforms:
- Terra
- AnVIL
- UK Biobank RAP
- DNAnexus
- Seven Bridges
Modern workflow:
Data
↓
Cloud Storage
↓
Cloud Compute
↓
Results
Dataset Evaluation Checklist
Before downloading:
- Is metadata complete?
- Is raw data available?
- Are controls included?
- Is geographic metadata available?
- Are collection dates available?
- Is sample size sufficient?
- Is the study peer reviewed?
- Is the repository maintained?
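The metadata questions in this checklist can be partly automated. As a rough sketch, the function below counts, for each column of a tab-separated sample table, how many rows are empty or hold a placeholder such as "missing" or "not collected". The function name and the placeholder list are assumptions; extend the list to match the repository you are working with.

```shell
# For each column of a tab-separated table with a header row, print the
# column name and the number of rows whose value is empty or a placeholder.
# metadata_missing_counts is a hypothetical helper name.
metadata_missing_counts() {
    awk -F '\t' '
        BEGIN {
            # Placeholder values treated as missing (an assumed, non-exhaustive list).
            n = split("na|n/a|missing|not collected|not applicable|not provided", p, "|")
            for (i = 1; i <= n; i++) placeholder[p[i]] = 1
        }
        NR == 1 {
            ncol = NF
            for (i = 1; i <= NF; i++) name[i] = $i
            next
        }
        {
            for (i = 1; i <= ncol; i++) {
                v = tolower($i)
                if (v == "" || (v in placeholder)) missing[i]++
            }
        }
        END {
            for (i = 1; i <= ncol; i++) printf "%s\t%d\n", name[i], missing[i]
        }
    ' "$1"
}

# Example (commented out; the file name is a placeholder):
# metadata_missing_counts metadata.tsv
```

A column that is missing in most rows, such as collection date, is usually a sign that the dataset cannot support analyses that depend on that field.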
Reproducibility Checklist
Every project should contain:
project/
├── accessions.txt
├── metadata.tsv
├── download.sh
├── README.md
├── environment.yml
└── checksums.txt
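Of these files, checksums.txt is the easiest to generate automatically. The sketch below shows one way to produce and check it, assuming GNU coreutils `sha256sum` and compressed FASTQ files sitting in the same directory as checksums.txt. Both function names are hypothetical.

```shell
# Write SHA-256 checksums for all *.fastq.gz files in a directory
# to checksums.txt inside that directory.
# Assumption: GNU coreutils sha256sum is installed.
write_checksums() {
    ( cd "$1" && sha256sum -- *.fastq.gz > checksums.txt )
}

# Verify the files against checksums.txt; returns non-zero on any mismatch.
verify_checksums() {
    ( cd "$1" && sha256sum -c checksums.txt > /dev/null )
}

# Example (commented out; "project" is the directory shown above):
# write_checksums project
# verify_checksums project
```

Running the verification step after copying a project to another machine confirms that the data being analysed is byte-for-byte what was downloaded.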
Common Pitfalls
RefSeq vs GenBank
RefSeq (GCF_ accessions):
- Curated by NCBI
- Stable
- Recommended
GenBank (GCA_ accessions):
- Larger
- Submitter-provided
Assembly Versions
Always record:
GCF_000001405.39
and not simply:
GRCh38
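A simple guard can catch unversioned assembly identifiers before they reach an accession list or a methods section. The sketch assumes the standard assembly accession pattern (GCA_ or GCF_, nine digits, a dot, and a version number). `is_versioned_assembly` is a hypothetical helper.

```shell
# Succeed only if the argument looks like a versioned assembly accession:
# GCA_ or GCF_, nine digits, a dot, and a version number.
# is_versioned_assembly is a hypothetical helper name.
is_versioned_assembly() {
    printf '%s\n' "$1" | grep -Eq '^GC[AF]_[0-9]{9}\.[0-9]+$'
}

# Example (commented out):
# is_versioned_assembly GCF_000001405.39 && echo "ok"
# is_versioned_assembly GRCh38 || echo "record the versioned accession instead"
```

The check is purely syntactic: it confirms that a version is recorded, not that the accession exists.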
Missing Metadata
A small dataset with good metadata is often more valuable than a massive dataset with poor metadata.
Controlled Access
Not all data is public.
Examples:
- UK Biobank
- dbGaP
- EGA
Twenty Useful Search Ideas
- Human liver RNA-seq
- Breast cancer RNA-seq
- Single-cell PBMC
- Influenza genomes
- Coronavirus genomes
- Complete bacterial genomes
- Antibiotic resistance genes
- White Spot Syndrome Virus
- TiLV
- Shrimp microbiome
- Salmonella assemblies
- E. coli assemblies
- GTEx liver
- TCGA BRCA
- TCGA LUAD
- Proteomics phosphoproteomics
- Ocean metagenomes
- Soil metagenomes
- Wastewater viromes
- Ancient DNA
(Expand and adapt these to your specific domain.)