Hierarchical Function Classification using UniProt and KEGG
Introduction
When working on metagenomic projects, a common question arises: “How do we determine which functional group(s) a gene of interest belongs to?” In this guide, I will demonstrate an easy and effective way to classify genes using the KEGG and UniProt databases.
KEGG (Kyoto Encyclopedia of Genes and Genomes) is a widely known resource for studying biological pathways. Beyond pathways, KEGG also provides insight into gene functions through a hierarchical classification system called KEGG BRITE, which organizes biological objects such as:
- Genes and proteins
- Compounds and reactions
- Drugs
- Diseases
- Organisms and viruses
By combining KEGG BRITE with UniProt, you can categorize genes into their respective functional hierarchies. This tutorial will guide you through the steps to achieve this classification efficiently.
Step 1: Download Necessary Databases
First, download the following datasets:
After downloading and unzipping the UniProt Swiss-Prot file, you will get a FASTA file containing approximately 568,000 records. This will be used to extract UniProt IDs and map them to KEGG gene IDs, which will help link these genes to functional KO (KEGG Orthology) IDs.
Step 2: Map UniProt IDs to KEGG Gene and KO IDs
We will map the UniProt IDs to KEGG gene IDs and KO IDs using the KEGG API. You can use the following command to create a table with UniProt ID, KEGG Gene ID, and KO ID:
for i in $(awk '/^>/ {match($0, /\|(.+)*\|/, a); print a[1]}' uniprot_sprot.fasta); do
curl -s -L https://rest.kegg.jp/conv/genes/uniprot:${i} | awk 'BEGIN {FS=OFS="\t"} {"curl -s -L https://rest.kegg.jp/link/ko/" $2 | getline l; if (l!="") print $1, l}';
done > uniprot_genes_ko.tsv
This command retrieves the gene and KO IDs from KEGG and stores them in a TSV file. Since the process can take a long time, consider limiting the number of simultaneous connections using the xargs command:
awk '/^>/ {match($0, /\|(.+)*\|/, a); print a[1]}' uniprot_sprot.fasta | xargs -P 3 -I {} bash -c '
awk -v uniprot_id={} '\''BEGIN {FS=OFS="\t"; "curl -s -L https://rest.kegg.jp/conv/genes/uniprot:" uniprot_id | getline conv; split(conv, arr, "\t"); "curl -s -L https://rest.kegg.jp/link/ko/" arr[2] | getline link; if (link!="") print arr[1], link}'\''
>> uniprot_genes_ko.tsv'
⚠️ Note: Limit the number of parallel connections to 3 to avoid being blocked by the KEGG firewall.
Step 3: Convert KEGG BRITE JSON to TSV Format
Next, convert the KEGG BRITE hierarchical data from JSON to TSV format:
sed -E 's/^\t{2}"name"/\t\t"level 1"/g; s/^\t{3}"name"/\t\t\t"level 2"/g; s/^\t{4}"name"/\t\t\t\t"level 3"/g; s/^\t{5}"name"/\t\t\t\t\t"level 4"/g' json | awk '
BEGIN {OFS="\t"}
NR > 4 {match($0, /"([^"]+)": *("[^"]*")/, a)}
{tag = a[1]; val = gensub(/^"|"$/, "", "g", a[2]); f[tag] = val; if (tag == "level 4") {print f["level 1"], f["level 2"], f["level 3"], f["level 4"]}}' > brite.tsv
This will generate a hierarchical table in TSV format, which shows the first few rows like this:
Level 1 | Level 2 | Level 3 | Level 4 |
---|---|---|---|
09100 Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010] | K00844 HK; hexokinase [EC:2.7.1.1] |
09100 Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010] | K12407 GCK; glucokinase [EC:2.7.1.2] |
09100 Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010] | K00845 glk; glucokinase [EC:2.7.1.2] |
09100 Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010] | K25026 glk; glucokinase [EC:2.7.1.2] |
09100 Metabolism | 09101 Carbohydrate metabolism | 00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010] | K01810 GPI, pgi; glucose-6-phosphate isomerase [EC:5.3.1.9] |
Step 4: Clean and Reorganize the BRITE Data
To tidy up the BRITE data, you can run the following command:
awk -i inplace 'BEGIN {FS=OFS="\t"} {for (i=1; i<=3; i++) {sub(/[0-9]* /, "", $i)}; j=""; n=patsplit($4, a, /[^ ]*/); for (i=3; i<=n; i++) {j=j" "a[i]}; gsub(/^ /, "", j); print a[1], $1, $2, $3, j}' brite.tsv
This will create a cleaner output like:
KO ID | Level 1 | Level 2 | Level 3 | Level 4 |
---|---|---|---|---|
K00844 | Metabolism | Carbohydrate metabolism | Glycolysis \/ Gluconeogenesis [PATH:ko00010] | HK; hexokinase [EC:2.7.1.1] |
K12407 | Metabolism | Carbohydrate metabolism | Glycolysis \/ Gluconeogenesis [PATH:ko00010] | GCK; glucokinase [EC:2.7.1.2] |
K00845 | Metabolism | Carbohydrate metabolism | Glycolysis \/ Gluconeogenesis [PATH:ko00010] | glk; glucokinase [EC:2.7.1.2] |
K25026 | Metabolism | Carbohydrate metabolism | Glycolysis \/ Gluconeogenesis [PATH:ko00010] | glk; glucokinase [EC:2.7.1.2] |
K01810 | Metabolism | Carbohydrate metabolism | Glycolysis \/ Gluconeogenesis [PATH:ko00010] | GPI, pgi; glucose-6-phosphate isomerase [EC:5.3.1.9] |
Step 5: Merge UniProt and KEGG BRITE Data
Now, merge the UniProt and BRITE data:
awk 'BEGIN {FS=OFS="\t"} FNR==NR {a[$1][i++]=$2 FS $3 FS $4 FS $5; next} {split($3, b, ":"); split($1, c, ":"); if (b[2] in a) for (i in a[b[2]]) print c[2], b[2], a[b[2]][i]}' brite.tsv uniprot_genes_ko.tsv > uniprot_brite.tsv
The merged data will have rows like:
UniProt ID | KO ID | Level 1 | Level 2 | Level 3 | Level 4 |
---|---|---|---|---|---|
Q6GZS4 | K12408 | Metabolism | Lipid metabolism | Primary bile acid biosynthesis [PATH:ko00120] | HSD3B7; cholest-5-ene-3beta,7alpha-diol 3beta-dehydrogenase [EC:1.1.1.181] |
P32234 | K06944 | Not Included in Pathway or Brite | Unclassified: metabolism | Enzymes with EC numbers | DRG, RBG; developmentally-regulated GTP-binding protein [EC:3.6.5.-] |
Q92AT0 | K21298 | Not Included in Pathway or Brite | Unclassified: metabolism | Enzymes with EC numbers | E2.4.1.333; 1,2-beta-oligoglucan phosphorylase [EC:2.4.1.333] |
P81928 | K23505 | Brite Hierarchies | Protein families: genetic information processing | Mitochondrial biogenesis [BR:ko03029] | TIMMDC1; complex I assembly factor TIMMDC1 |
P48347 | K06630 | Environmental Information Processing | Signal transduction | MAPK signaling pathway - yeast [PATH:ko04011] | YWHAE; 14-3-3 protein epsilon |
Step 6: Updating the Database (Optional)
To update the database, start by downloading the latest UniProt Swiss-Prot version. You can identify new entries by running:
awk 'fname!=FILENAME {fname=FILENAME; idx++} idx==1 {if ($0~/^>/) {match($0, /\|(.+)*\|/, id1); a[id1[1]][FILENAME]=b[id1[1]]+=1}} idx==2 {match($0, /:(.+)* /, id2); a[id2[1]][FILENAME]=b[id2[1]]+=1} END {for (i in b) {if (b[i]==1) {for (j in a[i]) {print i, j}} else if (b[i]>1) {j=""; for (k in a[i]) {j=j k " "}; print i, j}}}' uniprot_sprot.fasta uniprot_genes_ko.tsv > temp.tsv
Then, fetch and append the new entries:
for i in $(awk -v FS='\t' '{print $1}' temp.tsv); do
curl -s -L https://rest.kegg.jp/conv/genes/uniprot:${i} | awk 'BEGIN {FS=OFS="\t"} {"curl -s -L https://rest.kegg.jp/link/ko/" $2 | getline l; if (l!="") print $1, l}';
done >> uniprot_genes_ko.tsv
Conclusion
By following this guide, you can effectively classify gene functions into hierarchical groups using UniProt and KEGG BRITE. This approach allows you to understand which functional groups genes belong to in a structured and automated way, especially when working with metagenomics or large datasets.
[1] https://raw.githubusercontent.com/tiendu/tiendu.github.io/main/assets/files/uniprot_genes_ko.tsv
[2] https://raw.githubusercontent.com/tiendu/tiendu.github.io/main/assets/files/brite.tsv