Hierarchical Function Classification using UniProt and KEGG

When working on metagenomic projects, a common question arises: “How do we determine which functional group(s) a gene of interest belongs to?” In this guide, I will demonstrate an easy and effective way to classify genes using the KEGG and UniProt databases.

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a widely known resource for studying biological pathways. Beyond pathways, KEGG also provides insight into gene functions through a hierarchical classification system called KEGG BRITE, which organizes biological objects such as:

Genes and proteins
Compounds and reactions
Drugs
Diseases
Organisms and viruses

By combining KEGG BRITE with UniProt, you can categorize genes into their respective functional hierarchies. This tutorial will guide you through the steps to achieve this classification efficiently.

Step 1: Download Necessary Databases

First, download the following datasets:

After downloading and unzipping the UniProt Swiss-Prot file, you will get a FASTA file containing approximately 568,000 records. This will be used to extract UniProt IDs and map them to KEGG gene IDs, which will help link these genes to functional KO (KEGG Orthology) IDs.

Step 2: Map UniProt IDs to KEGG Gene and KO IDs

We will map the UniProt IDs to KEGG gene IDs and KO IDs using the KEGG API. You can use the following command to create a table with UniProt ID, KEGG Gene ID, and KO ID:

for i in $(awk '/^>/ {match($0, /\|(.+)*\|/, a); print a[1]}' uniprot_sprot.fasta); do
  curl -s -L https://rest.kegg.jp/conv/genes/uniprot:${i} | awk 'BEGIN {FS=OFS="\t"} {"curl -s -L https://rest.kegg.jp/link/ko/" $2 | getline l; if (l!="") print $1, l}';
done > uniprot_genes_ko.tsv

This command retrieves the gene and KO IDs from KEGG and stores them in a TSV file. Since the process can take a long time, consider limiting the number of simultaneous connections using the xargs command:

awk '/^>/ {match($0, /\|(.+)*\|/, a); print a[1]}' uniprot_sprot.fasta | xargs -P 3 -I {} bash -c '
  awk -v uniprot_id={} '\''BEGIN {FS=OFS="\t"; "curl -s -L https://rest.kegg.jp/conv/genes/uniprot:" uniprot_id | getline conv; split(conv, arr, "\t"); "curl -s -L https://rest.kegg.jp/link/ko/" arr[2] | getline link; if (link!="") print arr[1], link}'\''
  >> uniprot_genes_ko.tsv'

⚠️ Note: Limit the number of parallel connections to 3 to avoid being blocked by the KEGG firewall.

Step 3: Convert KEGG BRITE JSON to TSV Format

Next, convert the KEGG BRITE hierarchical data from JSON to TSV format:

sed -E 's/^\t{2}"name"/\t\t"level 1"/g; s/^\t{3}"name"/\t\t\t"level 2"/g; s/^\t{4}"name"/\t\t\t\t"level 3"/g; s/^\t{5}"name"/\t\t\t\t\t"level 4"/g' json | awk '
BEGIN {OFS="\t"} 
NR > 4 {match($0, /"([^"]+)": *("[^"]*")/, a)} 
{tag = a[1]; val = gensub(/^"|"$/, "", "g", a[2]); f[tag] = val; if (tag == "level 4") {print f["level 1"], f["level 2"], f["level 3"], f["level 4"]}}' > brite.tsv

This will generate a hierarchical table in TSV format, which shows the first few rows like this:

Level 1	Level 2	Level 3	Level 4
09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]	K00844 HK; hexokinase [EC:2.7.1.1]
09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]	K12407 GCK; glucokinase [EC:2.7.1.2]
09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]	K00845 glk; glucokinase [EC:2.7.1.2]
09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]	K25026 glk; glucokinase [EC:2.7.1.2]
09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis \/ Gluconeogenesis [PATH:ko00010]	K01810 GPI, pgi; glucose-6-phosphate isomerase [EC:5.3.1.9]

Step 4: Clean and Reorganize the BRITE Data

To tidy up the BRITE data, you can run the following command:

awk -i inplace 'BEGIN {FS=OFS="\t"} {for (i=1; i<=3; i++) {sub(/[0-9]* /, "", $i)}; j=""; n=patsplit($4, a, /[^ ]*/); for (i=3; i<=n; i++) {j=j" "a[i]}; gsub(/^ /, "", j); print a[1], $1, $2, $3, j}' brite.tsv

This will create a cleaner output like:

KO ID	Level 1	Level 2	Level 3	Level 4
K00844	Metabolism	Carbohydrate metabolism	Glycolysis \/ Gluconeogenesis [PATH:ko00010]	HK; hexokinase [EC:2.7.1.1]
K12407	Metabolism	Carbohydrate metabolism	Glycolysis \/ Gluconeogenesis [PATH:ko00010]	GCK; glucokinase [EC:2.7.1.2]
K00845	Metabolism	Carbohydrate metabolism	Glycolysis \/ Gluconeogenesis [PATH:ko00010]	glk; glucokinase [EC:2.7.1.2]
K25026	Metabolism	Carbohydrate metabolism	Glycolysis \/ Gluconeogenesis [PATH:ko00010]	glk; glucokinase [EC:2.7.1.2]
K01810	Metabolism	Carbohydrate metabolism	Glycolysis \/ Gluconeogenesis [PATH:ko00010]	GPI, pgi; glucose-6-phosphate isomerase [EC:5.3.1.9]

Step 5: Merge UniProt and KEGG BRITE Data

Now, merge the UniProt and BRITE data:

awk 'BEGIN {FS=OFS="\t"} FNR==NR {a[$1][i++]=$2 FS $3 FS $4 FS $5; next} {split($3, b, ":"); split($1, c, ":"); if (b[2] in a) for (i in a[b[2]]) print c[2], b[2], a[b[2]][i]}' brite.tsv uniprot_genes_ko.tsv > uniprot_brite.tsv

The merged data will have rows like:

UniProt ID	KO ID	Level 1	Level 2	Level 3	Level 4
Q6GZS4	K12408	Metabolism	Lipid metabolism	Primary bile acid biosynthesis [PATH:ko00120]	HSD3B7; cholest-5-ene-3beta,7alpha-diol 3beta-dehydrogenase [EC:1.1.1.181]
P32234	K06944	Not Included in Pathway or Brite	Unclassified: metabolism	Enzymes with EC numbers	DRG, RBG; developmentally-regulated GTP-binding protein [EC:3.6.5.-]
Q92AT0	K21298	Not Included in Pathway or Brite	Unclassified: metabolism	Enzymes with EC numbers	E2.4.1.333; 1,2-beta-oligoglucan phosphorylase [EC:2.4.1.333]
P81928	K23505	Brite Hierarchies	Protein families: genetic information processing	Mitochondrial biogenesis [BR:ko03029]	TIMMDC1; complex I assembly factor TIMMDC1
P48347	K06630	Environmental Information Processing	Signal transduction	MAPK signaling pathway - yeast [PATH:ko04011]	YWHAE; 14-3-3 protein epsilon

Step 6: Updating the Database (Optional)

To update the database, start by downloading the latest UniProt Swiss-Prot version. You can identify new entries by running:

awk 'fname!=FILENAME {fname=FILENAME; idx++} idx==1 {if ($0~/^>/) {match($0, /\|(.+)*\|/, id1); a[id1[1]][FILENAME]=b[id1[1]]+=1}} idx==2 {match($0, /:(.+)* /, id2); a[id2[1]][FILENAME]=b[id2[1]]+=1} END {for (i in b) {if (b[i]==1) {for (j in a[i]) {print i, j}} else if (b[i]>1) {j=""; for (k in a[i]) {j=j k " "}; print i, j}}}' uniprot_sprot.fasta uniprot_genes_ko.tsv > temp.tsv

Then, fetch and append the new entries:

for i in $(awk -v FS='\t' '{print $1}' temp.tsv); do
  curl -s -L https://rest.kegg.jp/conv/genes/uniprot:${i} | awk 'BEGIN {FS=OFS="\t"} {"curl -s -L https://rest.kegg.jp/link/ko/" $2 | getline l; if (l!="") print $1, l}';
done >> uniprot_genes_ko.tsv

Conclusion

By following this guide, you can effectively classify gene functions into hierarchical groups using UniProt and KEGG BRITE. This approach allows you to understand which functional groups genes belong to in a structured and automated way, especially when working with metagenomics or large datasets.

[1] https://raw.githubusercontent.com/tiendu/tiendu.github.io/main/assets/files/uniprot_genes_ko.tsv

[2] https://raw.githubusercontent.com/tiendu/tiendu.github.io/main/assets/files/brite.tsv