As of December 2021, there are more than 220 million publicly available protein sequences stored in the UniProtKB protein database awaiting manual annotation by expert biocurators. That work will uncover the links between proteins and their functions in living organisms. It can improve understanding of diseases and advance drug discovery. But, because the volume of this data is so massive, those annotations are nearly impossible to do manually.
“It is essential to use artificial intelligence to develop accurate, automatic protein function annotation techniques,” says Bishnu Sarker, Ph.D., assistant professor, computer science and data science.
Some such techniques already exist, but Dr. Sarker explored a new path: learning numerical representations of proteins from a biomedical knowledge graph, leveraging the power of a generative adversarial network (GAN).
Dr. Sarker and his colleagues Marie-Dominique Devignes, associate researcher, CNRS; Guy Wolf, associate professor, Université de Montréal; and Sabeur Aridhi, associate professor in computer science, University of Lorraine, published their study as Prot-A-GAN: Automatic Protein Function Annotation using GAN-inspired Knowledge Graph Embedding.
Dr. Sarker and his partners used a knowledge graph to connect diverse sources of biological data relating to proteins and to identify positive links, known as triples, between proteins and the Gene Ontology (GO) annotations that describe a particular protein's function. Then they trained a complex machine learning model that combines the strengths of GANs, random walks, and reinforcement learning. In Prot-A-GAN, these ingredients work together to build a powerful agent that automates biocuration for protein function annotation. The GAN refines the agent by giving it corrective feedback, so that it produces credible annotations.
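To make the idea of triples concrete, here is a minimal Python sketch of a knowledge graph stored as (head, relation, tail) triples. The identifiers, relation names, and helper functions are hypothetical and chosen only for illustration; they are not taken from the Prot-A-GAN study or from UniProtKB.

```python
from collections import defaultdict

# Hypothetical toy triples (head, relation, tail); all names are illustrative.
TRIPLES = [
    ("protein:P1", "has_domain", "domain:D1"),
    ("protein:P2", "has_domain", "domain:D1"),
    ("protein:P1", "annotated_with", "GO:A"),
    ("protein:P2", "similar_to", "protein:P1"),
]

def build_graph(triples):
    """Index triples by node so each node lists its outgoing and inverse edges."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
        graph[tail].append(("inverse_" + relation, head))
    return graph

def known_annotations(triples, protein):
    """Return the GO terms already linked to a protein by a positive triple."""
    return {t for h, r, t in triples if h == protein and r == "annotated_with"}

graph = build_graph(TRIPLES)
print(known_annotations(TRIPLES, "protein:P1"))  # annotated protein
print(known_annotations(TRIPLES, "protein:P2"))  # protein still awaiting annotation
```

In this toy graph, the second protein has no annotation yet but shares a domain with an annotated one, which is the kind of indirect evidence a knowledge graph makes available.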
“GAN has two main components: 1) discriminator, 2) generator. We started by training a discriminator using domain-adaptive negative triples plus generated triples. The performance of the discriminator is fed to the generator to update its embeddings in the right direction,” says Dr. Sarker.
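The feedback loop Dr. Sarker describes can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the Prot-A-GAN implementation: the dot-product scoring, the policy-gradient (REINFORCE-style) generator update, the toy entity indices, and every function name here are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ENTITIES, DIM = 6, 4  # toy sizes, chosen only for illustration
disc_emb = rng.normal(scale=0.1, size=(N_ENTITIES, DIM))  # discriminator embeddings
gen_emb = rng.normal(scale=0.1, size=(N_ENTITIES, DIM))   # generator embeddings

def disc_prob(head, tail):
    """Discriminator's probability that (head, tail) is a true link."""
    return 1.0 / (1.0 + np.exp(-(disc_emb[head] @ disc_emb[tail])))

def disc_step(head, tail, label, lr=0.5):
    """One gradient step on binary cross-entropy for a single pair."""
    err = disc_prob(head, tail) - label  # derivative of the loss w.r.t. the logit
    grad_head = err * disc_emb[tail]
    grad_tail = err * disc_emb[head]
    disc_emb[head] -= lr * grad_head
    disc_emb[tail] -= lr * grad_tail

def gen_probs(head, candidates):
    """Generator's softmax distribution over candidate tails."""
    logits = gen_emb[candidates] @ gen_emb[head]
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

def gen_step(head, candidates, lr=0.5):
    """Sample a tail, then reinforce it in proportion to the discriminator's score."""
    probs = gen_probs(head, candidates)
    idx = rng.choice(len(candidates), p=probs)
    reward = disc_prob(head, candidates[idx])  # feedback from the discriminator
    dlogits = -probs
    dlogits[idx] += 1.0  # gradient of log p(idx) w.r.t. the logits
    grad_head = dlogits @ gen_emb[candidates]
    grad_cands = np.outer(dlogits, gen_emb[head])
    gen_emb[head] += lr * reward * grad_head
    gen_emb[candidates] += lr * reward * grad_cands
    return int(candidates[idx])

PROTEIN, TRUE_GO, WRONG_GO = 0, 3, 4  # hypothetical entity indices
CANDIDATES = np.array([3, 4, 5])      # hypothetical candidate GO terms

for _ in range(200):
    disc_step(PROTEIN, TRUE_GO, label=1.0)   # positive triple
    disc_step(PROTEIN, WRONG_GO, label=0.0)  # fixed negative triple
    fake = gen_step(PROTEIN, CANDIDATES)     # generated triple
    if fake != TRUE_GO:
        disc_step(PROTEIN, fake, label=0.0)  # generated negatives also train the discriminator

print(disc_prob(PROTEIN, TRUE_GO), disc_prob(PROTEIN, WRONG_GO))
```

The design point is the direction of information flow: the discriminator learns from labeled and generated triples, and its score becomes the reward that moves the generator's embeddings.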
“These embeddings are then used by an agent that performs the random walk over the knowledge graph to identify paths between proteins and GO annotations,” says Dr. Sarker.
That random walk, guided by the GAN training model, identified new links between proteins and GO annotations and explained why they are linked. Future studies could explore links between a drug target, a drug candidate or a small molecule.
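The embedding-guided random walk can be illustrated with a small sketch. It assumes a toy directed graph, random stand-in embeddings, and a softmax over dot-product similarity as the step policy; none of these details come from the study, and all node and function names are hypothetical. The returned path alternates nodes and relations, which is what lets a walk double as an explanation of why a protein and a GO term are linked.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical directed toy graph: node -> list of (relation, neighbor).
EDGES = {
    "protein:P2": [("similar_to", "protein:P1"), ("has_domain", "domain:D1")],
    "protein:P1": [("has_domain", "domain:D1"), ("annotated_with", "GO:A")],
    "domain:D1": [("associated_with", "GO:A")],
    "GO:A": [],
}
# Random vectors standing in for learned embeddings (an assumption of this sketch).
EMB = {node: rng.normal(size=4) for node in EDGES}

def guided_walk(start, max_steps=5):
    """Walk from a protein until a GO term is reached, recording the path taken."""
    path = [start]
    node = start
    for _ in range(max_steps):
        if node.startswith("GO:") or not EDGES[node]:
            break
        relations, neighbors = zip(*EDGES[node])
        # Favor neighbors whose embedding is similar to the current node's.
        scores = np.array([EMB[node] @ EMB[n] for n in neighbors])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        choice = rng.choice(len(neighbors), p=probs)
        path += [relations[choice], neighbors[choice]]
        node = neighbors[choice]
    return path

path = guided_walk("protein:P2")
print(" -> ".join(path))
```

In this toy graph every walk from the unannotated protein ends at the same GO term, so the printed path reads as a candidate annotation together with the chain of evidence that supports it.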
“The GAN approach is very complex in terms of training and learning, but it also gives you the advantage of a feedback loop that learns like a human,” says Dr. Sarker.
This approach also offers benefits that other data science methods do not.
“Prot-A-GAN is equally applicable for drug repurposing, where we identify how existing drugs can be repurposed to treat new diseases,” says Dr. Sarker. “It is a tremendous opportunity to make treatment more affordable.”
“What is especially promising is that while most data science methods predict relationships, this approach actually finds these links,” says Dr. Sarker.
Applying this adversarial training on a knowledge graph is very computationally expensive and requires advanced hardware.
“Knowledge graphs become more useful as you add more data. But then they become really computationally expensive. The GAN approach makes it even more complex, but provides those human-like learning advantages,” says Dr. Sarker.
All of this requires heavy computational resources and technologies that Dr. Sarker, who joined the School of Applied Computational Sciences in November 2021, will soon put to use via the School’s high-performance computing network that features two supercomputers.
“I had to do this experiment on a smaller scale as a proof of concept. I look forward to further exploring this path at Meharry with the high-performance computing network,” says Dr. Sarker.