How is Machine Learning helping to develop TB drugs?
Many biologists use machine learning (ML) as a computational tool to analyze massive amounts of data, helping them to identify potential new drugs. MIT researchers have now integrated a new feature into these types of machine learning algorithms, improving their ability to make predictions.
Using this new approach, which allows computer models to account for uncertainty in the data they are analyzing, the MIT team identified several promising compounds that target a protein required by the bacteria that cause tuberculosis (TB).
Although computer scientists have used this technique before, it has not taken off in biology. “It could also prove useful in protein design and many other fields of biology,” says Bonnie Berger, the Simons Professor of Mathematics and head of the Computation and Biology group in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).
“This method is part of a known subfield of machine learning, but people have not brought it to biology,” states Berger. “It is a paradigm shift and is absolutely how biological exploration should be done.”
Berger and Bryan Bryson, an assistant professor of biological engineering at MIT and a member of the Ragon Institute of MGH, MIT, and Harvard, are the senior authors of the study, which appears today in Cell Systems. Brian Hie, an MIT graduate student, is the paper’s lead author.
ML is a type of computer modeling in which an algorithm learns to make predictions based on data that it has already seen. In the past few years, biologists have begun using machine learning to scour vast databases of potential drug compounds to find molecules that interact with specific targets.
One limitation of this technique is that the algorithms perform well only when the data they’re examining is similar to the data they were trained on. They are not very good at evaluating molecules that are very different from those they have already seen.
To overcome that obstacle, the researchers used a technique called a Gaussian process to assign uncertainty values to the data that the algorithms are trained on. That way, when the models analyze the training data, they also take into account how reliable those predictions are.
For instance, if the data going into the model describe how strongly a particular molecule binds to a target protein, along with the uncertainty of those values, the model can use that information to make predictions for protein-target interactions that it hasn’t seen before. The model also estimates the certainty of its own predictions. When analyzing new data, the model’s predictions may have lower certainty for molecules that are very different from the training data. Researchers can use that information to decide which molecules to test experimentally.
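To make the idea concrete, here is a minimal sketch of Gaussian process regression in plain Python. It is not the model from the study: the one-dimensional "molecule features", the affinity values, the squared-exponential kernel, the noise level, and the names rbf and gp_predict are all assumptions chosen for illustration. The only point is that the predicted uncertainty grows for inputs far from the training data.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two feature vectors (assumed kernel)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_lower_transposed(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(X_train, y_train, x_new, noise=1e-2):
    """Return the GP posterior mean and standard deviation at x_new."""
    n = len(X_train)
    K = [[rbf(X_train[i], X_train[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, y_train))  # K^-1 y
    k_star = [rbf(x, x_new) for x in X_train]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    variance = rbf(x_new, x_new) - sum(t * t for t in v)
    return mean, math.sqrt(max(variance, 0.0))

# Toy data: hypothetical 1-D features and binding affinities.
X_train = [[0.0], [0.5], [1.0], [1.5]]
y_train = [0.1, 0.6, 0.9, 0.7]

near_mean, near_std = gp_predict(X_train, y_train, [0.75])  # similar to training data
far_mean, far_std = gp_predict(X_train, y_train, [10.0])    # very different input

print("near training data:", near_mean, near_std)
print("far from training data:", far_mean, far_std)
```

In this toy setup, the input that resembles the training data gets a small standard deviation, while the distant input falls back to the prior (mean near zero, standard deviation near one), which is the signal a researcher could use to treat that prediction with caution.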
Another advantage of this method is that the algorithm requires only a small amount of training data. In the study, the MIT team trained the model with a dataset of 72 small molecules and their interactions with more than 400 proteins called protein kinases. They could then use the algorithm to analyze roughly 11,000 small molecules taken from the ZINC database, a publicly available repository that contains millions of chemical compounds. Many of these molecules were very different from those in the training data.
Using this approach, the researchers identified molecules with very strong predicted binding affinities for the protein kinases they put into the model. These included three human kinases and one kinase called PknB, found in Mycobacterium tuberculosis. This kinase is critical for the bacteria to survive, but it is not targeted by any frontline TB antibiotics.
The researchers also used the same training data to train a traditional ML algorithm that does not incorporate uncertainty, and had it analyze the same 11,000-molecule library. Hie says, “The model gets horribly confused without uncertainty, and it proposes peculiar chemical structures as interacting with the kinases.”
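One simple way to act on uncertainty estimates is to rank candidates by predicted affinity minus a penalty proportional to the uncertainty. The sketch below assumes that scoring rule for illustration. The molecule names, the numbers, and the beta weight are hypothetical and are not taken from the study.

```python
# Hypothetical candidates: (name, predicted binding affinity, predictive std).
candidates = [
    ("mol_A", 0.80, 0.05),
    ("mol_B", 0.95, 0.60),  # highest prediction, but very uncertain
    ("mol_C", 0.70, 0.10),
]

def rank(cands, beta):
    """Sort by mean - beta * std, highest first; beta=0 ignores uncertainty."""
    ordered = sorted(cands, key=lambda c: c[1] - beta * c[2], reverse=True)
    return [name for name, _mean, _std in ordered]

point_only = rank(candidates, beta=0.0)         # trusts the prediction alone
uncertainty_aware = rank(candidates, beta=1.0)  # penalizes uncertain predictions

print(point_only)
print(uncertainty_aware)
```

With beta set to 0, the uncertain mol_B is ranked first on its raw prediction. With beta set to 1, it drops to last, behind the two candidates whose predictions the model is more confident about.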
A Good Starting Point
Another significant element of this method is that once the researchers get additional experimental data, they can add it to the model and retrain it, further improving the predictions. Even a small amount of data can help the model get better.
Hie says, “You don’t really need enormous datasets on iteration. You can retrain the model with maybe ten new examples, which a biologist can quickly generate.”
Bryson says this study is the first to propose new molecules that can target PknB, which should give drug developers a good starting point for drugs that target the kinase. “We’ve provided them with some new leads beyond what has been already published.”
The MIT researchers also showed that they could use the same type of machine learning to boost the fluorescent output of a green fluorescent protein, which is commonly used to label molecules inside living cells. Berger is now using the approach to analyze mutations that drive tumor development. “It could also be used for other types of biological studies,” she says.