Machine Learning is Hot!
I went to a conference at Amgen on machine learning (ML) in chemistry recently. ML techniques have proven to be very powerful for tasks like image recognition, speech processing, and robot control. Below, I highlight some of the current lines of research and offer some ideas on where the field might be going. I think we should all pay attention to this fast-moving area, or risk being left behind.
Synthetic Route Planning
There has been rapid and impressive progress on the prediction of synthetic routes. Thanks to collaborations with experts in computer science and cheminformatics, algorithms can now traverse giant databases of literature reactions to design synthetic routes to simple pharma-like molecules.
This technology will enable non-experts to make simple bonds, perhaps at contract research organizations or in non-chemistry-focused labs. It may make the preliminary identification of the optimal conditions for a given transformation less time-consuming for the busy chemist.
However, since routine synthetic planning is an extremely minor component of both academic and industrial research, I think we have a ways to go before route planning algorithms will truly accelerate basic research or drug discovery. I think synthetic planning programs may become quite interesting if they learn to tackle much more complex molecules, which currently remain the domain of a small number of experts.
Some claim that synthetic planning algorithms will be useful for programming robots that can perform general-purpose synthesis. Unfortunately, the development of these robots still faces significant challenges, so it will be quite some time before one can order up a molecule by pushing a button.
Quantitative Structure-Activity Relationships (QSAR)
There is a rich history in the pharmaceutical industry of using statistical methods to model the relationship between structure and compound properties. Traditionally, techniques that work well on relatively small datasets (10¹–10³ points), like linear regression, have been used. Now, there are efforts to use more sophisticated methods like random forest regression and deep learning.
Unfortunately, there seem to be a lot of challenges in this area. One fundamental problem is that more flexible ML techniques effectively have many more adjustable weights, and therefore require much more data (10⁶ points or more) than is typically available. As a result, the choice of compound properties is strongly limited. A classic choice is logP, for which the results are still unfortunately rather mediocre. Worse, deep learning only seems to be modestly better than “vanilla” methods, even in the hands of experts. For properties that are truly interesting like binding affinity, the results are not great. Future work will probably involve both advances in algorithms (e.g., better ways to input chemical data into neural networks) and experimental design (e.g., methods for gathering millions of data points about a chemical space of interest).
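To make the small-data problem concrete, here is a toy sketch of what happens when a model has more adjustable weights than training compounds. Everything is synthetic — the “fingerprints” and “property” are invented, not from any real assay — and ordinary least squares stands in for a flexible, many-parameter model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 50-bit "fingerprints" but only 30 training compounds,
# mimicking the small-data regime typical of QSAR.
n_train, n_test, n_bits = 30, 200, 50
X_train = rng.integers(0, 2, (n_train, n_bits)).astype(float)
X_test = rng.integers(0, 2, (n_test, n_bits)).astype(float)

# The "true" property depends on only 5 bits, plus experimental noise.
w_true = np.zeros(n_bits)
w_true[:5] = [1.2, -0.8, 0.5, 0.9, -1.1]
y_train = X_train @ w_true + rng.normal(0, 0.1, n_train)
y_test = X_test @ w_true + rng.normal(0, 0.1, n_test)

# Unregularized least squares: with more weights than data points,
# the fit interpolates the training set but generalizes poorly.
w_ols = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

# Ridge regression shrinks the weights, trading bias for variance.
lam = 1.0
w_ridge = np.linalg.solve(X_train.T @ X_train + lam * np.eye(n_bits),
                          X_train.T @ y_train)

def rmse(w, X, y):
    return float(np.sqrt(np.mean((X @ w - y) ** 2)))

print("OLS   train/test RMSE:", rmse(w_ols, X_train, y_train),
      rmse(w_ols, X_test, y_test))
print("Ridge train/test RMSE:", rmse(w_ridge, X_train, y_train),
      rmse(w_ridge, X_test, y_test))
```

The unregularized fit drives the training error to essentially zero while the test error stays large — exactly the overfitting that bites when 10⁶-parameter models meet 10³-point datasets.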
I want to mention a related area of research: the prediction of intrinsic molecular properties from structure. For example, one might try to predict electronic energy or electron density from structure. The dream is that we might be able to get very high-quality energies at essentially no computational cost by using a neural network. For example, one might train such a network on a set of perturbed structures and their coupled-cluster-quality energies, and then use this “forcefield” to drive molecular dynamics simulations. So far, there have been successes in predicting DFT energies from structures.
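As a cartoon of this “learned forcefield” idea, here is a toy sketch in one dimension: a polynomial fit stands in for the neural network, a Morse-like curve stands in for the expensive reference method, and the fitted surrogate then drives a short velocity-Verlet trajectory. Everything here is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for "high-level" energies: a 1D Morse-like potential. In the
# real application these would be coupled-cluster energies of perturbed
# molecular geometries; here everything is synthetic.
def reference_energy(x):
    return (1.0 - np.exp(-1.5 * (x - 1.0))) ** 2

# "Training set": perturbed structures with their reference energies.
x_train = rng.uniform(0.5, 2.0, 200)
e_train = reference_energy(x_train)

# Surrogate model: a polynomial fit stands in for the neural network.
surrogate = np.poly1d(np.polyfit(x_train, e_train, deg=8))
force = -surrogate.deriv()          # F = -dE/dx

# Drive a short velocity-Verlet MD trajectory on the learned surface.
x, v, dt, m = 1.3, 0.0, 0.01, 1.0
for _ in range(500):
    a = force(x) / m
    x += v * dt + 0.5 * a * dt * dt
    a_new = force(x) / m
    v += 0.5 * (a + a_new) * dt

print("final position:", x)
print("surrogate error at minimum:", abs(surrogate(1.0) - reference_energy(1.0)))
```

The trajectory oscillates in the well of the learned surface at the cost of polynomial evaluations rather than fresh electronic-structure calculations — the whole point of the approach, though real molecules need far more expressive models than a degree-8 polynomial.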
Designing New Compounds
The ultimate goal of QSAR is to design new compounds. However, one problem is that molecules are fundamentally made of discrete units (viz., atoms), while mathematical techniques for optimization are designed for continuous quantities. As a result, QSAR models cannot easily be used for design. For example, in optimizing the binding affinity of a molecule, one might have a reasonable linear fit to a one-change-at-a-time series like methyl, ethyl, propyl, etc. However, an entirely different line might be necessary for the same series on a different scaffold, forcing global QSAR models to use categorical variables to describe different scaffolds.
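To illustrate the categorical-variable workaround, here is a toy multi-linear regression with a one-hot scaffold column. All of the numbers are invented for illustration; real binding data would come from assays:

```python
import numpy as np

# Two hypothetical scaffolds ("A" and "B"), each with a homologous
# series methyl/ethyl/propyl/butyl (chain length 1..4).
data = [
    # (scaffold, chain_length, pIC50)
    ("A", 1, 5.1), ("A", 2, 5.6), ("A", 3, 6.2), ("A", 4, 6.6),
    ("B", 1, 6.9), ("B", 2, 7.1), ("B", 3, 7.4), ("B", 4, 7.7),
]

# Design matrix: [1, chain_length, is_scaffold_B]. The one-hot scaffold
# column is the categorical variable a global QSAR model is forced to use.
X = np.array([[1.0, n, 1.0 if s == "B" else 0.0] for s, n, _ in data])
y = np.array([p for _, _, p in data])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope_per_CH2, scaffold_offset = beta
print(f"intercept:       {intercept:.2f}")
print(f"slope per CH2:   {slope_per_CH2:.2f}")
print(f"scaffold offset: {scaffold_offset:.2f}")
```

Note the limitation this model bakes in: the one-hot column shifts scaffold B up or down by a constant, but forces both scaffolds to share a single slope per CH2 — when the two series really have different slopes, the model needs an interaction term, and the bookkeeping only gets worse as the chemistry gets less homologous.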
Sigman and co-workers have pioneered the use of multi-linear regression to create multi-dimensional free energy relationships to handle this situation. (More recently, his group has extended this approach to include non-energy-related independent variables like vibrational frequencies.) A very different approach, drawn from ML, is to use “latent space optimization.” The notion, recently proposed by Aspuru-Guzik and co-workers, is first to convert discrete molecular representations, like molecular graphs, to a continuous latent space. Optimization is performed inside this space, and then the optimal point is converted back into a real molecule at the end. This is a very cool idea, but I’m unclear on what the identity of the latent space should be. According to some unpublished work that Regina Barzilay presented, some latent spaces may have “cliffs” inside them: broad regions that map to one molecule being adjacent to other broad regions that map to entirely different molecules. I expect that solving such problems will be an interesting area of research for many years.
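To make the encode–optimize–decode loop concrete, here is a deliberately cartoonish sketch: the “molecules” are just alkane chain lengths, the latent space is the real line, and the decoder’s rounding step is exactly where cliffs come from. None of this resembles a real autoencoder; it only shows the shape of the loop:

```python
# Toy "latent space optimization": discrete molecules are alkane chain
# lengths 1..10, the encoder maps length n to the real number n, and the
# decoder rounds back to the nearest valid length.
MOLECULES = list(range(1, 11))

def decode(z):
    # Piecewise-constant decoder: broad latent regions map to one molecule,
    # with "cliffs" at the half-integer boundaries between them.
    return min(MOLECULES, key=lambda n: abs(n - z))

def property_score(n):
    # Invented property with an interior optimum at chain length 6.
    return -(n - 6) ** 2

def surrogate(z):
    # Smooth stand-in for the property, defined on the latent space,
    # so that continuous optimization has a gradient to follow.
    return -(z - 6.0) ** 2

# Gradient ascent on the surrogate, then decode the optimum.
z = 2.0
for _ in range(100):
    grad = -2.0 * (z - 6.0)
    z += 0.1 * grad
best = decode(z)
print("optimized latent point:", z)
print("decoded molecule (chain length):", best)
print("score:", property_score(best))
```

Here the decoder behaves, because neighboring latent points decode to chemically neighboring molecules. The cliff problem is when that breaks down: a tiny step in the latent space crosses a boundary into a region that decodes to a completely unrelated structure, and the smooth surrogate the optimizer followed no longer means anything.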
Given the impressive inroads ML has made in both traditional problems in computer science (e.g., computer vision or game playing) and other fields (e.g., tumor detection and high-energy particle collision analysis), the excitement about the application of ML to chemistry is understandable. However, in my opinion, the set of chemical problems that will benefit the most from ML remains unclear. For example, the application of ML to reaction condition discovery may well prove to be very powerful. One might imagine that robots could one day screen millions of conditions for any given reaction in an unsupervised fashion, thus replacing the “one-at-a-time” grids of data that graduate students so laboriously generate today. Whether this fantasy will come to pass remains to be seen, and it’s entirely possible (and perhaps likely) that good old-fashioned screening and mechanistic work will still be the most efficient method twenty years from now.
Conway’s Law comes from software engineering, but applies equally to ML/chemistry collaborations. The law states that the organization of a software package tends to reflect the underlying organization of the company that produced it. At the moment, the groups producing ML/chemistry work seem to be heavily weighted towards either ML or chemistry. As a result, there are many ML-heavy papers that are unable to identify compelling chemical applications, while there are many chemistry-heavy papers that could be using better analysis methods. In the latter case, there has been recent controversy over the appropriate application of statistical controls to “big data” studies, where coincidental correlations are rather likely. I hope that as more of these studies get done, and the practice of making raw data openly available becomes widespread, rigorous statistical controls will be routinely and easily applied to all studies.
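The multiple-comparisons problem behind those coincidental correlations is easy to demonstrate: correlate a random response against a thousand random descriptors, and dozens will look “significant” until a correction is applied. A toy sketch, using entirely synthetic data and a normal-approximation threshold (a real analysis would use proper p-values, and Bonferroni is only the bluntest of the available corrections):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "big data" study: one response vs. 1000 completely random
# descriptors. With no correction, ~5% of descriptors look
# "significant" at p < 0.05 purely by chance.
n_samples, n_descriptors = 50, 1000
X = rng.normal(size=(n_samples, n_descriptors))
y = rng.normal(size=n_samples)

# Pearson correlation of each descriptor with the response.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
    (Xc ** 2).sum(axis=0) * (yc ** 2).sum())

# Approximate two-sided threshold for |r| at alpha = 0.05
# (normal approximation: |r| > 1.96 / sqrt(n)).
naive_hits = int((np.abs(r) > 1.96 / np.sqrt(n_samples)).sum())

# Bonferroni: divide alpha by the number of tests before thresholding
# (z threshold for alpha = 0.05/1000 is about 3.89).
bonferroni_hits = int((np.abs(r) > 3.89 / np.sqrt(n_samples)).sum())

print("uncorrected 'significant' descriptors:", naive_hits)
print("Bonferroni-corrected:", bonferroni_hits)
```

Every descriptor here is pure noise, yet the uncorrected analysis flags tens of them — which is exactly why statistical controls, and the open raw data needed to re-run them, matter for big-data chemistry studies.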
As an antidote to irrational enthusiasm, and in some cases, dare I say, hysteria, over the application of ML to chemistry, I offer the following procedure: whenever the terms “machine learning” or “artificial intelligence” are applied to a topic of chemical research, substitute “statistics,” and ask yourself whether the project is as exciting. Thus, are you equally excited by “machine learning in drug discovery” as you are by “statistics in drug discovery”? Put another way, I don’t think chemists will be interested in ML for its own sake (though computer scientists might be). There are truly exciting things that will come out of this field in the coming years, but I would suggest that we apply the same skepticism to this field that we apply to others.