Paper Highlight: “Roadmap to Pharmaceutically Relevant Reactivity Models Leveraging High-Throughput Experimentation”
Click here for the paper.
Who: Jessica Xu, Dipannita Kalyani, and other folks from MIT (Buchwald, Jensen) and Merck (Struble, Dreher, Krska).
Why: Even today, there aren’t great HTE datasets to use for ML. What’s the best way to generate that data and make reactivity predictions?
How: They designed a space of 15M possible Pd-catalyzed C-N coupling reactions and 1K conditions. They screened a small set of conditions over representative substrates, picked tBuPhCPhos/LiOTMS, and used those conditions to generate a training set (n=2000).
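To get a feel for the scale, here's a toy enumeration of a combinatorial substrate × condition space. The 3,000 × 5,000 breakdown of the 15M substrate pairs and the placeholder names are my own assumptions for illustration, not the paper's actual libraries.

```python
# Toy enumeration of a combinatorial reaction space; counts are placeholders.
from itertools import product

aryl_halides = [f"ArX_{i}" for i in range(3000)]   # placeholder aryl halide library
amines = [f"amine_{i}" for i in range(5000)]       # placeholder amine library
conditions = [f"cond_{i}" for i in range(1000)]    # ~1K ligand/base/solvent combos

# Full design space: every (ArX, amine) pair, each runnable under any condition.
n_pairs = len(aryl_halides) * len(amines)          # 15,000,000 substrate pairs
print(f"{n_pairs:,} substrate pairs x {len(conditions):,} condition sets")

# Only a tiny slice ever gets run: e.g., a pilot screen of a few representative
# substrates across many conditions to pick one general condition for the
# training-set campaign.
pilot = list(product(aryl_halides[:4], amines[:4], conditions[:60]))
print(f"{len(pilot)} pilot reactions")  # 960 here
```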
Analysis: LC area percent (LCAP) was used as the readout. Substrate quality was assessed by reactivity on proxy reactions (smart!). Different train-test split strategies were examined: random splitting was compared to a “dimensionality reduction split,” which trains on the union of (all amines + one ArX) and (one amine + all ArX) and predicts the rest of the substrate matrix.
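To make the split comparison concrete, here is a minimal sketch (not the authors' code) of the two strategies on a reaction table; the column names "ArX", "amine", and "LCAP" are my assumptions about the data layout.

```python
# Minimal sketch of the two split strategies on a DataFrame of reactions.
import pandas as pd
from sklearn.model_selection import train_test_split

def random_split(df, test_frac=0.2, seed=0):
    """Plain random split over individual reactions."""
    return train_test_split(df, test_size=test_frac, random_state=seed)

def dimensionality_reduction_split(df, anchor_arx, anchor_amine):
    """Train on (all amines x one ArX) plus (one amine x all ArX);
    test on everything else in the substrate matrix."""
    in_train = (df["ArX"] == anchor_arx) | (df["amine"] == anchor_amine)
    return df[in_train], df[~in_train]

# Toy usage:
df = pd.DataFrame({
    "ArX":   ["ArX_1", "ArX_1", "ArX_2", "ArX_2"],
    "amine": ["am_1",  "am_2",  "am_1",  "am_2"],
    "LCAP":  [12.0, 55.0, 3.0, 40.0],
})
train, test = dimensionality_reduction_split(df, "ArX_1", "am_1")
```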
What was Found: Despite the pre-optimization of conditions, most substrates gave LCAPs of <20%. Reproducibility was moderate (±22% LCAP, which is to be expected for nanomole-scale HTE). Random splitting worked the best, with an R^2 of ~0.6; the other splits didn't work as well. (It's not obvious what the RMSEs were.) Neural networks and random forests looked better than linear regression.
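For anyone wanting to run this kind of comparison on their own data, here is a generic sketch of the workflow (linear model vs. random forest vs. small neural net under a random split). The features and labels below are random placeholders rather than the paper's descriptors, so the printed numbers are meaningless; only the structure is the point.

```python
# Generic model-comparison sketch on precomputed reaction features X and LCAP labels y.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 64))        # placeholder reaction descriptors
y = rng.uniform(0, 100, size=2000)     # placeholder LCAP values (0-100%)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "linear": Ridge(),
    "random forest": RandomForestRegressor(n_estimators=300, random_state=0),
    "neural net": MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=2000, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    # Reporting RMSE alongside R^2 makes results easier to compare across papers.
    print(f"{name}: R^2 = {r2_score(y_te, pred):.2f}, RMSE = {rmse:.1f} LCAP")
```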
Comments: This paper tackles two hard problems: making a high-quality dataset and modeling reactivity across a wide range of substrates. Doing chemistry at this scale is challenging because you have to design the space properly, ensure that the reagents are good, deal with the inherent noise in HTE, and interpret a lot of HPLC traces from different reactions. Analytically speaking, the use of LCAP as the readout is a concession to the impossibility of measuring response factors for every unique product and to the immaturity of constant-response methods like charged aerosol detection (CAD). Running thousands of reactions and determining what happened is a huge amount of work, especially since neither the HPLCs nor their analysis software was really designed for this task. We still have a long way to go before we can generate high-quality reactivity datasets with the same scale and efficiency as, say, genome sequencing data.
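For readers unfamiliar with the metric: LCAP is just the product peak's share of the total integrated area, which implicitly assumes equal detector response for every component. A toy calculation with made-up peak areas:

```python
# Toy LCAP calculation: product peak area as a percentage of total integrated
# area in the chromatogram. Peak areas below are invented for illustration.
peak_areas = {"product": 412.0, "ArX": 130.0, "amine": 95.0, "side product": 60.0}
lcap = 100.0 * peak_areas["product"] / sum(peak_areas.values())
print(f"LCAP = {lcap:.1f}%")  # ~59.1% here

# Without per-compound response factors, equal response is implicitly assumed,
# which is exactly the concession discussed above.
```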
In terms of splitting strategies, one might imagine that a stratified split in which every aryl halide and amine appears at least once in the training set would be a good idea. However, the fact that random splitting works best suggests that similar substrates behave similarly: with a random split, most test substrates have close analogs in the training set for the model to interpolate from. With respect to regression/classification methods, I think we'll need bigger datasets across more reactions to really know which method is best.
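For what it's worth, the stratified idea is easy to sketch (assuming the same toy table layout as above); whether it actually helps is the empirical question the paper answers in favor of random splitting.

```python
# Sketch of a coverage-stratified split: guarantee every ArX and every amine
# appears at least once in the training set, then top up with random reactions.
# Column names ("ArX", "amine") are assumed, as in the earlier sketches.
import pandas as pd

def stratified_coverage_split(df, test_frac=0.2, seed=0):
    shuffled = df.sample(frac=1.0, random_state=seed)
    seen_arx, seen_amine, train_idx = set(), set(), []
    for idx, row in shuffled.iterrows():
        # Take any reaction whose ArX or amine hasn't been covered yet.
        if row["ArX"] not in seen_arx or row["amine"] not in seen_amine:
            train_idx.append(idx)
            seen_arx.add(row["ArX"])
            seen_amine.add(row["amine"])
    target = int(len(df) * (1 - test_frac))
    extras = [i for i in shuffled.index if i not in train_idx][: max(0, target - len(train_idx))]
    train = df.loc[train_idx + extras]
    test = df.drop(train.index)
    return train, test

# Toy usage:
df = pd.DataFrame({"ArX": ["A", "A", "B", "B", "C"],
                   "amine": ["x", "y", "x", "y", "y"],
                   "LCAP": [10.0, 60.0, 5.0, 35.0, 20.0]})
train, test = stratified_coverage_split(df, test_frac=0.4)
```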
How do we go from running singleton experiments to doing science at scale? IMO, it will take better automation, better analytical instrumentation, better software, and a new data-driven culture. We need more studies like this one.