All pharmaceuticals are synthesized by assembling simpler chemical starting materials through a precise sequence of chemical reactions. The design of this assembly process, known as synthesis planning, is among the most intellectually demanding tasks in science, particularly for complex molecules such as modern pharmaceuticals.
Expert synthetic chemists, when planning routes to complex drug molecules, make strategic decisions that influence every subsequent step. These include determining which ring structures to construct first, when to protect sensitive chemical groups from unwanted reactions, how to sequence transformations to avoid interference, and which reactions are robust enough for scale-up from laboratory to industrial production. Such judgment calls are developed through extensive training, intuition, and experimental experience, representing reasoning that textbooks can describe but not fully encode.
Computer-aided synthesis planning has existed for decades. However, a persistent gap remains between the computational capabilities of algorithms and the strategic reasoning employed by expert chemists. While existing systems can identify synthetic routes, they cannot evaluate their strategic quality.
Recent advancements have begun to address this gap.
A 2026 paper published in Matter by researchers from EPFL (Switzerland) and b12 Labs (San Francisco) demonstrates that large language models (LLMs), the same family of AI systems underlying tools such as ChatGPT and Claude, can bridge this gap. Rather than generating chemical structures themselves, a task at which LLMs perform poorly, these models act as sophisticated reasoning engines that evaluate, rank, and guide chemical search algorithms toward expert-quality solutions. The proposed framework, Synthegy, aligns with independent expert chemists 71% of the time, a rate comparable to agreement among experts themselves.

The Problem With Existing Chemistry AI
Understanding the significance of Synthegy requires consideration of prior developments in the field.
The history of computer-assisted chemistry goes back to the 1960s, when E.J. Corey developed the first computational retrosynthesis system. Retrosynthesis, the process of working backward from a target molecule to identify simpler building blocks that could produce it, is the foundation of synthetic planning. Modern systems have become remarkably sophisticated. Tools like AiZynthFinder, Reaxys, and Synthia can search through vast databases of known reactions, propose multi-step routes to complex molecules, and verify that starting materials are commercially available.
What they cannot do is reason strategically. These systems can identify synthetic routes, but they cannot assess whether those routes are strategically sensible: whether a protecting group scheme is unnecessarily complex, whether the sequence of ring formations is logically ordered, or whether a particular reaction step poses a high risk of producing unwanted side products that could contaminate the final compound. Large language models, by contrast, have a valuable ability to understand chemical concepts. They can discuss functional groups, explain reaction mechanisms, and reason about strategic elements of synthesis in natural language. Their limitations lie elsewhere: they are poor at generating valid chemical representations, the precise molecular notation (SMILES strings) that computational chemistry requires. When asked to draw or output a molecule’s structure, they frequently produce chemically nonsensical results.
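To make the SMILES problem concrete, the sketch below implements a crude syntactic sanity check of the kind that catches many malformed LLM outputs. This is purely illustrative and not real chemistry validation (toolkits such as RDKit perform full valence and aromaticity checks); it only detects gross structural errors like unbalanced branches and unpaired ring-closure digits.

```python
def looks_like_valid_smiles(smiles: str) -> bool:
    """Crude syntax check for a SMILES string: balanced '(' branches
    and paired ring-closure digits. Illustrative only; a real chemistry
    toolkit checks far more (valence, aromaticity, brackets, %nn rings)."""
    depth = 0            # currently open '(' branches
    open_rings = set()   # ring-closure digits awaiting their partner
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # closing a branch that was never opened
                return False
        elif ch.isdigit():
            # ring-closure digits must appear in pairs (open, then close)
            if ch in open_rings:
                open_rings.remove(ch)
            else:
                open_rings.add(ch)
    return depth == 0 and not open_rings

print(looks_like_valid_smiles("c1ccccc1"))   # benzene: well-formed → True
print(looks_like_valid_smiles("C1CC(C"))     # unclosed ring and branch → False
```

Even a check this shallow rejects a surprising share of naively generated strings, which is why Synthegy keeps structure generation out of the LLM's hands entirely.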
The EPFL team recognized the importance of leveraging LLMs for tasks aligned with their strengths, rather than assigning them tasks for which they are ill-suited.
Synthegy: The LLM as Chemical Judge, Not Chemical Generator
The Synthegy framework works through a clean division of labor. Traditional retrosynthesis algorithms do what they do best: systematically search chemical space and generate large pools of candidate synthesis routes. The LLM does what it does best: read those routes, reason about them strategically, score them against whatever criteria the chemist specifies in plain English, and rank the candidates.
A chemist using Synthegy might type something like: “Construct the pyrimidine ring in the early stages of the synthesis, and source the piperidine ring from commercially available starting materials.” Synthegy takes that natural-language specification, runs it alongside candidate routes generated by a standard synthesis engine, and returns a ranked list of routes with detailed written rationales explaining why each route does or doesn’t satisfy the strategic criteria.
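The division of labor described above can be sketched in a few lines. Everything here is illustrative: `score_route_with_llm` is a stand-in for a real LLM call, and the names, data structures, and toy scoring heuristic are assumptions, not the paper's actual implementation.

```python
# Sketch of a Synthegy-style pipeline: a search engine proposes candidate
# routes; an LLM (stubbed out here) scores each route against the chemist's
# plain-English criteria and writes a rationale; routes are returned ranked.
from dataclasses import dataclass

@dataclass
class Route:
    steps: list[str]       # reaction steps, target back to starting materials
    rationale: str = ""    # written explanation filled in by the scorer

def score_route_with_llm(route: Route, criteria: str) -> float:
    """Stand-in for an LLM call that reads a route and the chemist's
    criteria, returns a score in [0, 1], and records a short rationale."""
    # Toy heuristic: shorter routes score higher. A real system would send
    # the route and criteria to a model and parse its structured judgment.
    route.rationale = f"{len(route.steps)} steps; judged against: {criteria!r}"
    return 1.0 / len(route.steps)

def rank_routes(candidates: list[Route], criteria: str) -> list[Route]:
    """Return candidate routes ranked best-first by the scorer."""
    return sorted(candidates,
                  key=lambda r: score_route_with_llm(r, criteria),
                  reverse=True)

criteria = ("Construct the pyrimidine ring early; source the piperidine "
            "ring from commercially available starting materials.")
routes = [Route(steps=["amide coupling", "SNAr", "ring closure"]),
          Route(steps=["reductive amination", "SNAr"])]
best = rank_routes(routes, criteria)[0]
print(len(best.steps))   # prints 2: the shorter route wins under the toy scorer
```

The key design point survives the simplification: the search engine owns structure generation, the language model owns judgment, and the chemist's criteria enter the system as ordinary prose.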
Within this framework, the LLM functions as a chemistry evaluator rather than a generator, a role that aligns closely with its reasoning capabilities.
To test this, the researchers built a manually curated benchmark of molecule-prompt pairs spanning a range of synthesis complexity. They also generated an extended dataset of over 160 test cases semi-automatically. Across both evaluations, the performance rankings were clear and consistent: the most capable current AI models, particularly Gemini-2.5-Pro, outperformed all others, while smaller models performed barely better than random chance.
This finding has significant implications: chemical reasoning at this level of sophistication appears to be an emergent capability, one that manifests only above a certain threshold of model scale. These results suggest that recent improvements in large AI models extend beyond general language proficiency to encompass domain-specific strategic reasoning.
Validated by 36 Expert Chemists
While benchmark scores provide a quantitative assessment, more compelling validation was obtained through a double-blind study involving 36 independent chemists of varying seniority, including PhD students, postdoctoral researchers, research scientists, and professors.
These experts were shown synthesis targets and pairs of candidate routes and asked to select the more feasible route, using the same feasibility criteria given to Synthegy. They never saw Synthegy’s outputs or scores. Their judgments were collected independently over a one-month period, yielding 368 individual evaluations.
The result: expert chemists agreed with Synthegy’s assessments 71.2% of the time overall. For 84.6% of the evaluated route pairs, at least half of the experts agreed with Synthegy’s choice. More senior experts (professors and research scientists) showed higher agreement rates than junior researchers, a finding consistent with the idea that Synthegy’s reasoning more closely mirrors the judgment of experienced practitioners.
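The study's two headline numbers measure different things, and the distinction is easy to miss. The sketch below computes both from made-up data (not the paper's dataset): overall agreement pools all individual evaluations, while majority agreement asks, per route pair, whether at least half the experts sided with Synthegy.

```python
# Two agreement metrics over expert evaluations -- illustrative data only.
# Each record: (route_pair_id, did_this_expert_agree_with_the_model).
from collections import defaultdict

evaluations = [
    ("pair1", True), ("pair1", True), ("pair1", False),
    ("pair2", True), ("pair2", False), ("pair2", False),
    ("pair3", True), ("pair3", True),
]

# Metric 1: overall agreement across all individual evaluations.
overall = sum(agreed for _, agreed in evaluations) / len(evaluations)

# Metric 2: fraction of route pairs where at least half the experts agreed.
by_pair = defaultdict(list)
for pair_id, agreed in evaluations:
    by_pair[pair_id].append(agreed)
majority = sum(sum(votes) >= len(votes) / 2
               for votes in by_pair.values()) / len(by_pair)

print(round(overall, 3))    # 5/8  -> 0.625
print(round(majority, 3))   # pairs 1 and 3 of 3 -> 0.667
```

With this toy data the majority metric exceeds the overall rate, the same pattern as the paper's 84.6% versus 71.2%: individual dissents get absorbed when a pair's verdict is decided by majority.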
Importantly, the study also revealed that experts frequently disagreed with one another on the most challenging cases, at rates comparable to their disagreements with Synthegy. Thus, Synthegy’s assessments fall within the normal range of expert-level judgment variance. While not a superhuman oracle, Synthegy performs at the level of a competent, experienced chemist.
Teaching AI to Trace Reaction Mechanisms
The second major application presented in the paper is even more technically ambitious: employing LLMs to elucidate the mechanisms of chemical reactions, specifically the step-by-step movement of electrons that transforms one molecule into another.
Understanding a reaction mechanism is fundamental to optimizing chemistry. If you know why a reaction produces the product it does, you can predict how it will behave under different conditions, modify it to work with new molecules, and understand why it sometimes fails. But mechanisms are hard to determine, and computational approaches have struggled to identify them in a generalizable way.
The EPFL team’s approach frames mechanism elucidation as a guided search problem. The system defines a basic set of elementary chemical moves, essentially, the fundamental ways electrons can move between atoms (attack moves and ionization moves). These elementary moves form the building blocks of all reaction mechanisms. The LLM evaluates proposed sequences of these moves at each step, scoring how chemically reasonable each possible next action is, and the search algorithm uses those scores to navigate toward a plausible complete mechanism.
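The guided-search loop can be sketched as follows. Everything here is a toy stand-in: the move generator, transition function, and scores are hypothetical lookup tables, where the real system derives attack and ionization moves from the molecular graph and obtains scores from an LLM.

```python
# LLM-guided mechanism search, greedily following the highest-scoring
# elementary move at each step. All tables below are illustrative stubs.

def legal_moves(state: str) -> list[str]:
    """Stand-in move generator; a real one enumerates attack and
    ionization moves from the molecular graph of `state`."""
    table = {"start": ["protonate carbonyl", "ionize C-Br"],
             "step1": ["nucleophilic attack", "deprotonate"]}
    return table.get(state, [])

def score_move(state: str, move: str) -> float:
    """Stand-in for the LLM judging how chemically plausible a move is."""
    toy_scores = {("start", "protonate carbonyl"): 0.9,
                  ("start", "ionize C-Br"): 0.3,
                  ("step1", "nucleophilic attack"): 0.8,
                  ("step1", "deprotonate"): 0.4}
    return toy_scores.get((state, move), 0.0)

def apply_move(state: str, move: str) -> str:
    # Toy transition; a real system would update the molecular graph.
    return {"start": "step1", "step1": "done"}[state]

def greedy_mechanism(state: str = "start") -> list[str]:
    """Extend the mechanism with the best-scoring move until none remain."""
    mechanism = []
    while (moves := legal_moves(state)):
        best = max(moves, key=lambda m: score_move(state, m))
        mechanism.append(best)
        state = apply_move(state, best)
    return mechanism

print(greedy_mechanism())  # ['protonate carbonyl', 'nucleophilic attack']
```

A production system would use beam or best-first search rather than pure greed, but the core idea is the same: the LLM never proposes moves, it only scores candidates the symbolic layer guarantees are chemically legal.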
Tested across twelve diverse reaction types from simple nucleophilic additions to complex multi-step processes, the best models achieved near-perfect performance on simpler reactions and meaningfully outperformed random guessing even on the most complex ones. Providing additional context to the LLM (such as expert intuitions about the mechanism or experimental data on reaction conditions) substantially improved performance.
The system can incorporate any information expressible in text, such as experimental kinetic data, known reaction conditions, or prior mechanistic hypotheses, thereby achieving a level of flexibility unattainable with hard-coded algorithms.
Why This Matters Beyond the Laboratory
The implications of this research extend beyond the confines of academic chemistry.
Drug discovery is one of the most expensive and time-consuming endeavors in human medicine. A typical drug takes 10–15 years and over $2 billion to bring from discovery to market. A substantial portion of that cost comes from the iterative, trial-and-error process of developing a viable synthesis, finding a manufacturable route to a complex molecule that works reliably at scale, without prohibitive side effects or costs.
Tools that enable chemists to rapidly screen hundreds of candidate synthesis routes against strategic criteria, including feasibility, cost, scalability, environmental impact, and regulatory requirements, could significantly reduce portions of the drug development timeline. The natural language interface is particularly important, as it allows chemists to specify strategic requirements without programming, thereby making advanced computational tools accessible to practitioners without coding expertise.
The researchers note that the capabilities underlying their framework are improving rapidly. The type of chemical reasoning demonstrated by Synthegy was largely absent from AI systems prior to mid-2024; within two years, performance has reached a level comparable to expert human judgment. The team anticipates that future improvements will include direct, real-time guidance of search algorithms, rather than post hoc ranking, and the incorporation of experimental feedback in a closed loop, resulting in an AI system that learns from the outcomes of experiments it has helped design.
The Limits of the Current System
The researchers are transparent regarding Synthegy’s current limitations. It functions as a filter and ranker of routes generated by other tools, rather than as a synthesis planning engine itself. Its performance declines on the longest routes, those exceeding 50 reaction steps, where even the most advanced models struggle to maintain strategic context across numerous sequential decisions. Additionally, some models exhibit biases toward short, optimistic assessments that may overlook critical flaws.
And the system’s quality is constrained by the quality of the routes produced by the underlying retrosynthesis engines. If the search algorithm doesn’t include a genuinely good route in its candidate pool, Synthegy cannot select what isn’t there.
These limitations describe the current system and may not be permanent. Given the rapid pace of capability improvement documented in the paper, from near-zero performance in early 2024 to expert-level agreement within 18 months, it would be premature to treat these constraints as fixed.
The Bottom Line
Chemistry has long been regarded as both a science and an art. The artistic aspect, the strategic intuition that distinguishes a competent chemist from an exceptional one, has represented the final frontier of computational chemistry, remaining resistant to automation because it relies on flexible, context-based reasoning that traditional algorithms lack.
When employed as strategic reasoning engines rather than molecular generators, LLMs appear to have crossed a significant threshold in computational chemistry. The evidence provided by 36 independent experts is compelling. Chemistry AI is evolving beyond speed, beginning to demonstrate elements of expert-level judgment.
References
Bran, A.M., Neukomm, T.A., Armstrong, D., Jončev, Z., & Schwaller, P. (2026). Chemical reasoning in LLMs unlocks strategy-aware synthesis planning and reaction mechanism elucidation. Matter, 9, 102812. https://doi.org/10.1016/j.matt.2026.102812