For chemists, the AI revolution has yet to happen

EDITORIAL

17 May 2023

More than 20 years ago, the Cancer Research Screensaver harnessed distributed computing power to assess anti-cancer activity in molecules. Credit: James King-Holmes/SPL

Many people are expressing fears that artificial intelligence (AI) has gone too far — or risks doing so. Take Geoffrey Hinton, a prominent figure in AI, who recently resigned from his position at Google, citing the desire to speak out about the technology’s potential risks to society and human well-being.

But against those big-picture concerns, in many areas of science you will hear a different frustration being expressed more quietly: that AI has not yet gone far enough. One of those areas is chemistry, for which machine-learning tools promise a revolution in the way researchers seek and synthesize useful new substances. But a wholesale revolution has yet to happen — because of the lack of data available to feed hungry AI systems.

Any AI system is only as good as the data it is trained on. These systems rely on what are called neural networks, which their developers teach using training data sets that must be large, reliable and free of bias. If chemists want to harness the full potential of generative-AI tools, they need to help to establish such training data sets. More data are needed — both experimental and simulated — including historical data and otherwise obscure knowledge, such as that from unsuccessful experiments. And researchers must ensure that the resulting information is accessible. This task is still very much a work in progress.

Take, for example, AI tools that conduct retrosynthesis. These begin with a chemical structure a chemist wants to make, then work backwards to determine the best starting materials and sequence of reaction steps to make it. AI systems that implement this approach include 3N-MCTS, designed by researchers at the University of Münster in Germany and Shanghai University in China [1]. It combines a known search algorithm, Monte Carlo tree search, with three neural networks. Such tools have attracted attention, but few chemists have yet adopted them.
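
The working-backwards idea can be illustrated with a toy sketch. This is not 3N-MCTS: it uses a plain depth-first search with no neural networks, and the rule table `RULES`, the `PURCHASABLE` set and the function `plan` are hypothetical names invented for this illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical disconnection table (an assumption for this sketch):
# each product maps to alternative sets of precursors that could make it.
RULES: Dict[str, List[List[str]]] = {
    "ethyl acetate": [["acetic acid", "ethanol"]],  # esterification
    "acetic acid": [["ethanol"]],                   # oxidation
}

# Hypothetical set of starting materials assumed to be available.
PURCHASABLE: Set[str] = {"ethanol"}


def plan(target: str, depth: int = 5) -> Optional[List[str]]:
    """Work backwards from `target` to purchasable materials.

    Returns the reaction steps in forward order, or None if no route
    is found within `depth` backward steps.
    """
    if target in PURCHASABLE:
        return []  # nothing to make
    if depth == 0:
        return None  # give up on routes that are too long
    for precursors in RULES.get(target, []):
        steps: List[str] = []
        for precursor in precursors:
            sub_route = plan(precursor, depth - 1)
            if sub_route is None:
                break  # this disconnection fails, try the next one
            steps.extend(sub_route)
        else:
            # Every precursor could be made or bought.
            return steps + [f"{' + '.join(precursors)} -> {target}"]
    return None


if __name__ == "__main__":
    for step in plan("ethyl acetate") or []:
        print(step)
```

A real planner has to choose among thousands of possible disconnections at each step, which is where learned models and a guided search come in. The toy table above sidesteps that problem entirely.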

To make accurate chemical predictions, an AI system needs sufficient knowledge of the specific chemical structures that different reactions work with. Chemists who discover a new reaction usually publish results exploring this, but often these are not exhaustive. Unless AI systems have comprehensive knowledge, they might end up suggesting starting materials with structures that would stop reactions working or lead to incorrect products [2].

An example of mixed progress comes in what AI researchers call ‘inverse design’. In chemistry, this involves starting with desired physical properties and then identifying substances that have these properties, and that can, ideally, be made cheaply. For example, AI-based inverse design helped scientists to select optimal materials for making blue phosphorescent organic light-emitting diodes [3].

Computational approaches to inverse design, which ask a model to suggest structures with the desired characteristics, are already in use in chemistry, and their outputs are routinely scrutinized by researchers. If AI is to outperform pre-existing computational tools in inverse design, it needs enough training data relating chemical structures to properties. But what is meant by ‘enough’ training data in this context depends on the type of AI used.

A generalist generative-AI system such as ChatGPT, developed by OpenAI in San Francisco, California, is simply data-hungry. To apply such a generative-AI system to chemistry, hundreds of thousands — or possibly even millions — of data points would be needed.

A more chemistry-focused AI approach trains the system on the structures and properties of molecules. In the language of AI, molecular structures are graphs. In molecules, chemical bonds connect atoms, just as edges connect nodes in graphs. Such AI systems fed with 5,000–10,000 data points can already beat conventional computational approaches to answering chemical questions [4]. The problem is that, in many cases, even 5,000 data points are far more than are currently available.
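
As a minimal sketch of the atoms-as-nodes, bonds-as-edges picture, the snippet below stores ethanol as a graph of its heavy atoms. The class `MolGraph` and its fields are hypothetical names chosen for this illustration. Hydrogens are left implicit, and real graph-based models attach far richer features to each node and edge.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MolGraph:
    """A molecule as a graph (hypothetical helper for illustration)."""
    atoms: List[str]                   # nodes: element symbols
    bonds: List[Tuple[int, int, int]]  # edges: (atom i, atom j, bond order)

    def adjacency(self) -> Dict[int, List[int]]:
        """Map each atom index to the indices of its bonded neighbours."""
        adj: Dict[int, List[int]] = {i: [] for i in range(len(self.atoms))}
        for i, j, _order in self.bonds:
            adj[i].append(j)
            adj[j].append(i)
        return adj


# Ethanol, CH3-CH2-OH, heavy atoms only: C(0)-C(1)-O(2), two single bonds.
ethanol = MolGraph(atoms=["C", "C", "O"], bonds=[(0, 1, 1), (1, 2, 1)])

if __name__ == "__main__":
    for index, neighbours in ethanol.adjacency().items():
        print(ethanol.atoms[index], "bonded to", [ethanol.atoms[n] for n in neighbours])
```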

The AlphaFold protein-structure-prediction tool [5], arguably the most successful chemistry AI application, uses such a graph-representation approach. AlphaFold’s creators trained it on a formidable data set: the information in the Protein Data Bank, which was established in 1971 to collate the growing set of experimentally determined protein structures and currently contains more than 200,000 structures. AlphaFold provides an excellent example of the power AI systems can have when furnished with sufficient high-quality data.

So how can other AI systems create or access more and better chemistry data? One possible solution is to set up systems that pull data out of published research papers and existing databases, such as an algorithm created by researchers at the University of Cambridge, UK, that converts chemical names to structures [6]. This approach has accelerated progress in the use of AI in organic chemistry.

Another potential way to speed things up is to automate laboratory systems. Existing options include robotic materials-handling systems, which can be set up to make and measure compounds to test AI model outputs [7, 8]. However, at present this capability is limited, because the systems can carry out only a relatively narrow range of chemical reactions compared with a human chemist.

AI developers can train their models using both real and simulated data. Researchers at the Massachusetts Institute of Technology in Cambridge have used this approach to create a graph-based model that can predict the optical properties of molecules, such as their colour [9].

There is another, particularly obvious solution: AI tools need open data. How people publish their papers must evolve to make data more accessible. This is one reason why Nature requests that authors deposit their code and data in open repositories. It is also yet another reason to focus on data accessibility, above and beyond scientific crises surrounding the replication of results and high-profile retractions. Chemists are already addressing this issue with facilities such as the Open Reaction Database.

But even this might not be enough to allow AI tools to reach their full potential. The best possible training sets would also include data on negative outcomes, such as reaction conditions that don’t produce desired substances. And data need to be recorded in agreed and consistent formats, which they are not at present.
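
To make the point about negative outcomes and consistent formats concrete, here is a minimal sketch of one possible record layout. The class `ReactionRecord`, its field names and the function `to_json_lines` are assumptions made for this illustration, not the schema of the Open Reaction Database or any other existing resource, and the numerical values are placeholders rather than measured data.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ReactionRecord:
    """One experiment in a fixed, machine-readable layout (hypothetical)."""
    reactants: List[str]            # SMILES strings
    product: str                    # SMILES string of the intended product
    solvent: str
    temperature_c: float
    yield_percent: Optional[float]  # None means the yield was not measured
    succeeded: bool                 # False records a negative outcome


def to_json_lines(records: List[ReactionRecord]) -> str:
    """Serialize records as one JSON object per line with stable key order."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


# Placeholder entries: the same intended esterification under two sets of
# conditions, one recorded as a success and one as a failure.
records = [
    ReactionRecord(["CC(=O)O", "CCO"], "CCOC(C)=O", "toluene", 80.0, 65.0, True),
    ReactionRecord(["CC(=O)O", "CCO"], "CCOC(C)=O", "water", 25.0, 0.0, False),
]

if __name__ == "__main__":
    print(to_json_lines(records))
```

Because every record carries the same fields, the failed run is stored alongside the successful one and can be used in training rather than being lost in a notebook.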

Chemistry applications require computer models to be better than the best human scientist. Only by taking steps to collect and share data will AI be able to meet expectations in chemistry and avoid becoming a case of hype over hope.

Nature 617, 438 (2023)

doi: https://doi.org/10.1038/d41586-023-01612-x

References

1. Segler, M. H. S., Preuss, M. & Waller, M. P. Nature 555, 604–610 (2018).

2. Struble, T. J. et al. J. Med. Chem. 63, 8667–8682 (2020).

3. Kim, K. et al. npj Comp. Mater. 4, 67 (2018).

4. Yang, K. et al. J. Chem. Inf. Model. 59, 3370–3388 (2019).

5. Jumper, J. et al. Nature 596, 583–589 (2021).

6. Lowe, D. M., Corbett, P. T., Murray-Rust, P. & Glen, R. C. J. Chem. Inf. Model. 51, 739–753 (2011).

7. Coley, C. W. et al. Science 365, eaax1566 (2019).

8. Angello, N. H. et al. Science 378, 399–405 (2022).

9. Greenman, K. P., Green, W. H. & Gomez-Bombarelli, R. Chem. Sci. 13, 1152–1162 (2022).
