Scientists increasingly rely on vast quantities of published studies to advance their research, making the ability to navigate this literature essential. A recent study by researchers from Cornell University and Google assessed how well six AI systems built on large language models (LLMs) interpret scientific literature, specifically in the context of high-temperature cuprates, an important class of superconducting materials. The findings highlight both the potential and the limitations of these systems.
The research, published in the Proceedings of the National Academy of Sciences on March 10, 2024, involved a panel of 12 experts who evaluated LLMs, including ChatGPT-4 and Claude 3.5, on their ability to understand complex scientific questions. Lead author Haoyu Guo, a postdoctoral fellow at Cornell’s Laboratory of Atomic and Solid State Physics, noted that the study aimed to determine whether LLMs could assist newcomers in grasping intricate fields of research.
To facilitate the evaluation, the researchers created a comprehensive database of 1,726 curated scientific papers covering the evolution of high-temperature cuprates. They also developed a set of 67 questions designed to assess deep comprehension of the material. The study examined the performance of four LLMs—ChatGPT-4, Claude 3.5, Perplexity, and Gemini Advanced Pro 1.5—alongside NotebookLM, a Google tool that provides answers based on specific documents.
The results indicated that systems drawing on curated information, such as NotebookLM and a custom retrieval-augmented generation (RAG) system, performed significantly better than models left to search the open web. “LLMs operating on trusted data sources—papers we collected ourselves, not from the LLM searching the Internet—tend to perform better,” Guo explained. This finding underscores the importance of reliable source data in improving the accuracy of AI-generated responses.
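The retrieval step behind such a system can be illustrated in a few lines: given a curated set of documents, the passages most similar to the user's question are fetched and prepended to the model's prompt, grounding its answer in trusted sources. The following is a minimal sketch using bag-of-words cosine similarity; the passages and question are invented for illustration, and production RAG systems like the one in the study would use neural embeddings and a vector database rather than word counts.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, corpus, k=2):
    """Return the k passages most similar to the question."""
    q = vectorize(question)
    return sorted(corpus, key=lambda p: cosine(q, vectorize(p)), reverse=True)[:k]

# Toy curated "corpus" standing in for a paper database.
corpus = [
    "Cuprate superconductors show a pseudogap phase above Tc.",
    "The transformer architecture underlies modern language models.",
    "Doping dependence of the superconducting dome in cuprates.",
]

passages = retrieve("What explains the pseudogap phase in cuprate superconductors?", corpus)
# The retrieved passages would then be placed into the LLM prompt as context,
# so the model answers from the curated papers rather than from web search.
```

The key design point the study highlights is not the similarity metric but the corpus: restricting retrieval to vetted documents is what kept answers anchored to trustworthy material.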
Despite their strengths in extracting text-based information, all LLMs tested were “totally incapable” of engaging with data visualizations, according to co-author Eun-Ah Kim, the Hans A. Bethe Professor of Physics at Cornell. She emphasized the critical role that data visualization plays in scientific analysis, noting that the custom RAG model was notably superior in this area.
The researchers compiled a list of improvements for future LLMs. Key areas for development include more accurate attribution of claims made by the models, enhanced ability to synthesize complex information, and improved comprehension of visual data. “It has been about a year since we performed the benchmark, and we have seen improvements in many aspects,” Guo stated. “But visual reasoning is still underdeveloped.”
Kim expressed optimism regarding the potential for LLMs to support young researchers. “Knowing the facts used to be brandished as a ticket to the table. Holding a fact in your head should not be the ticket. The ticket should be: Do you know how to think in a creative way?” she said.
This study is the first from the National Science Foundation AI-Materials Institute, which Kim directs. As LLMs continue to evolve, their integration into scientific research could revolutionize how emerging scientists access and comprehend complex fields.