Scientific Frontline: AI Without Hallucinations: Multi-Agent Protocol

Thursday, May 28, 2026

Image Credit: Courtesy of Binghamton University

Scientific Frontline: Extended "At a Glance" Summary: Multi-Agent AI Verification Protocol

The Core Concept: A novel artificial intelligence protocol designed to eliminate hallucinations by forcing multiple large language models (LLMs) to reference authoritative databases and "vote" on the most accurate response.

Key Distinction/Mechanism: Unlike relying on a single generative AI model that might confidently produce false information, this method leverages retrieval-augmented generation (RAG) across multiple open-source chatbots. The models submit their answers for a consensus vote, ensuring the final output is rigorously validated by a majority of the AI agents.

Major Frameworks/Components:

Retrieval-Augmented Generation (RAG): Forces AI models to consult authoritative medical terminology databases before generating responses.
Multi-Agent Voting Mechanism: Utilizes an array of open-source LLMs (typically seven per experiment) to cross-verify answers and establish an evidence-based consensus.
Digital Twins: Dynamic, virtual replicas of physical processes continuously updated with real-time data to create predictive simulations for precision medicine.
Multi-Scale Network Models: Extract and verify evidence across varying data scales, ranging from multiomics to epidemiological and behavioral sources.

Branch of Science: Artificial Intelligence, Computer Science, Systems Science, and Biomedicine.

Future Application: The protocol will be deployed to advance precision medicine, optimize healthcare outcomes, and accurately map adverse drug reactions. Additionally, the framework can be adapted to eliminate fabricated legal citations, fake academic references, and historical errors in AI-generated text.

Why It Matters: By establishing a reliable, reproducible method for knowledge verification, this protocol ensures that generative AI can be safely trusted in high-stakes environments like healthcare, significantly mitigating the risks associated with AI hallucinations.

As chatbots powered by artificial intelligence become more ingrained in our everyday lives, people are increasingly using them to help diagnose their medical concerns.

Should I be worried about this rash? What if this insect bite gets infected? Is this pain the symptom of a larger problem? When dealing with someone’s health, the answers need to be as accurate as possible.

Last year, Binghamton University researchers tested OpenAI’s ChatGPT, which showed high accuracy in identifying disease terms, drug names, and genetic information. However, the chatbot also generated a high number of “hallucinations.”

A follow-up study funded by a $100,000 grant from New York State’s Empire AI Consortium may have found a way to eliminate that confidently delivered but fake information.

Ahmed Abdeen Hamed—a research fellow for the Thomas J. Watson College of Engineering and Applied Science’s School of Systems Science and Industrial Engineering—collaborated with George J. Klir Professor of Systems Science Luis M. Rocha to develop an innovative verification method, and the journal STAR Protocols recently published their conclusions.

From Plain Language to Diagnosis

The new protocol harnesses the growing number of open-source AI options, each of which has a different way to arrive at an answer to an inquiry. Hamed and Rocha chose seven of these large language models and forced them to use retrieval-augmented generation (RAG), which required them to reference an authoritative database of medical terminology before giving a response.
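The retrieval step can be illustrated with a minimal sketch, assuming a tiny in-memory vocabulary in place of the authoritative database. The entries, the placeholder identifiers, and the function names are all hypothetical and are not taken from the published protocol. A full RAG pipeline would typically pass the retrieved entries into the model's prompt; here the model is reduced to a plain string guess so that only the grounding check is shown.

```python
from difflib import get_close_matches
from typing import List, Optional, Tuple

# Hypothetical stand-in for an authoritative terminology database.
# The identifiers are placeholders, not real medical codes.
VOCABULARY = {
    "skin rash": "TERM:0001",
    "insect bite": "TERM:0002",
    "chest pain": "TERM:0003",
}


def retrieve_candidates(text: str, limit: int = 3) -> List[Tuple[str, str]]:
    """Return (term, identifier) pairs whose term is textually close to the input."""
    matches = get_close_matches(text.lower(), list(VOCABULARY), n=limit, cutoff=0.4)
    return [(term, VOCABULARY[term]) for term in matches]


def grounded_answer(model_guess: str) -> Optional[Tuple[str, str]]:
    """Accept a model's guess only if it maps onto an entry in the vocabulary."""
    candidates = retrieve_candidates(model_guess, limit=1)
    return candidates[0] if candidates else None
```

In this sketch, a guess that does not resemble any vocabulary entry returns `None` rather than an invented term and identifier.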

Over 10,000 experiments, the seven chatbots all received the same plain-language symptoms, and each of them came up with what it thought were the medical terms for them, complete with an official identification number. Then the bots put the answers up for a “vote.”
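The voting step might look something like the sketch below. It assumes each model returns a (term, identifier) pair or nothing at all, and that an answer needs support from at least two models to be kept. The function names and the default threshold are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

Answer = Tuple[str, str]  # (medical term, identifier)


def tally_votes(answers: Dict[str, Optional[Answer]]) -> Counter:
    """Count how many models proposed each (term, identifier) pair."""
    return Counter(answer for answer in answers.values() if answer is not None)


def consensus(
    answers: Dict[str, Optional[Answer]], min_support: int = 2
) -> Optional[Tuple[Answer, int]]:
    """Return the most-supported answer and its vote count, or None if support is too low."""
    votes = tally_votes(answers)
    if not votes:
        return None
    answer, support = votes.most_common(1)[0]
    return (answer, support) if support >= min_support else None
```

An answer proposed by only one model never reaches the threshold in this sketch, which is how a lone fabricated term would be filtered out.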

The result: 76.85% of the answers were supported by at least four LLMs, and the remaining 23.15% were supported by at least two. No unmatched terms—and no hallucinations.
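With seven voters, four votes is a strict majority, so the two reported figures correspond to a majority tier and a weaker tier with two or three supporting models. The sketch below shows one way such a breakdown could be computed from per-answer vote counts. The tier names and thresholds are assumptions made for illustration, and any counts passed in are hypothetical, not the study's data.

```python
from typing import Dict, Sequence


def support_tiers(
    support_counts: Sequence[int], majority: int = 4, minimum: int = 2
) -> Dict[str, float]:
    """Return the percentage of answers in each support tier.

    support_counts holds, for each answer, how many models voted for it.
    """
    total = len(support_counts)
    if total == 0:
        raise ValueError("support_counts must not be empty")
    strong = sum(1 for count in support_counts if count >= majority)
    partial = sum(1 for count in support_counts if minimum <= count < majority)
    unmatched = total - strong - partial  # answers backed by fewer than `minimum` models
    return {
        "majority": 100 * strong / total,
        "partial": 100 * partial / total,
        "unmatched": 100 * unmatched / total,
    }
```

In these terms, the reported result has 76.85% of answers in the majority tier, 23.15% in the partial tier, and none unmatched.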

“The new workflow is incredible,” Hamed said, “because it can verify anything from a biomedical point of view—biological knowledge with disease and genetics, translational knowledge from diseases to treatments and clinical trials, and also from a healthcare point of view with symptoms and treatments.”

A major advantage of the new protocol is that it can be repeated with many different combinations of models, and each repetition reinforces confidence in its accuracy.

“There can be 100 large language models that are open source, and every time we can perform an experiment with seven LLMs selected at random from that list,” Hamed said. “When we perform the experiment many, many times, we increase the confidence in the voting.”
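The repeated-sampling idea Hamed describes can be sketched as follows, assuming a pool of model names and a caller-supplied `ask` function that returns one model's answer to a question. Both are hypothetical stand-ins, and the fixed random seed is used only to make the sketch reproducible.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def run_experiments(
    pool: Sequence[str],
    ask: Callable[[str, str], str],
    question: str,
    panel_size: int = 7,
    rounds: int = 100,
    seed: int = 0,
) -> Counter:
    """Draw a random panel each round, take its top-voted answer, and count the winners."""
    rng = random.Random(seed)
    winners: Counter = Counter()
    for _ in range(rounds):
        panel = rng.sample(list(pool), panel_size)  # models chosen without replacement
        votes = Counter(ask(model, question) for model in panel)
        answer, _support = votes.most_common(1)[0]
        winners[answer] += 1
    return winners
```

If the same answer wins across most randomly drawn panels, that agreement is the added confidence the quote refers to. An answer that wins only occasionally would warrant closer inspection.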

Looking at Wider Applications

Rocha said the protocol is an important step toward increasing confidence in large multiscale network models of disease, which is a key topic for his Complex Adaptive Systems and Computational Intelligence Lab at Binghamton.

The lab’s research includes the development of “digital twins” for precision medicine. These dynamic, virtual replicas of physical processes are continuously updated using AI and real-time data to create precise, predictive simulations of human reactions so that healthcare providers can optimize outcomes before real-world testing.

“For instance, the protocol can extract and provide multi-agent verification of evidence for an adverse drug reaction for a given medication that is available in clinical trials, the scientific literature, pharmacological databases, and even social media discourse,” Rocha said. “And it can assist in the extraction of evidence at multiple scales, from multiomics to epidemiological and behavioral data sources, which we have already started to pilot by building multilayer models of ER+ breast cancer.”

Hamed hailed the input from his collaborator as essential: “The guidance from Professor Rocha was huge, from securing the grant to helping to decide the direction of where this research would go and coaching us to develop the protocols needed to make it all work.”

Although the study centered on biomedical applications, the Binghamton team’s discovery could be used to curb or eliminate other kinds of LLM hallucinations, such as fabricated legal citations, fake academic citations, or blatant historical errors.

“This protocol is a big step toward the democratization of knowledge verification,” Hamed said.

Beyond Binghamton

With this research, Hamed wraps up his fellowship at Binghamton University and transitions to a new role as a research associate professor at the University of Nebraska–Lincoln.

“Dr. Hamed’s period in our lab was most productive, not only in the rapid development of AI-driven workflows and publications but in catalyzing new, creative ideas for all lab members,” Rocha said. “I cannot wait to see the amazing new research he will produce at the University of Nebraska–Lincoln.”

Hamed is grateful for the opportunities he received at Binghamton.

“Watson College provided an exceptional environment where I could fully develop and implement the forward-looking research agenda I began during my time in Europe,” he said. “The direction I envisioned was still emerging there at the time, and the fellowship offered the right setting to advance it. I’m hopeful that the resulting peer-reviewed publications can help shift perspectives and demonstrate how GenAI and LLMs can be used responsibly, constructively, and with genuine innovation.”

Published in journal: STAR Protocols

Title: Protocol for evaluating ChatGPT in biomedical association generation and verification using a RAG-enabled, cross-model majority voting workflow

Authors: Ahmed Abdeen Hamed and Luis M. Rocha

Source/Credit: Binghamton University | Chris Kocher

Edited by: Scientific Frontline

Reference Number: ai052826_01
