A new hallucination evaluation benchmark shows the model is more accurate at catching hallucinations than GPT-4o, GPT-4-Turbo, Claude-3, and industry solutions
Patronus AI announced the release of Lynx, a state-of-the-art hallucination detection model designed to address the challenge of hallucinations in large language models (LLMs).
Hallucinations occur when LLMs generate responses that are coherent but do not align with factual reality or the input context, undermining their practical utility across applications. While proprietary LLMs such as GPT-4 have increasingly been used to detect these inconsistencies (“LLM-as-a-judge”), there are concerns over their reliability, scalability, and cost.
Lynx represents a breakthrough in the field by enabling real-time hallucination detection without the need for manual annotation. Patronus AI also open-sourced HaluBench, a new benchmark sourced from real-world domains, to comprehensively assess faithfulness in LLM responses.
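For developers who want a feel for the benchmark, the sketch below shows one way HaluBench might be inspected with the Hugging Face `datasets` library. The dataset identifier, split name, and field names here are illustrative assumptions, not confirmed details of the release.

```python
# Minimal sketch: inspecting HaluBench with the Hugging Face `datasets` library.
# The dataset ID, split, and field names below are assumptions for illustration;
# check Patronus AI's release for the exact identifiers.
from datasets import load_dataset

halubench = load_dataset("PatronusAI/HaluBench", split="test")  # assumed ID/split

example = halubench[0]
# A faithfulness example typically pairs a question and a source passage
# with a model answer and a PASS/FAIL-style label.
print(example.get("question"))
print(example.get("passage"))
print(example.get("answer"))
print(example.get("label"))
```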
“Since the release of ChatGPT in November 2022, the proliferation of large language models has revolutionized text generation and knowledge-intensive tasks like question answering. However, hallucinations, where models produce coherent but inaccurate responses, remain a critical challenge and pose significant risks for enterprises,” said Anand Kannappan, co-founder and CEO of Patronus AI. “We address this challenge head-on with Lynx, a groundbreaking open source model capable of real-time hallucination detection. Today, we not only introduce the most powerful LLM-as-a-judge with Lynx, we also introduce HaluBench, a novel 15k-sample benchmark that LLM developers can use to measure the hallucination rate of their fine-tuned LLMs in domain-specific scenarios.”
Lynx is the first model to beat GPT-4 on hallucination tasks. Lynx (70B) achieved the highest accuracy at detecting hallucinations of all LLMs evaluated as judges, making it the largest and most powerful open-source hallucination detection model to date. It outperformed OpenAI’s GPT models and Anthropic’s Claude 3 models at a fraction of their size.
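As a rough illustration of the LLM-as-a-judge pattern described here, the sketch below runs an open-weights Lynx checkpoint through the Hugging Face `transformers` pipeline and asks for a PASS/FAIL faithfulness verdict. The model identifier, prompt wording, and output convention are assumptions for illustration, not a confirmed interface.

```python
# Sketch: using Lynx as an LLM-as-a-judge with `transformers`.
# The model ID and the PASS/FAIL prompt convention below are assumptions
# based on the announcement, not a documented API.
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct",  # assumed model ID
)

prompt = (
    "Given the QUESTION, DOCUMENT and ANSWER, determine whether the ANSWER "
    "is faithful to the DOCUMENT. Respond with PASS if it is faithful and "
    "FAIL if it hallucinates, followed by brief reasoning.\n\n"
    "QUESTION: What causes tides?\n"
    "DOCUMENT: Tides are caused primarily by the gravitational pull of the "
    "Moon and, to a lesser extent, the Sun.\n"
    "ANSWER: Tides are caused by underwater volcanic activity."
)

verdict = judge(prompt, max_new_tokens=128, return_full_text=False)
print(verdict[0]["generated_text"])  # expect a FAIL verdict for this answer
```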
Lynx and HaluBench also cover real-world domains such as finance and medicine, which previous datasets and models did not, making them more applicable to real-world problems.
Results:
- On medical question answering (PubMedQA), Lynx (70B) was 8.3% more accurate than GPT-4o at detecting medical inaccuracies.
- Lynx (8B) outperformed GPT-3.5 by 24.5% on HaluBench, and beat Claude-3-Sonnet and Claude-3-Haiku by 8.6% and 18.4% respectively, showing strong capabilities in a smaller model.
- Both Lynx (8B) and Lynx (70B) achieved significantly higher accuracy than open-source baselines, with Lynx (8B) gaining 13.3% over Llama-3-8B-Instruct through supervised fine-tuning.
- Lynx (70B) outperformed GPT-3.5 by an average of 29.0% across all tasks.