Galileo Releases Hallucination Index for LLMs

Galileo, a frontrunner in machine learning for unstructured data, unveiled a Hallucination Index on November 15, 2023. This index, developed by Galileo Labs, aims to guide users of leading Large Language Models (LLMs) in selecting the model best suited to their needs and least likely to hallucinate.

The year 2023 has seen a rapid rise in the use of LLMs. However, no single model fits every use case, and hallucinations pose a significant challenge to adoption. In this context, a hallucination is AI-generated output that appears plausible but is factually incorrect or disconnected from the given context.

The Hallucination Index: A Measure of AI Hallucinations

The Hallucination Index developed by Galileo Labs evaluates eleven LLMs from major AI companies, including OpenAI, Meta, and Hugging Face. It measures each model’s propensity to hallucinate across common generative AI task types.

Key Findings of the Index

The key findings from the index include:

  • For Question & Answer without Retrieval tasks, OpenAI’s GPT-4 emerged as the top performer with a Correctness Score of 0.77. Among open-source models, Meta’s Llama-2-70b led with a Correctness Score of 0.65.
  • For Question & Answer with Retrieval tasks, OpenAI’s GPT-4-0613 excelled with a Context Adherence score of 0.76. Hugging Face’s Zephyr-7b, an open-source model, surpassed Meta’s Llama-2-70b with a Context Adherence score of 0.71.
  • For Long-form Text Generation tasks, OpenAI’s GPT-4-0613 showed the least tendency to hallucinate. Meta’s open-source Llama-2-70b-chat rivaled GPT-4’s capabilities for this task.

Evaluating Performance and Cost

While OpenAI’s models were the least prone to hallucination, they come at a higher cost due to their API-based pricing model. Lower-cost versions of their models, such as GPT-3.5-turbo, are available for organizations seeking to reduce expenses. Open-source models also offer significant cost savings; for instance, Hugging Face’s Zephyr model can be more cost-effective for Question & Answer with Retrieval tasks.

Supporting Evaluation Metrics

Galileo’s proprietary evaluation metrics, Correctness and Context Adherence, support these analyses. These metrics are powered by ChainPoll, a hallucination detection methodology developed by Galileo Labs. During the creation of the index, these evaluation metrics were shown to detect hallucinations with 87% accuracy, providing a reliable way to automate hallucination risk detection.
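The announcement does not detail how ChainPoll is implemented, but its publicly described idea is to poll an LLM judge several times with a chain-of-thought prompt and aggregate the verdicts into a score. The sketch below illustrates that idea only; the prompt wording, judge model, polling count, and OpenAI client usage are assumptions for illustration, not Galileo’s actual implementation.

```python
# Illustrative ChainPoll-style scorer (assumed implementation, not Galileo's code):
# ask an LLM judge several times, with chain-of-thought reasoning, whether a
# response is supported by its context, then average the yes/no verdicts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "Context:\n{context}\n\n"
    "Response:\n{response}\n\n"
    "Think step by step: does the response make any claim that is not supported "
    "by the context? End your answer with 'VERDICT: yes' if it hallucinates, "
    "or 'VERDICT: no' if it does not."
)

def chainpoll_score(context: str, response: str, n_polls: int = 5) -> float:
    """Return the fraction of judge runs that found no hallucination (higher is better)."""
    clean_votes = 0
    for _ in range(n_polls):
        completion = client.chat.completions.create(
            model="gpt-3.5-turbo",  # illustrative judge model
            temperature=1.0,        # sampling variation is what makes repeated polling informative
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(context=context, response=response),
            }],
        )
        verdict = completion.choices[0].message.content.lower()
        if "verdict: no" in verdict:
            clean_votes += 1
    return clean_votes / n_polls
```

A score computed this way corresponds loosely to Context Adherence for retrieval-style tasks, where the judge checks the response against supplied context; a Correctness-style metric would instead judge factual accuracy without a reference context.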

Galileo’s mission is to unlock the value of unstructured data for machine learning. Given that more than 80% of the world’s data is unstructured, the company aims to provide the right data-focused tools to build high-performing models quickly.