Grok AI's Performance in Recent Benchmark Tests


Elon Musk’s bold claim that xAI’s large language model (LLM) Grok is the “best that currently exists” in important respects has recently been put to the test. University of Toronto researcher Kieran Paster ran a series of tests on various AI models, using held-out math exam questions to gauge their capabilities.

Grok’s Outstanding Performance

Paster’s tests revealed that Grok outperformed every other LLM except OpenAI’s GPT-4, scoring 59% to GPT-4’s 68%. Held-out questions are not part of the dataset used to train a model, so the model cannot simply recall them; it has to generalize from its training and apply genuine problem-solving skills to answer them.
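
To make the idea of a held-out evaluation concrete, here is a minimal sketch of such a scoring loop in Python. The `ask_model` callable, the question format, and the naive grading rule are hypothetical stand-ins for illustration, not Paster’s actual test harness.

```python
# Minimal sketch of a held-out evaluation loop (hypothetical harness,
# not Paster's actual code). The questions are assumed to be absent
# from the model's training data.
def evaluate_held_out(ask_model, questions):
    """ask_model(prompt) -> str; questions: list of {"prompt", "answer"} dicts."""
    correct = 0
    for q in questions:
        reply = ask_model(q["prompt"])
        # Naive grading: count the answer as correct if the expected
        # value appears anywhere in the model's reply.
        if q["answer"].strip() in reply:
            correct += 1
    return correct / len(questions)  # fraction of held-out questions solved

# Toy usage with a dummy "model":
questions = [{"prompt": "What is 7 * 8?", "answer": "56"}]
score = evaluate_held_out(lambda prompt: "The answer is 56.", questions)
print(f"Held-out accuracy: {score:.0%}")
```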

Comparing Grok with Other AI Models

In an additional test, Paster compared the models on GSM8k, a dataset of grade-school math word problems. Interestingly, while OpenAI’s ChatGPT-3.5 scored higher than Grok on GSM8k, it achieved only half of Grok’s score on the held-out math exam. Paster concluded that ChatGPT-3.5’s stronger GSM8k result is likely due to overfitting, a phenomenon in which a model performs well on its training data (or data closely resembling it) but poorly on genuinely new data.
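
As a rough illustration of how such a gap can be spotted, the sketch below flags models whose public-benchmark score far exceeds their held-out score. The scores and the 20-point threshold are illustrative placeholders, not the actual GSM8k or exam figures.

```python
# Rough sketch: a model whose public-benchmark score far exceeds its
# held-out score may have effectively memorized benchmark-like data.
# The threshold and the example scores below are arbitrary placeholders.
def overfitting_gap(benchmark_score, held_out_score, threshold=20.0):
    """Return True if the benchmark-vs-held-out gap exceeds the threshold (percentage points)."""
    return benchmark_score - held_out_score > threshold

# Illustrative check with placeholder numbers:
models = {
    "model_a": {"benchmark": 57.0, "held_out": 30.0},  # large gap -> suspicious
    "model_b": {"benchmark": 63.0, "held_out": 59.0},  # small gap -> generalizes
}
for name, s in models.items():
    flagged = overfitting_gap(s["benchmark"], s["held_out"])
    print(f"{name}: possible overfitting = {flagged}")
```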

Grok’s Ranking and Inference Capabilities

Excluding the models likely to suffer from overfitting, Grok ranks an impressive third on GSM8k, behind only Claude 2 and GPT-4. This result suggests that Grok has strong inference capabilities. A key limitation in comparing these models, however, is the lack of public information about the number of parameters in GPT-4, Claude 2, and Grok. Parameters are the values an LLM adjusts during training, and, generally, the more parameters, the more complex the model.

Grok’s Unique “Feel” for News

Grok’s beta testers have noted the model’s ability to distinguish between different biases in breaking news stories, likely a result of its training on data from X. This ability further demonstrates Grok’s potential for understanding and interpreting complex information.