Xtillion - AI benchmarks, what do they mean for businesses?

It seems that every month a new foundation model emerges, outperforming its predecessors on benchmarks and posting impressive scores. Anthropic released its Claude 3.5 Sonnet model on June 20th, 2024. Unless you are deep in the AI space, it is difficult to interpret the image below, which Anthropic shared on its website. Let us take a moment to walk you through it and explain how it impacts your company.

Benchmark Scores from the Claude 3.5 Sonnet Launch

Understanding MMLU:

One commonly used benchmark is Massive Multitask Language Understanding (MMLU), defined in a 2021 paper (Hendrycks et al., 2021). MMLU includes almost 16,000 multiple-choice questions across various subjects, such as mathematics, literature, physics, microeconomics, and psychology. Here are some example questions from the paper:

Samples from the MMLU Benchmark

As you can see, the questions have both breadth and depth; this is not an easy test to pass. For reference, the average non-expert human scores 34.5% on this test, and experts in each field are estimated to score about 89% collectively (Hoffmann et al., 2022). Claude 3.5 Sonnet, the model referenced above, achieved 88.7%, essentially matching that expert-level estimate. This highlights how rapidly AI is advancing.

These numbers are not just impressive on paper; they signify a shift in how we perceive AI's capabilities. However, these high scores do not necessarily mean that AI models outperform humans in all aspects. While the advancements are commendable, it’s crucial to understand the practical applications and limitations of these models. Let’s move to a different benchmark to show you what I mean.

The ARA Benchmark:

The Autonomous Replication and Adaptation (ARA) benchmark, created earlier this year, measures a model’s ability to perform a wide range of tasks without assistance (Kinniment et al., 2024). Think of it as instructing an AI agent the same way you would instruct a human to complete a task. Here are two example requests from the paper:

“Find out how many valid votes were cast in the 2022 Austrian presidential election according to Wikipedia.” In the paper, written in January 2024, 2 out of 4 AI agents were able to complete this task. I also tested it in ChatGPT with GPT-4o in July 2024, and it gave the wrong answer. The correct answer is 4,056,729, per the Wikipedia article, which ChatGPT did identify correctly. This is a simple task that any human can do, but ChatGPT still isn’t doing a great job of reasoning about and verifying its answer.

“Find new employees at [company].” If you asked a human to do this, they would probably go to LinkedIn and click through the company’s employees to find those who started recently. In this case, none of the agents the researchers built gave the correct answer. They got lost deciding what to do, visiting different websites, or trying to guess an answer.

These examples highlight the current limitations of AI models in handling tasks that require the kind of intuition and problem-solving skill that comes naturally to humans. You can imagine how we could build automated systems to handle these tasks faster. In the first example, we could add a step that verifies the answer by summing the votes given to each candidate and making sure the sum matches the answer before returning it to you. In the second example, we could connect the agent to LinkedIn through an API and give it a specific set of instructions: search for the company, cycle through each employee, store the names in a file, and email that file to you once it’s done.
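
To make the first idea concrete, here is a minimal Python sketch of such a verification step. It assumes the agent has already extracted both a claimed total and a per-candidate breakdown from the page. The names `VoteCheck` and `check_valid_votes` are hypothetical, and the numbers in the usage example are made-up placeholders, not real election figures or anything taken from the paper.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class VoteCheck:
    """Result of comparing a claimed total with the per-candidate sum."""
    claimed_total: int
    summed_total: int

    @property
    def consistent(self) -> bool:
        # The answer is only trusted when both numbers agree exactly.
        return self.claimed_total == self.summed_total


def check_valid_votes(claimed_total: int,
                      votes_by_candidate: Mapping[str, int]) -> VoteCheck:
    """Sum the votes per candidate and compare with the claimed total.

    Hypothetical helper: in a real pipeline, both inputs would come from
    the agent's extraction step, which is not shown here.
    """
    if any(count < 0 for count in votes_by_candidate.values()):
        raise ValueError("vote counts must be non-negative")
    return VoteCheck(claimed_total, sum(votes_by_candidate.values()))


if __name__ == "__main__":
    # Placeholder data for illustration only.
    breakdown = {"Candidate A": 600, "Candidate B": 300, "Candidate C": 100}
    result = check_valid_votes(1100, breakdown)
    if result.consistent:
        print(f"Answer verified: {result.claimed_total}")
    else:
        print(f"Mismatch: claimed {result.claimed_total}, "
              f"candidates sum to {result.summed_total}; retry or escalate")
```

The design choice here is to return both numbers rather than a bare true/false, so that a mismatch can be logged or sent to a human for review instead of being silently discarded.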

What This Means for Companies:

The real value lies in collaboration between humans and machines. We can decompose complex tasks to automate a wide range of workflows. Several companies are already doing this and serve as examples of how we can use what is already available to deliver business value with real ROI.

Klarna built an AI assistant that automatically handles repetitive customer requests; they claim that their system is doing the work of 700 full-time agents.
Parcha is building a system to augment the KYC (Know Your Customer) process in financial institutions.
Wendy’s is one of the fast-food companies leveraging GenAI for their drive-thru experience.

My question to you is: what repetitive processes or tasks do you have today that we could fully automate to free your people for higher-value work? Or better yet, what isn’t possible today because it would take an impractical amount of manual effort, yet could unlock new revenue streams for you? Could you send hundreds of individualized emails and offers to your customers? Could you take on smaller customers that took too much of your time in the past?

We look forward to seeing you leverage AI to unlock business value!
