xAI’s Grok-2 has just been released, and it has finally caught up with, and in some areas even surpassed, the leading LLMs. xAI put Grok-2 and Grok-2 mini through a comprehensive battery of benchmarks covering reasoning, reading comprehension, math, science, and coding: the grand trial that today’s AI contenders must face to prove their mettle.
Grok-2 Benchmarks
Surprisingly, or perhaps not if you’ve been following xAI’s trajectory, Grok-2 has outpaced GPT-4 Turbo. Now, let’s not get carried away: the reigning champs, GPT-4o and Claude 3.5 Sonnet, still hold the throne in most areas. However, Grok-2 has claimed the crown in one particularly tricky domain: visual math reasoning. With a MathVista benchmark score of 69.00%, it outscores every other LLM on that test.
To put things in perspective, if you look at the Chatbot Arena leaderboard, a sort of digital boxing ring for LLMs, Grok-2 is winning bouts left and right, bowing out only to the formidable ChatGPT-4o and the polished Gemini 1.5 Pro.
Grok-2 Elo Scores vs. Other Chatbots on Chatbot Arena
But let’s be honest: benchmarks and theoretical scores are like reading about food—informative, but not nearly as satisfying as the real thing. The true test of a chatbot’s prowess lies in how it handles the messy, unpredictable business of daily life. So, what happens when we pit Grok-2 against ChatGPT in a real-world showdown? Well, there’s only one way to find out…
Test 1: Real-time Data Access and Accuracy
First on the agenda: the ever-important task of extracting data from the internet and checking its accuracy. And what better topic to probe than the recently concluded Paris 2024 Olympic Games?
The prompt: "List all the medals won by the top five countries in the Paris 2024 Olympic Games."
ChatGPT-4o Response
ChatGPT-4o, with the calm efficiency of a seasoned librarian, zipped through four websites and presented the results in record time. The data was crisp, clear, and most importantly, correct—a well-executed performance, exactly what you'd expect from a top-tier generative AI.
Grok-2 Response
Grok-2, on the other hand, took a slightly more whimsical approach. It was quick, to be sure, and its response had a certain flair, but that’s where the good news ends. The data, unfortunately, was about as reliable as a weather forecast from a fortune cookie. Great Britain, proudly announced as third in the medal rankings, had actually finished in seventh place. Even more perplexing was Grok-2’s suggestion that these standings might still change, despite the inconvenient fact that the Olympics had already wrapped up.
The likely culprit? Grok-2’s penchant for sourcing its information from tweets, rather than pulling real-time data from the broader internet. So, if you’re in the business of serious research or need precise data, Grok-2 might not be your go-to guide.
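If you’d rather run this comparison yourself than take our word for it, here’s a minimal sketch of sending the same prompt to both models through their chat-completions APIs. Treat it as an approximation: the xAI base URL and the "grok-2" model name are assumptions (check xAI’s current docs), and the hosted chat interfaces add live web browsing that these raw API calls don’t.

```python
# Hedged sketch: send the Olympics prompt to GPT-4o and (assumed) Grok-2
# via OpenAI-compatible chat-completions APIs. The xAI endpoint and model
# name are assumptions; the hosted chat UIs also browse the web, which
# these raw calls do not.
from openai import OpenAI

PROMPT = ("List all the medals won by the top five countries "
          "in the Paris 2024 Olympic Games.")

def ask(client: OpenAI, model: str) -> str:
    # Single-turn chat completion with the test prompt.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return response.choices[0].message.content

# ChatGPT-4o via the official OpenAI SDK (reads OPENAI_API_KEY from the environment).
print(ask(OpenAI(), "gpt-4o"))

# Grok-2 via xAI's OpenAI-compatible endpoint (base URL and model name are assumptions).
xai_client = OpenAI(base_url="https://api.x.ai/v1", api_key="YOUR_XAI_API_KEY")
print(ask(xai_client, "grok-2"))
```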
Test 2: Image Generation
It's clear that when it comes to tasks beyond composing emails or firing off the occasional text, Grok-2 may not be your best buddy. But before we write it off entirely, there’s one more arena where Grok-2 might just pull a rabbit out of its digital hat: image generation. With DALL-E 3’s quality currently deteriorating, the timing couldn’t be better for a fresh contender to steal the limelight.
The prompt: "Widescreen, realistic, image of two executives shaking hands in an office. The image should focus on the handshake."
ChatGPT-4o Response
Now, despite DALL-E’s recent downturn, it still managed to churn out a fairly decent image. The executives’ fingers were rendered with the kind of detail that would make a hand model proud, though they appeared to be shaking hands across what looked suspiciously like a banquet table. When asked to focus on the handshake, DALL-E took the instruction rather literally, to the point where everything else in the image became an afterthought.
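For the curious, the DALL-E 3 half of this test can be reproduced in a few lines against OpenAI’s Images API; a hedged sketch follows ("1792x1024" is DALL-E 3’s widescreen size, and the hosted ChatGPT interface may rewrite prompts before they reach the model, so results will differ). Grok-2’s image generation was only available through its chat interface at the time of writing, so it isn’t shown here.

```python
# Hedged sketch of the DALL-E 3 side of the test via OpenAI's Images API.
# "1792x1024" is DALL-E 3's widescreen size; defaults are used elsewhere,
# which is not necessarily what the hosted ChatGPT interface does.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="dall-e-3",
    prompt=("Widescreen, realistic, image of two executives shaking hands "
            "in an office. The image should focus on the handshake."),
    size="1792x1024",
    n=1,
)

print(result.data[0].url)  # URL of the generated image
```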
Grok-2 Response
Grok-2, however, decided to up the ante. Its version was a visual delight, with realistic lighting and the executives positioned in a way that didn’t suggest they were communicating across a chasm. On a purely aesthetic level, Grok-2’s creation was the clear winner. This might be thanks to Grok-2 having few creative guardrails; rumor has it that Grok-2 is quite happy to generate images featuring copyrighted characters and even indulge in a bit of deepfaking (see the image of Donald Trump winning the Olympics below). This does bring up the rather juicy question of whether copyright laws are putting the brakes on the evolution of generative AI, but let’s save that debate for another day.
Grok-2 Generating an Image of Donald Trump Winning a Gold Medal at the Olympics
Overall, while Grok-2 is a marked improvement over its predecessor, it still stumbles where precision is key, making it less than ideal for tasks that require more than a knack for witty replies and quick responses. When it comes to generating images, though, Grok-2 delivers some eye-catching results, even if it doesn’t yet offer the editing features that more established image-generation AIs do. And there’s no denying the entertainment value: when Elon Musk proclaimed that “Grok is the most fun AI in the world,” he wasn’t exaggerating. The freedom from censorship in searches and the anything-goes approach to image creation are a refreshing change of pace in the often buttoned-up world of LLMs. For that reason alone, Grok-2 has certainly earned its spot among the current LLM heavyweights.
Elon Musk’s Tweet Upon Grok-2’s Release