Simeon Spencer

OpenAI's GPT o1 Boasts PhD-Level Skills, But Does It Deliver?



If the last four months have felt suspiciously quiet on the AI frontier, brace yourself, because OpenAI has just casually dropped a new LLM family, o1. Yes, just four months after we were all getting cozy with GPT-4o, OpenAI has introduced us to a model that can supposedly solve harder problems, wrestle with trickier scientific puzzles, and even do your taxes (okay, maybe not that last part... yet).


The o1 family claims PhD-level brainpower, and not just in a "nice to have on a résumé" sort of way. No, we’re talking top-tier intelligence. This model ranks in the 89th percentile for competitive programming (Codeforces), rubs shoulders with the top 500 students in the U.S. Math Olympiad qualifier (AIME), and, if that's not enough, crushes human PhDs on a benchmark of physics, biology, and chemistry problems (GPQA).


GPT-4o vs. o1 Evals (Source: OpenAI)

These are just a few of the many benchmarks OpenAI shared, and they all point to the same conclusion: o1 doesn't just beat GPT-4o in a friendly competition; it leaves it in the dust. The performance difference is staggering, and naturally, we couldn't resist pitting them head-to-head ourselves.


Is Math Still Too Much for o1?


Given that o1 is supposedly rocking a PhD in science and math, you'd think it would breeze through a high school math question from the June SAT, right? After all, the SAT is the kind of thing designed to trip up teenagers, not PhD-level AI models.


Q: In the given equation, k is a positive constant. The product of the solutions to the equation is 154. What is the value of k?

A: k = 12
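The equation itself appears only as an image in the original question, so it isn't reproduced in the text here, but assuming it takes the usual quadratic form (an assumption for illustration), the intended shortcut is Vieta's formula for the product of the roots:

\[
ax^2 + bx + c = 0 \quad\Longrightarrow\quad x_1 \cdot x_2 = \frac{c}{a}.
\]

Setting that product equal to 154 and solving for wherever $k$ sits in the coefficients is the whole trick; no quadratic-formula grinding required, and it is how the official answer of $k = 12$ falls out.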


With this in mind, let’s see what both GPT-4o and o1 produced.

ChatGPT-4o Answer


ChatGPT o1 Answer


Now, it was no shocker that GPT-4o didn’t quite hit the mark. After all, that’s why o1 exists in the first place, isn’t it? But here's where things get interesting: o1 didn’t get it right either. Sure, it was marginally closer to the actual answer than its predecessor, but close only counts in horseshoes and hand grenades. A university admission test stumped a model that’s supposed to excel at PhD-level science and math. We’re not talking about solving a Millennium Prize problem here—this was a standard SAT question.


To be fair, everyone deserves a second chance, even AI. So, we decided to put o1 through its paces again, this time with a logic test previously given to both GPT-4 and Claude 3 Opus. Let's see how that went...


Q: A cable of 80 meters (m) is hanging from the top of two poles that are both 50 m from the ground. What is the distance between the two poles, to one decimal place, if the center of the cable is 10m above the ground?


Before we dive headlong into this brain teaser, here's a little spoiler: it's less about crunching numbers and more about seeing the bigger picture. Imagine this: each half of the 80 m cable is 40 m long, so hanging from a 50 m pole it reaches down to exactly 10 m above the ground. But here's the twist: there's no cable stretching between two poles at all. The poles are, in fact, standing in the same spot, a grand total of zero meters apart. The problem is more about perception than calculation. For reference, Claude 3 Opus answered it within seconds, while ChatGPT-4 and 4o could not produce an answer at all.
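To make that spoiler concrete, here is the arithmetic written out; nothing beyond halving the cable and subtracting heights:

\[
\frac{80\ \text{m}}{2} = 40\ \text{m per side}, \qquad 50\ \text{m} - 40\ \text{m} = 10\ \text{m above the ground}.
\]

Reaching a low point of 10 m uses the entire cable going straight down one pole and straight back up the other, so there is no slack left to cover any horizontal distance: the poles must be 0.0 m apart.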


ChatGPT o1 Answer


ChatGPT o1 did it! It solved the puzzle, albeit after a good 75 seconds, quite a contrast to Claude 3 Opus's answer in under 10 seconds. But still, credit where credit's due: o1 succeeded where its predecessors floundered. So, perhaps there's a glimmer of hope after all.


That said, while o1 is undoubtedly smarter than GPT-4o, it's not exactly miles ahead. Sure, it has caught up to Claude 3 Opus in terms of reasoning, but it still stumbles on a standard SAT math question. So, why hasn't OpenAI taken a bigger leap with its new models? Could it be because they don't have access to a 100,000-GPU supercluster like xAI does? We might have to wait for the next Grok model to know for sure.


For now, o1 seems like a solid choice if you need help debugging code or solving tricky problems. But if you’re counting on it to write your PhD thesis or ace that math exam, well, you might want to double-check its work first.
