
Elon Musk's Colossus: The World's First 100,000 GPU Supercluster Will Likely Power Grok-3


Source: xAI

xAI’s shiny new toy, the "Colossus"—officially known as the 100,000 H100 GPU Memphis supercluster—is now fully online. And not just on time, but a full four months ahead of schedule. In tech terms, that’s like a pizza delivery guy showing up at your door before you even finish ordering. From blueprint to blinking lights, Colossus was up and running in a jaw-dropping 122 days.


In our previous piece, we speculated that Colossus would be the missing puzzle piece xAI needed to get its Grok model off the starting blocks and catch up with its competitors. Fast forward to today, and Grok-2, xAI's current flagship LLM, has indeed hit the scene. But here's the twist: despite all the fanfare, Grok-2 still lags behind OpenAI's GPT-4o in real-world scenarios and benchmark tests (see our previous comparison).


But let's be clear: Grok-2 was trained on a relatively modest 15,000 GPUs, and it has already surpassed the original GPT-4 in raw capability. It just hasn't dethroned the mighty GPT-4o yet. That could change now that Colossus, touted as the "most powerful AI training system in the world", is fully operational.


Building a 100,000-GPU Cluster Is a Feat of Engineering, Especially for a Newcomer Like xAI


While Amazon, Microsoft, Google, and the usual AI giants have been busy trying to cobble together their own 100,000 GPU superclusters, none of them have come out and said, "Hey, we did it!" like xAI just did. And honestly, you can't really blame them. Building one of these behemoths isn't like putting together IKEA furniture. For starters, you need to fork out around $4 billion just for the servers. But wait, there's more! The state or country you're in also needs the datacenter capacity and power to support your supercluster, since the GPUs have to be co-located for high-speed chip-to-chip networking.
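
To make the power requirement concrete, here is a rough back-of-the-envelope sketch (our ballpark assumptions, not xAI's published figures): a single H100 draws on the order of 700 W at full load, and once you account for CPUs, networking, storage, and cooling overhead, a 100,000-GPU site lands in the triple-digit-megawatt range.

```python
# Back-of-the-envelope power estimate for a 100,000-GPU site.
# All figures are rough public ballpark numbers, not xAI specifications.

NUM_GPUS = 100_000
GPU_TDP_WATTS = 700       # approximate max draw of one H100 SXM module
OVERHEAD_FACTOR = 1.5     # assumed overhead for CPUs, networking, storage, cooling

gpu_power_mw = NUM_GPUS * GPU_TDP_WATTS / 1e6        # ~70 MW for the GPUs alone
facility_power_mw = gpu_power_mw * OVERHEAD_FACTOR   # ~105 MW for the whole facility

print(f"GPUs alone:     ~{gpu_power_mw:.0f} MW")
print(f"Whole facility: ~{facility_power_mw:.0f} MW")
```

In other words, finding a site with enough grid capacity is arguably a harder problem than writing the cheque for the chips, which is why a single Memphis facility able to host all 100,000 GPUs matters so much.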


According to SemiAnalysis, a well-respected semiconductor and AI research firm, keeping one of these 100,000 GPU clusters running will set you back a cool $131.9 million a year in the US alone.
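
To put that number in per-GPU terms, a quick bit of arithmetic on the quoted figure (our math, not SemiAnalysis's breakdown) works out to roughly 15 cents per GPU-hour just to keep the cluster running, before you even start amortizing the roughly $4 billion of hardware.

```python
# Spreading the quoted annual operating cost across every GPU-hour in a year.
# The $131.9M figure comes from SemiAnalysis; the division is our own arithmetic.

ANNUAL_OPEX_USD = 131.9e6   # quoted yearly running cost for a 100,000 GPU cluster (US)
NUM_GPUS = 100_000
HOURS_PER_YEAR = 24 * 365

cost_per_gpu_hour = ANNUAL_OPEX_USD / (NUM_GPUS * HOURS_PER_YEAR)
print(f"~${cost_per_gpu_hour:.2f} per GPU-hour in operating cost alone")
```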


But now that Colossus is fully up and running, xAI has managed to leap ahead of, or at least catch up with, the big players in the AI world. And it did so in a blistering 122 days. It's like fast-tracking the construction of a skyscraper in the time it takes most people to renovate a kitchen.


100,000-GPU Clusters to Become a New Baseline for Frontier Models


There are some who believe that AI has hit a bit of a snooze button since GPT-4 burst onto the scene. And, well, they’re not entirely wrong. The issue isn’t that we’ve lost our imagination; it’s that nobody has managed to throw a staggering amount of compute power at a single model since then. Most of the AI models we’ve seen lately have been floating around GPT-4’s level, largely because the muscle—computing power—behind them hasn’t grown all that much.


Take Google's Gemini Ultra, Nvidia's Nemotron 340B, or Meta's Llama 3 405B, for example. They all threw a ton of training compute (FLOPs, floating point operations) at the problem, on par with or even exceeding GPT-4. But here's the catch: they were built on slightly clunkier architectures, which left them falling short of unlocking any capabilities we hadn't already seen.


So, what’s next? The obvious next step is to train a colossal, multi-trillion parameter, multimodal transformer that can gobble up video, images, audio, and text all at once. No one has quite managed that yet, but the race is on. And with Colossus now online, xAI seems to be pulling ahead of the pack. What does that mean? Well, it suggests we might soon see a Grok-3 model that could finally break through the ceiling and surpass everything else out there, especially since it’ll likely be the first model trained on a 100,000 GPU supercluster.


But wait, there’s more! As if 100,000 GPUs weren’t enough, Elon Musk recently tweeted (because of course he did) that xAI plans to double the GPU count to 200,000. That includes a new batch of 50,000 NVIDIA H200 GPUs set to roll in over the next few months.



If we follow the logic that an LLM's capability scales with the amount of compute used to train it, does this mean xAI's Grok will soon be the leading frontier model? Quite possibly, if a successor (Grok-3.5, or whatever it ends up being called) is trained on 200,000 GPUs next year. For now, we shall see whether a Grok-3 trained on Colossus bears that trajectory out.
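
To give a rough sense of that trajectory, here is a scaling sketch using the common C ≈ 6 · N · D rule of thumb for dense transformer training compute. Every number below (parameter count, token count, per-GPU throughput, utilization) is an illustrative assumption of ours, not anything xAI has disclosed.

```python
# Rough training-time sketch using the common C ≈ 6 * N * D approximation
# for dense transformers. All inputs are illustrative assumptions, not xAI figures.

PARAMS = 2e12            # hypothetical 2-trillion-parameter model
TOKENS = 20e12           # hypothetical 20 trillion training tokens
FLOPS_PER_GPU = 1e15     # ~1 PFLOP/s of low-precision throughput per H100 (rough)
UTILIZATION = 0.4        # assumed model FLOPs utilization (MFU)

total_flops = 6 * PARAMS * TOKENS    # ~2.4e26 FLOPs for the whole run

def training_days(num_gpus: int) -> float:
    """Days needed to burn through total_flops at the assumed throughput."""
    effective_flops_per_sec = num_gpus * FLOPS_PER_GPU * UTILIZATION
    return total_flops / effective_flops_per_sec / 86_400

for gpus in (15_000, 100_000, 200_000):
    print(f"{gpus:>7,} GPUs -> ~{training_days(gpus):.0f} days")
```

Under those admittedly hand-wavy assumptions, a run that would occupy a 15,000-GPU cluster for well over a year fits into roughly two months on Colossus, and about a month once the cluster doubles to 200,000 GPUs. That is exactly the kind of headroom a multi-trillion-parameter multimodal model would need.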
