With 3.3M+ people watching the launch, Elon Musk and his team introduced the world to “Grok 3”, the most capable and powerful model by x.AI to date. The company, which started in 2023 and released its previous model (Grok 2) in 2024, is now challenging models by top companies like OpenAI, Google, and Meta that have been in the AI race for the last 5-7 years. All thanks to over 100K H100 NVIDIA GPUs! But DeepSeek, which also started its work in 2023, achieved o3-mini level capabilities with just a fraction of the GPUs that Grok 3 used! In this blog, we will explore whether Grok 3 was worth utilizing 100K+ H100 NVIDIA GPUs.
The NVIDIA H100 GPU is a high-performance processor built for AI training, inference, and high-performance computing (HPC). As the successor to the A100, it delivers faster processing, better efficiency, and improved scalability, making it a critical tool for modern AI applications. Leading AI companies and research institutions, including OpenAI, Google, Meta, Tesla, and AWS, rely on the NVIDIA H100 to develop cutting-edge AI solutions.
Also Read: Intel’s Gaudi 3: Setting New Standards with 40% Faster AI Acceleration than Nvidia H100
There are several reasons why major tech and AI companies around the world are investing in the NVIDIA H100 Chips:
100,000 H100 GPUs can break down massive problems (like training sophisticated AI models or running complex simulations) into many small tasks, and work on them all at once. This extraordinary parallel processing power means tasks that would normally take a very long time can be completed incredibly fast.
Imagine a simple task that takes 10 days to complete on a single H100 GPU. Now, let’s convert 10 days to seconds:
10 days = 10 × 24 × 3600 = 864,000 seconds
If the task scales perfectly, with 100,000 GPUs the time required would be:
Time = 864,000 seconds ÷ 100,000 = 8.64 seconds
So a job that would have taken 10 days on one GPU could, in theory, be completed in less than 10 seconds with 100K GPUs working together!
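To make the ideal-scaling arithmetic above concrete, here is a minimal Python sketch. The 10-day single-GPU runtime is the figure from the example; the serial_fraction parameter is an added assumption included only to show why real workloads never reach this ideal (Amdahl’s law), since communication and synchronization overheads always leave some work that cannot be parallelized.

```python
# Idealized scaling estimate for splitting one job across many GPUs.
# The 10-day example is from the text; serial_fraction is an illustrative
# assumption showing why perfect scaling never happens in practice.

def ideal_runtime_seconds(single_gpu_days: float, num_gpus: int) -> float:
    """Runtime if the work splits perfectly across num_gpus (no overhead)."""
    return single_gpu_days * 24 * 3600 / num_gpus

def amdahl_runtime_seconds(single_gpu_days: float, num_gpus: int,
                           serial_fraction: float) -> float:
    """Amdahl's law: the serial fraction of the work cannot be parallelized."""
    total = single_gpu_days * 24 * 3600
    return total * (serial_fraction + (1 - serial_fraction) / num_gpus)

print(ideal_runtime_seconds(10, 1))               # 864000.0 s (10 days)
print(ideal_runtime_seconds(10, 100_000))         # 8.64 s in the ideal case
print(amdahl_runtime_seconds(10, 100_000, 0.01))  # ~8648.6 s if just 1% is serial
```

Even with only 1% of the work being inherently serial, the theoretical 8.64-second runtime balloons to well over two hours, which is why real training runs still take days or weeks on huge clusters.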
Grok 3 is the successor to Grok 2, a model that did offer features like image generation on top of text but was, on the whole, subpar compared to the top models from OpenAI, Google, and Meta. For Grok 3, Elon Musk’s x.AI wanted to catch up with, and in fact beat, all the existing competitors in the field. So x.AI went big: it built a data center with over 100K GPUs and later expanded it to 200K GPUs. As a result, in less than a year, it has been able to create Grok 3 – a model capable of advanced reasoning, enhanced thinking, and deep research.
The performance difference between Grok 3 and Grok 2 clearly indicates this leap.
| Benchmark | Grok 2 mini (High) | Grok 3 (mini) |
|---|---|---|
| Math (AIME ’24) | 72 | 80 |
| Science (GPQA) | 68 | 78 |
| Coding (LCB Oct–Feb) | 72 | 80 |
Almost a 10-point jump across all major benchmarks, including Math, Science, and Coding! Impressive, right? But is it impressive enough to justify the computing power of 100K H100 GPUs?
Also Read: Grok 3 is Here! And What It Can Do Will Blow Your Mind!
When DeepSeek-R1 was launched, it took the world by storm! All major AI companies could feel the heat due to their falling stock prices and decreasing user base as people flocked towards the open source marvel that challenged OpenAI’s best of the best! But to do this, did DeepSeek-R1 use 100K GPUs?
Well, not even a fraction of it! DeepSeek-R1 was fine-tuned on top of the DeepSeek-V3 base model, and DeepSeek-V3 was trained on just 2,048 NVIDIA H800 GPUs. (The H800 is a China-specific variant of NVIDIA’s H100, designed to comply with U.S. export restrictions, with reduced chip-to-chip interconnect bandwidth.) This essentially means that DeepSeek-R1 was trained using roughly 2% of the GPUs that went into Grok 3.
As per the benchmarks, Grok 3 is significantly better than DeepSeek-R1 across all major fronts.
But is it true? Is Grok 3 truly better than DeepSeek-R1 and the rest of the other models as the benchmarks claim? Were 100K H100 GPUs really worth it?
Also Read: Grok 3 vs DeepSeek R1: Which is Better?
We will test Grok 3 against top models, including o1, DeepSeek-R1, and Gemini, to see how it performs on a variety of tasks. In each test, I will compare Grok 3 with a different model, judging them on the outputs I receive. I will be evaluating the models on three different tasks: generating a deep research report, advanced reasoning, and image-based analysis.
For each task, I will then pick the model whose output I find better.
Models: Grok 3 and Gemini 1.5 Pro with Deep Research
Prompt: “Give me a detailed report on the latest LLMs comparing them on all the available benchmarks.”
By Grok 3:
By Gemini 1.5 Pro with Deep Research:
| Criteria | Grok 3 (DeepSearch) | Gemini 1.5 Pro with Deep Research | Which is Better? |
|---|---|---|---|
| Coverage of LLMs | Focuses on 5 models (Grok 3, GPT-4o, Claude 3.5, DeepSeek-R1, and Gemini 2.0 Pro). | Covers a wider range of models, including Grok 3, GPT-4o, Gemini Flash 2.0, Mistral, Mixtral, Llama 3, Command R+, and others. | Gemini |
| Benchmark Variety | Math (AIME, MATH-500), Science (GPQA), Coding (HumanEval), and Chatbot Arena ELO score. | Includes all major benchmarks, plus multilingual, tool use, and general reasoning. | Gemini |
| Depth of Performance Analysis | Detailed benchmark-specific scores but lacks efficiency and deployment insights. | Provides broader performance analysis, covering both raw scores and real-world usability. | Gemini |
| Efficiency Metrics (Context, Cost, Latency, etc.) | Not covered. | Includes API pricing, context window size, and inference latency. | Gemini |
| Real-World Applications | Focuses only on benchmark numbers. | Covers practical use cases like AI assistants, business productivity, and enterprise tools. | Gemini |
Clearly, on every criterion, the report generated by Gemini 1.5 Pro with Deep Research was better: more inclusive and more comprehensive in its coverage of LLM benchmarks.
Models: Grok 3 and o1
Prompt: “If a wormhole and a black hole suddenly come near Earth from two opposing sides, what would happen?”
Response by Grok 3:
Response by o1:
| Criteria | Grok 3 (Think) | o1 | Which is Better? |
|---|---|---|---|
| Black Hole Effects | Simplified explanation, focusing on event horizon and spaghettification. | Detailed explanation of tidal forces, orbital disruption, and radiation. | o1 |
| Wormhole Effects | Briefly mentions stability and travel potential. | Discusses stability, gravitational influence, and theoretical properties. | o1 |
| Gravitational Impact on Earth | Mentions gravitational pull but lacks in-depth analysis. | Explains how the black hole dominates with stronger tidal forces. | o1 |
| Interplay Between Both | Speculates about a possible link between the black hole and wormhole. | Describes gravitational tug-of-war and possible wormhole collapse. | o1 |
| Potential for Earth’s Survival | Suggests the wormhole could be an escape route but is highly speculative. | Clearly states that survival is highly unlikely due to black hole’s forces. | o1 |
| Scientific Depth | More general and practical, less detailed on physics. | Provides a structured, theoretical discussion on spacetime effects. | o1 |
| Conclusion | Black hole dominates, and wormhole adds minor chaos. | Earth is destroyed by black hole forces. Wormhole’s role is uncertain. | o1 |
The result generated by o1 is better as it is more detailed, scientific, and well-structured compared to the result given by Grok 3.
Also Read: Grok 3 vs o3-mini: Which Model is Better?
Models: Grok 3 and DeepSeek-R1
Prompt: “What is the win probability of each team based on the image?”
Response by Grok 3:
Response by DeepSeek-R1:
| Criteria | Grok 3 | DeepSeek-R1 | Which is Better? |
|---|---|---|---|
| Win Probability (Afghanistan) | 55-60% | 70% | DeepSeek-R1 |
| Win Probability (Pakistan) | 40-45% | 30% | Grok 3 |
| Key Factors Considered | Includes historical trends, required run rate, team strengths, and pitch conditions. | Focuses on the final-over situation (9 runs needed, 2 wickets left). | Grok 3 |
| Assumptions Made | Considers Pakistan’s ability to chase 316 and Afghanistan’s bowling attack. | Assumes Afghanistan will successfully chase the target. | Grok 3 |
| Overall Conclusion | Afghanistan has a slight edge, but Pakistan has a reasonable chance depending on their chase. | Afghanistan is in a strong position, and Pakistan needs quick wickets. | Grok 3 |
Although the result given by DeepSeek-R1 was more accurate, Grok 3 gave a brilliant assessment of the match based on the image.
Now that we’ve seen how Grok 3 performs against competitors in various tasks, the real question remains: Was the massive investment in over 100K H100 GPUs justified?
While Grok 3 has demonstrated significant improvements over its predecessor and outperforms some models in specific areas, it does not consistently dominate across the board. Other models, such as DeepSeek-R1 and OpenAI’s o1, achieved similar or superior results while using significantly fewer computational resources.
Beyond the financial investment, powering and cooling a data center with 100K+ H100 GPUs comes with a massive energy burden. Each H100 GPU consumes up to 700W of power under full load. That means the GPUs alone can draw on the order of 70 megawatts at peak (100,000 × 700 W = 70 MW), before even accounting for cooling and networking overhead.
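As a rough back-of-the-envelope sketch of that energy burden: the 700 W per-GPU figure is from the paragraph above, while the PUE, electricity price, and training duration below are illustrative assumptions, not reported numbers.

```python
# Back-of-the-envelope power and electricity-cost estimate for a 100K-GPU cluster.
# 700 W per H100 is from the article; PUE, price per kWh, and training duration
# are illustrative assumptions only.

NUM_GPUS = 100_000
WATTS_PER_GPU = 700        # peak draw of an H100 under full load
PUE = 1.3                  # assumed data-center overhead (cooling, networking)
PRICE_PER_KWH = 0.08       # assumed electricity price in USD
TRAINING_DAYS = 90         # assumed length of a training run

peak_mw = NUM_GPUS * WATTS_PER_GPU / 1e6        # GPUs alone: ~70 MW
facility_mw = peak_mw * PUE                     # including overhead: ~91 MW
energy_mwh = facility_mw * 24 * TRAINING_DAYS   # energy over the whole run
cost_usd = energy_mwh * 1_000 * PRICE_PER_KWH   # MWh -> kWh -> dollars

print(f"Peak GPU draw: {peak_mw:.0f} MW")
print(f"Facility draw at PUE {PUE}: {facility_mw:.0f} MW")
print(f"Energy over {TRAINING_DAYS} days: {energy_mwh:,.0f} MWh")
print(f"Estimated electricity cost: ${cost_usd:,.0f}")
```

Under these assumed numbers, the electricity bill alone for a single 90-day run comes to roughly $15–16 million, before any hardware or maintenance costs.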
Grok 3’s energy-intensive approach may not be the most sustainable. OpenAI and Google are now focusing on smaller, more efficient architectures and energy-optimized training techniques, while x.AI has chosen brute-force computation.
Training AI models at scale is an expensive endeavor—not just in terms of hardware but also power consumption and operational costs.
By comparison, companies like OpenAI and Google optimize their training pipelines by employing mixture-of-experts (MoE) models, retrieval-augmented generation (RAG), and fine-tuning techniques to maximize efficiency while minimizing compute costs.
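To illustrate why a mixture-of-experts layer is cheaper per token than a dense layer of the same total size, here is a toy top-k gating router in plain NumPy. The layer sizes, expert count, and gating rule are simplified assumptions for illustration only and do not describe any particular company’s implementation.

```python
import numpy as np

# Toy mixture-of-experts layer: a gate picks the top-k experts per token,
# so only a fraction of the layer's parameters are used for each token.
# Sizes and k are illustrative assumptions, not a production configuration.

rng = np.random.default_rng(0)
D, H, NUM_EXPERTS, TOP_K = 64, 256, 8, 2

gate_w = rng.normal(size=(D, NUM_EXPERTS))
experts = [(rng.normal(size=(D, H)), rng.normal(size=(H, D)))
           for _ in range(NUM_EXPERTS)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a single token (shape [D]) through its top-k experts."""
    logits = x @ gate_w
    top = np.argsort(logits)[-TOP_K:]                          # chosen expert indices
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0) @ w_out)           # simple ReLU MLP expert
    return out

token = rng.normal(size=D)
y = moe_forward(token)
print(y.shape, f"{TOP_K}/{NUM_EXPERTS} experts active for this token")
```

Because only 2 of the 8 experts run for each token here, only a quarter of the expert parameters are active per token; that sparsity is the basic source of MoE’s compute savings during training and inference.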
Meanwhile, open-source communities are demonstrating that high-quality AI models can be built with significantly fewer resources. DeepSeek-R1, which challenges industry leaders despite being trained on just 2,048 H800 GPUs, is a prime example of this.
Hence, the development of a model like Grok 3 raises major concerns: the sheer financial cost of the hardware, the energy needed to power and cool it, and whether brute-force computation delivers a performance edge large enough to justify either.
Grok 3 marks a significant leap for x.AI, demonstrating notable improvements over its predecessor. However, despite its 100K+ H100 GPU infrastructure, it failed to consistently outperform competitors like DeepSeek-R1, o1, and Gemini 1.5 Pro, which achieved comparable results with far fewer resources.
Beyond performance, the energy and financial costs of such massive GPU usage raise concerns about long-term sustainability. While x.AI prioritized raw power, rivals are achieving efficiency through optimized architectures and smarter training strategies.
So, were the 100K GPUs worth it? We don’t think so, at this point. If Grok 3 can’t consistently dominate, x.AI may need to rethink whether brute-force computation is the best path forward in the AI race.
Q. What is Grok 3?
A. Grok 3 is x.AI’s latest LLM, capable of tasks like advanced reasoning, deep research, and coding.
Q. Why did x.AI use over 100K NVIDIA H100 GPUs for Grok 3?
A. x.AI used 100K+ NVIDIA H100 GPUs to accelerate Grok 3’s training and improve its reasoning, research, and problem-solving abilities.
Q. How much does it cost to train a model on 100K GPUs?
A. The estimated cost of training and running 100K GPUs includes millions of dollars in hardware, energy consumption, and maintenance costs.
Q. How many GPUs did DeepSeek-R1 use?
A. DeepSeek-R1 was trained on just 2,048 GPUs but achieved competitive results. This shows that efficient AI training techniques can rival brute-force computation.
Q. Do more GPUs automatically mean a better model?
A. While more GPUs speed up training, AI companies like OpenAI and Google use optimized architectures, mixture-of-experts (MoE), and retrieval-augmented generation (RAG) to achieve similar results with fewer GPUs.
Q. Did Grok 3 outperform its competitors?
A. Despite using massive computational resources, Grok 3 did not consistently outperform competitors. Moreover, it struggled in tasks like advanced reasoning and deep search analysis.
Q. So, were the 100K GPUs worth it?
A. While Grok 3 is a powerful AI model, the high cost, energy consumption, and performance inconsistencies suggest that a more efficient approach may have been a better strategy.