Grok 4: I Wanted to Believe

— by Mate Gelei-Szego

With legions of impatient X users standing guard, the release of xAI’s Grok 4 has finally happened. But man, the number of shills (sorry: enthusiastic, real-life users) reposting the same 10–15 use cases was off the charts. Coincidentally, this set of scenarios was full of self-contained, straightforward, close-ended problems. I desperately missed a demonstration of this PhD-level LLM working on (or, even better, solving) a real-life problem that has no “correct” solution.

And then there was the tweet about copy-pasting whole codebases into the Grok chat window… Anyone who has written any code beyond a simple Hello World knows that “whole codebases” don’t usually fit in a single file and, more importantly, don’t usually fit in the surprisingly small context window of Grok 4.

So, what is the current state of Grok 4 (the standard version, not the “Heavy” model)? Can it move beyond clever demos and perform meaningful, complex work?

A New Benchmark for Raw Intelligence

There’s no denying that Grok 4 is a powerful model. On paper, it sets a new standard for raw intelligence, outperforming competitors on several difficult benchmarks. xAI claims it demonstrates PhD-level proficiency in disciplines like math, science, and reasoning. One of the most impressive results came from “Humanity’s Last Exam,” a benchmark of who knows how many “PhD-level” questions, where Grok 4 (with tools) successfully solved nearly 39% of the problems.

Its core features include (generated by Grok 3, reworked by my humble self):

  • A large context window: 128,000 tokens in the app and 256,000 tokens through its API. While that is roughly 85% smaller than the context windows of the largest mainstream models, it still provides some room for handling complex information.
  • Native tool use: It integrates real-time web search and has a code execution sandbox, allowing it to look up current information and test its own code suggestions.
  • Structured outputs: The model can generate structured data, which is useful for development and data analysis tasks.
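To make the structured-outputs bullet concrete: xAI’s API follows the familiar OpenAI-compatible chat-completions shape, so a structured-output request can presumably be expressed as a payload with a JSON-schema response format. This is a minimal sketch under that assumption; the model identifier, schema, and field names here are my own illustrations, not taken from xAI’s docs.

```python
# Hypothetical sketch of a structured-output request, assuming an
# OpenAI-compatible chat-completions payload. All names are assumptions.
def build_structured_request(prompt: str) -> dict:
    """Build a chat-completions payload asking for schema-constrained JSON."""
    schema = {
        "type": "object",
        "properties": {
            "language": {"type": "string"},
            "loc": {"type": "integer"},
            "summary": {"type": "string"},
        },
        "required": ["language", "loc", "summary"],
    }
    return {
        "model": "grok-4",  # assumed model identifier
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "file_report", "schema": schema},
        },
    }

payload = build_structured_request("Summarize this file's purpose")
```

Constraining the response to a schema like this is what makes the output usable in development and data-analysis pipelines without brittle string parsing.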

Can It Handle a Real Codebase?

No.

This is the central question for many developers. The claim that Grok 4 can “handle entire codebases” needs clarification. You cannot paste a project with hundreds of files into the context window at once; the total size would be far too large.

However, this doesn’t mean it’s useless for large projects. Instead of feeding it the entire codebase, a more effective workflow involves:

  1. Initial Analysis: Asking Grok 4 to analyze the project’s file structure to identify the most relevant files for a specific task.
  2. Targeted Context: Loading only those key files into the context for the actual work.
  3. Iterative Development: Using its capabilities for generating modular code, scaffolding new components, and understanding dependencies across the files you’ve provided.
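The targeted-context step above can be sketched in a few lines: list the candidate files, estimate their token cost, and greedily load them until the budget runs out. This is a rough illustration, not an official tool; the ~4-characters-per-token ratio is a common rule of thumb, not Grok’s actual tokenizer, and the function names are mine.

```python
import os

# Grok 4's API context window per xAI; leave headroom for the prompt in practice.
CONTEXT_BUDGET = 256_000

def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token (rule of thumb)."""
    return len(text) // 4

def pick_files(root: str, relevant: list[str], budget: int = CONTEXT_BUDGET) -> list[str]:
    """Greedily load the files flagged as relevant until the budget runs out."""
    chosen, used = [], 0
    for name in relevant:
        path = os.path.join(root, name)
        with open(path, encoding="utf-8", errors="replace") as f:
            cost = estimate_tokens(f.read())
        if used + cost > budget:
            break  # overflow: iterate with a fresh context instead
        chosen.append(name)
        used += cost
    return chosen
```

The `relevant` list would come from step 1, i.e. from asking the model (or grep) which files matter for the task; the loop simply enforces the budget from step 2.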

What a shame there’s no VS Code integration, where most of the development work happens nowadays. I know, I should use Cursor. Whatever.

That said, xAI has acknowledged that the model can underperform in areas like coding and UI mockups compared to some competitors. The company has announced that a dedicated coding-specialist model is planned for release, signaling that the current Grok 4 is a generalist, not a coding-specific expert. Which is fine, I suppose.

On a related note, image generation has the same issue: it’s not released yet. For now you’re stuck with the current generation, Aurora, which is far too liberal about the number of limbs and fingers humans may have.

The Unvarnished Truth: Limitations and Controversies

A hype-free assessment of Grok 4 must also acknowledge its significant limitations and the controversies surrounding it.

  • Training Data Transparency: If I understand correctly, as of mid-July 2025, xAI has not disclosed what data was used to train Grok 4. This is somewhat concerning, given that before the release there was an influx of X posts about deliberately curating and rewriting human knowledge.
  • Hallucinations and Bias: You’d think these are common to all LLMs, wouldn’t you? Well, guess what: this model goes the extra mile and has been shown to consult Elon Musk’s social media posts and reflect his views when answering controversial questions.
  • Content Moderation Failures: Grok has faced significant public backlash for generating harmful content. Shortly before the release of Grok 4 came the “MechaHitler” incident. I’m putting on my tin foil hat and betting it was a deliberate PR stunt. Change my mind.
  • Promised, Not Delivered: Full multimodal support, like the ability to process images as inputs, is planned but not yet delivered. The current model can generate basic visuals but cannot “see”, which leads to amusing failures when parsing a PDF, for example.

The Verdict: A Powerful but Flawed Tool

So, can Grok 4 do real work? The answer is a qualified yes. No. Maybe.

It is not a magical assistant that will understand your entire thousand-file project instantly. Using it effectively for complex development work requires skill, careful context management, and a good understanding of its limitations. Essentially, you’ll face more or less the same cognitive load; instead of concentrating on the code, you concentrate on the LLM. But isn’t that the case for every other model? Probably so.

It is also not a polished, general-purpose assistant for everyday tasks; more importantly, I don’t know what I would use it for in real life, but that might be my own shortcoming.

Ultimately, Grok 4 still seems to be a work in progress. I think the release was premature, but xAI cranked up the hype so much that they had to release something.