If you've been following AI model releases lately, you've seen the headlines: Gemini 3.1 Pro is the smartest model ever built. It's scoring higher than GPT-5, crushing Claude, and doing things that were supposed to be impossible.
But here's the thing nobody's talking about: the smartest model isn't always the most useful one.
After weeks of using Gemini 3.1 Pro in production, I've hit a wall that feels like we're traveling back in time to 2024. The gap between what this model knows and what it can actually do is massive.
Let's break down what's happening—and what it means for how you should think about AI in production.
Google is absolutely destroying benchmarks right now. On every metric they publish, Gemini 3.1 Pro is winning.
The model is objectively smarter than anything we've seen. It knows more information, hallucinates less, and understands complex spatial reasoning (like skateboarding tricks) better than competitors.
But here's the problem: benchmarks don't measure what you actually care about in production.
When you're building production systems with AI, you don't care if a model can name skateboarding tricks. You care about reliable tool calls, staying on task without looping, efficient use of context, and finishing long tasks without supervision.
On all of these fronts, Gemini 3.1 Pro fails spectacularly.
The model seems to cycle randomly through three distinct states when calling tools, and you never know which one you'll get.
I've never seen any other modern model fail basic tool calls this consistently. Even open-source models from last year are more reliable.
This one's so bad that Google literally built a "potential loop detected" hook into their CLI. The model gets stuck in repetitive behaviors so often that they had to add detection for it.
Imagine you're migrating a database schema—a task that dozens of other models can handle easily. Gemini 3.1 Pro gets stuck, thinks it's in a loop, and just stops. You have to manually intervene to get it back on track.
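How Google's "potential loop detected" hook actually works isn't public, but a loop check can be as simple as watching for the same short sequence of agent actions repeating. A minimal sketch, with all names, window sizes, and thresholds made up for illustration:

```python
from collections import deque


class LoopDetector:
    """Flags a potential loop when the agent repeats the same short
    action pattern. Hypothetical sketch, not Google's implementation."""

    def __init__(self, window: int = 3, threshold: int = 2):
        self.window = window        # length of the action pattern to watch
        self.threshold = threshold  # extra repeats before flagging a loop
        self.history: deque = deque(maxlen=window * (threshold + 1))

    def record(self, action: str) -> bool:
        """Record an agent action; return True if a loop is suspected."""
        self.history.append(action)
        if len(self.history) < self.window * (self.threshold + 1):
            return False
        acts = list(self.history)
        pattern = acts[-self.window:]
        # The pattern must repeat `threshold` more times immediately before.
        for rep in range(1, self.threshold + 1):
            start = -(rep + 1) * self.window
            end = -rep * self.window
            if acts[start:end] != pattern:
                return False
        return True
```

The trade-off is exactly the failure described above: a threshold tight enough to catch real loops will also fire on legitimately repetitive work, like applying the same migration step to many tables.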
The model seems hardcoded to read only 100 lines of a file at a time. I've watched it read lines 1-100, then 101-200, then 201-300 on the same file. It's agonizingly slow, and it burns through your token budget unnecessarily.
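The waste compounds because each extra read call re-sends the conversation so far, previously fetched chunks included. A toy cost model makes the shape of the problem visible (all token counts are illustrative, not measured):

```python
def chunked_read_cost(total_lines: int, chunk_lines: int,
                      tokens_per_line: int = 12) -> int:
    """Toy model: every read call re-sends the whole conversation so far
    (prior chunks included), so cost grows quadratically in the number
    of chunks. Numbers are made up for illustration."""
    calls = -(-total_lines // chunk_lines)  # ceiling division
    chunk = chunk_lines * tokens_per_line   # tokens returned per call
    cost, context = 0, 0
    for _ in range(calls):
        cost += context + chunk   # prompt history + the new chunk
        context += chunk          # the chunk joins the history
    return cost


# One read of a 1,000-line file vs. ten 100-line reads:
single = chunked_read_cost(1000, 1000)
windowed = chunked_read_cost(1000, 100)
```

Under these made-up numbers, reading a 1,000-line file in 100-line windows costs more than five times the tokens of a single read, before counting the latency of ten round trips.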
This is the core issue: Intelligence ≠ Competence.
Gemini 3.1 Pro is like stuffing infinite intelligence into a model from 2022. It has all the knowledge in the world, but it doesn't know how to behave as an agent.
Claude 4.5 Haiku is a perfect counterexample. On intelligence benchmarks, it scores a 37 (basically a toy model). But it never fails at tool calls. If you give it a tool and explain how to use it, it will use it correctly every single time.
Haiku is "dumber" but competent. Gemini 3.1 Pro is brilliant but unreliable.
The pattern here is clear: Google is "benchmaxing"—optimizing purely for benchmark scores while ignoring real-world usability.
Other labs (OpenAI, Anthropic) seem to be training their models on agent-style chat histories. They generate thousands of synthetic coding sessions where models successfully complete tasks, then use those trajectories for reinforcement learning. The result: models that work well in production.
Google doesn't appear to be doing this. Or if they are, it's not working.
The evidence shows up in the METR eval, which measures how long a task a model can complete autonomously.
Gemini gets confused and lost when given tasks longer than a few minutes.
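METR's real methodology fits curves over human-calibrated task times; a toy version of the idea just finds the longest task length at which the success rate still holds at 50%. The data below is invented purely to illustrate the metric:

```python
from collections import defaultdict


def time_horizon_50(results):
    """Toy 50% time horizon: bucket task attempts by length in minutes,
    compute the success rate per bucket, and return the longest length
    whose rate stays at or above 50%. Not METR's actual fitting method."""
    buckets = defaultdict(list)
    for minutes, succeeded in results:
        buckets[minutes].append(succeeded)
    horizon = 0.0
    for minutes in sorted(buckets):
        rate = sum(buckets[minutes]) / len(buckets[minutes])
        if rate >= 0.5:
            horizon = minutes
        else:
            break
    return horizon
```

The point of a metric like this is exactly what benchmarks miss: it rewards staying coherent over a long trajectory, not answering one hard question.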
Here's the thing: there are times when pure intelligence wins.
If you need a model that knows obscure facts, understands complex spatial reasoning, or can generate perfect SVG animations from scratch, Gemini 3.1 Pro is unmatched. I've seen it do things with 3D reasoning that literally no other model can touch.
For example, I built a whole app with Gemini 3.1 Pro where models play Quiplash against each other. The jokes it generated were genuinely funny—surprisingly so. Other models found Gemini's responses the funniest, which I never expected from Google.
Similarly, on the Convex AI leaderboard, when given clear guidelines, Gemini 3.1 Pro scores a 95%—better than any other model. It understands the framework perfectly when you give it rules.
The pattern is clear: When you can constrain the model's behavior with strong system prompts or guardrails, its intelligence shines through. When you leave it to figure things out on its own, it flails.
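In practice, "constraining the model" can be as simple as validating every tool call against a whitelist before executing it, and bouncing rejections back to the model to retry. A sketch with a made-up tool schema (nothing here is a real API):

```python
import json

# Hypothetical whitelist: tool name -> allowed argument names.
ALLOWED_TOOLS = {"read_file": {"path"}, "run_query": {"sql"}}


def validate_tool_call(raw: str) -> dict:
    """Guardrail: reject any tool call that isn't valid JSON, names an
    unknown tool, or passes unexpected arguments. Schema is illustrative."""
    call = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    tool, args = call["tool"], call.get("args", {})
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    extra = set(args) - ALLOWED_TOOLS[tool]
    if extra:
        raise ValueError(f"unexpected args: {sorted(extra)}")
    return call
```

In a real loop you would catch the exception, feed the error message back to the model, and ask it to try again; a smart model behind a strict validator behaves far more predictably than the same model left to freestyle.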
If you're building AI systems in production, here's how I'd think about this:
Stop looking at raw intelligence scores. For production work, ask whether the model calls tools reliably, recovers from errors instead of looping, and can run long tasks without babysitting.
The smartest model is useless if the tooling around it is broken. Google's CLI is legitimately unusable—it randomly switches models, hides reasoning traces, and provides useless summaries.
Good tooling matters as much as the model itself. Cursor, Replit, and others are doing a better job making models usable than Google is.
Be skeptical when a lab claims their model is "the best ever" based purely on benchmark scores. Ask to see it work on real tasks—long ones, with tools, in actual production environments.
Google has the most intelligent model we've ever seen. What they don't have is a usable one.
The solution isn't to dumb down the model. It's to invest in agentic post-training, dependable tool calling, and tooling that matches the model's intelligence.
Until then, Gemini 3.1 Pro will remain a fascinating research project—not something you want running your production systems.
When you're choosing an AI model for your business, don't let benchmark scores blind you. The smartest model isn't always the best one for the job.
Competence beats intelligence every time in production.
Gemini 3.1 Pro wins every benchmark it enters. But when it comes to getting actual work done? You're better off with a "dumber" model that knows how to follow instructions reliably.
The real question isn't "Which model is smartest?" It's "Which model will actually ship my feature without me babysitting it?"
Want to build AI systems that work in production, not just on benchmarks? Let's talk about how Medianeth can help you build reliable AI solutions that ship.
Founder & Lead Developer
With 8+ years building software from the Philippines, Jomar has served 50+ US, Australian, and UK clients. He specializes in construction SaaS, enterprise automation, and helping Western companies build high-performing Philippine development teams.
Ready to make your online presence shine? I'd love to chat about your project and how we can bring your ideas to life.
Free Consultation