
GPT-5.3 CodeX: The First AI That Feels Like a Real Colleague

Jomar Montuya
February 7, 2026
10 minute read

OpenAI just dropped GPT-5.3 CodeX, and after weeks of using it across dozens of projects, I can tell you this: It's the first AI coding model that actually feels like a colleague, not a tool.

Not because it's smarter (it is), but because it communicates better, works longer without hand-holding, and makes decisions I would make myself.

Here's what the early access experience looks like.

The Problem With Previous Models

Before 5.3, AI coding models had a pattern:

  1. Start task
  2. Go silent for minutes
  3. Deliver result or crash

You had no idea what was happening until it was done. If it went wrong, you wasted time and had to restart from scratch.

Claude Opus 4.5 was better at explaining what it was doing, but 5.2 Codex? Complete silence. You'd paste a task, watch it spin, and hope for the best.

This is fine for small tasks. But for complex work—migrations, refactors, multi-hour projects—it's anxiety-inducing. Is it working? Did it get confused? Should I kill it?

You never knew until it was too late.

What Makes 5.3 Different

5.3 CodeX solves this in three ways:

1. It Tells You What It's Doing

It gives you context in real time. Not just "I'm working," but specific details:

  • "I'm looking at paths that involve these keywords"
  • "Found this thing that's calling something it shouldn't"
  • "I'm going to test this theory by doing X"
  • "Here's what I think is happening and why"

This isn't just nice—it's practical. You can steer it when it's going the wrong direction before it's too late. You can say, "Actually, don't touch that package," and it adjusts.

This is where the "send now" flow comes in. Cursor and Claude Code both let you send a message immediately to interrupt or redirect the current work. 5.3 actually makes that useful because you have something to redirect it from.

2. It Works Longer Without Hand-Holding

Previous models had a 5-hour limit. Meaningful tasks got stuck at the halfway mark.

5.3 has runs that stay on track for 8+ hours. The early access tester who reviewed this migrated a 9-year-old codebase from ancient versions of Next.js, React, Prisma, and Tailwind—literally hundreds of breaking changes.

5.3 did it almost all in one shot. It encountered breaking changes, deprecated packages, syntax mismatches, and handled them strategically.

What's impressive isn't that it could do the work. It's that it did it in one pass where 5.2 Codex would have needed 20+ manual interventions.

3. It Makes Strategic Decisions

This is where 5.3 really shines. It doesn't just patch things randomly like 5.2 did.

When it needs to upgrade Next.js, that upgrade might break tRPC bindings. 5.2 would try to force-migrate everything at once and spiral into dependency hell.

5.3 patches the old package to make the upgrade work, finishes the upgrade, then comes back and deprecates the old package.

That's how a senior engineer thinks. Not "fix everything at once," but "make the necessary changes in the right order."

The tester who used this called it "one of the most impressive things I've seen a model do." Previous models, which treat every fix as its own isolated box, don't sequence work this way. 5.3 does.
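
To make that ordering concrete, here's a minimal sketch of the interim-shim pattern in TypeScript. This is my illustration, not the model's actual output, and every name in it is a hypothetical placeholder: wrap the legacy API behind one small adapter so the framework upgrade can land first, then delete the adapter in a follow-up pass.

    // compat/legacy-fetch.ts
    // Hypothetical throwaway adapter kept only for the duration of the upgrade.
    // New code imports fetchUser from here; once the legacy package is removed,
    // this file is deleted and call sites point at the new client directly.

    // Stand-in for the legacy client's callback-style API (purely illustrative).
    function legacyGetUser(
      id: string,
      cb: (err: Error | null, data?: { id: string; name: string }) => void
    ): void {
      cb(null, { id, name: "placeholder" });
    }

    // Modern promise-based surface that the upgraded codebase expects.
    export function fetchUser(id: string): Promise<{ id: string; name: string }> {
      return new Promise((resolve, reject) => {
        legacyGetUser(id, (err, data) => (err || !data ? reject(err) : resolve(data)));
      });
    }

The point isn't the code itself; it's the sequencing: make the smallest patch that unblocks the upgrade, land it, then circle back and remove the crutch.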

Real-World Impact: What It Actually Built

The tester built a production-ready OAuth solution called Sho.dev using 5.3 CodeX. Every line of code was written by the model, and they deployed it to Railway.

They also migrated the ping.gg codebase mentioned above—a project they described as "my favorite task to throw at new models knowing they couldn't do it."

5.3 did it.

This isn't theoretical. It's actual shipping work.

The Autonomy Gap

Let me show you the difference with a concrete example.

The tester ran a codebase audit with 5.2 Codex:

  • Time: Claimed 6 minutes, actually took significantly longer
  • Context: Told you nothing until it was done
  • Accuracy: Did okay work but left questions unanswered

With 5.3:

  • Time: Faster overall despite doing more work
  • Context: Kept you informed throughout
  • Accuracy: Better because it could self-correct mid-work

But the real difference is the autonomy. With 5.2, you wait passively. With 5.3, you engage proactively—you can stop it, redirect it, add context, and watch it adjust.

It feels less like a black box and more like working with someone.

The Comparison: 5.2 vs 5.3

Here's what the early access tester found:

  Metric     | 5.2 Codex          | 5.3 CodeX
  Speed      | Baseline           | 2x faster
  Tool calls | ~12 per task       | ~5.5 per task
  Autonomy   | Set-and-come-back  | Interactive
  Context    | Zero until done    | Continuous
  Patches    | Aggressive, risky  | Strategic, minimal
  Long tasks | 5-hour max         | 8+ hours

5.3 uses fewer tool calls but gets more done. It's more efficient with tokens and gives you better feedback throughout.

The "Send Now" Workflow

This is the UX upgrade that matters:

  1. Task runs → 5.3 tells you what it's doing
  2. Wrong path? → Hit "send now" → Redirect it
  3. Going well? → Wait or check in later
  4. Done? → It tells you and cues follow-ups

You're not paralyzed waiting for completion. You're actively collaborating.

The Benchmarks

OpenAI is claiming 5.3 is state-of-the-art on SWE-bench Pro, Terminal-Bench, OSWorld, and GDPval—some of the most rigorous coding and agent benchmarks.

I don't take benchmark claims at face value anymore. Too many times, the "best model" doesn't ship better work.

But here's what I can tell you from real use:

  • SWE-bench Pro: Believable given what it's doing with large codebases
  • Terminal-Bench: Makes sense—its bash skills are significantly better
  • GDPval: Hard to verify without API access (more on this below)

The early access tester called it "significantly more autonomous" than Claude 4.5 and said they were "enjoying watching it work." That's rare praise for an AI model.

The API Access Problem

Here's the frustration: I can't verify these benchmarks.

OpenAI hasn't given API access to 5.3 outside of CodeX. You can't run it through the API, you can't benchmark it yourself, and you can't integrate it into your own tooling.

They're telling us it's SOTA on X, Y, and Z—but won't let us confirm it.

This matters because OpenAI's own benchmarks have been... generous to OpenAI models in the past. The early access tester sat next to a researcher who had been critical of 5.1 and 5.2, and that researcher was calling 5.3 "the best."

So I have reason to be optimistic, but I also have reason to want to verify it myself.

The Refusal Problem

This is a real trade-off with the better reasoning.

The tester wanted to add a keyboard restoration feature to a site—something competitors like Cursor support. You press a key, you go back to the page you came from. Simple navigation feature.

5.3 CodeX refused: "This could be against terms of service. This could be trademark problems. You can't do that."

5.2 Codex? No issues. Worked fine.

This is OpenAI being overly conservative. The feature isn't illegal—it's standard UX that every other developer tool supports.

I get why safety matters. But this level of conservativeness hurts real-world use cases. You're trying to build a thing, and the model refuses something harmless.

OpenAI needs to find a better balance. Safety is important, but so is not getting in the way of harmless features.

The Missing Reasoning Tokens

This is another frustration.

Anthropic shows you the full reasoning trace. You can see what the model is thinking, step by step.

OpenAI hides it. All that reasoning happens on their servers, and you only get the output.

Why does this matter? Two reasons:

  1. Continuity: If you switch models, you lose all that thinking. If you want to continue a conversation from GPT-5.3 to Claude, all the reasoning that happened is gone. You're starting fresh.

  2. Debugging: When the model does something weird, you can't see why. You just get the result and have to guess what went wrong.

Anthropic giving you the reasoning trace is a big advantage. OpenAI should do the same.

What About Design Work?

5.3 isn't great at design. The early access tester had it generate homepages, and both 5.2 and 5.3 "sucked."

It can handle browser automation, 3D rendering, and web dev decently well. But making things look good? Still not there.

What's interesting is that 5.3 has a "front-end skill" that can call Claude Opus 4.5/4.6 for design work.

This is the future: different models calling each other for what they're best at. 5.3 handles the code, Claude handles the design, and they collaborate.
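
Here's a toy TypeScript sketch of what that delegation pattern looks like in the abstract. None of this reflects how the CodeX skill is actually wired; the model-call functions are stand-ins you'd replace with real API clients:

    // Toy dispatcher: route each subtask to whichever model is strongest at it.
    type Subtask = { kind: "code" | "design"; prompt: string };

    // Stand-ins for real model calls (hypothetical; swap in actual API clients).
    async function runCodeModel(prompt: string): Promise<string> {
      return `// implementation for: ${prompt}`;
    }

    async function runDesignModel(prompt: string): Promise<string> {
      return `/* design spec for: ${prompt} */`;
    }

    export async function runTask(tasks: Subtask[]): Promise<string[]> {
      // Fan the subtasks out in parallel, each to the right specialist.
      return Promise.all(
        tasks.map((t) => (t.kind === "design" ? runDesignModel(t.prompt) : runCodeModel(t.prompt)))
      );
    }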

I'm excited to see where this goes.

The "Plan Mode" Feature

This is hidden under the Shift+Tab menu, and it's powerful.

Instead of just executing, 5.3 plans everything out first. It creates a really thorough plan, then goes and executes.

The tester used this for a migration and it worked beautifully. The model thought through dependencies, potential issues, and the right order to do things, then executed.

This is how a senior engineer approaches work—think, then do.

On reasoning effort, I've been using "medium" as my daily driver, but I'm finding "high" is often the better sweet spot. "Extra high" can sometimes overthink and gaslight itself.

What This Means For You

For Solo Developers

5.3 CodeX is a force multiplier. You can take on larger projects, tackle migrations you'd avoid, and move faster.

The autonomy means you're not babysitting the AI. You're steering it, which is how work should feel.

For Teams

The real benefit here is code review and consistency. 5.3 doesn't just write code—it writes code that follows patterns.

It can self-correct, patch strategically, and handle complex migrations without creating tech debt.

For Agencies

This is what Medianeth is about: one engineer who can design, build, and ship.

5.3 CodeX accelerates that. We can take on more clients, deliver faster, and maintain quality.

My Take After Weeks of Use

Is 5.3 CodeX "the best AI coding model ever made"?

For autonomy and communication? Yes.

For raw intelligence? Probably—it's performing as well as or better than anything else.

But here's the thing: I still use Claude 4.5/4.6 a lot. Why?

Because it's nicer to work with.

5.3 is incredibly capable, but it's also more serious, more clinical. Claude feels more like a person.

Which one you use depends on the task. For serious, complex work? 5.3. For general coding and brainstorming? Claude.

What I'd Change

If OpenAI is listening:

  1. Give API access. Let us verify benchmarks and integrate this into our own tools.
  2. Show reasoning traces. Don't hide what the model is thinking.
  3. Loosen refusals. Stop blocking harmless features like keyboard restoration.
  4. Release a mini/nano model. GPT-5 Mini and Nano are outdated.

Do that, and 5.3 goes from "great" to "unbeatable."

The Bottom Line

GPT-5.3 CodeX is the first AI coding model that actually feels like a colleague.

It communicates what it's doing, makes strategic decisions, and works autonomously for hours at a time.

Is it perfect? No. Design work still needs help, refusals are too conservative, and the API access situation is frustrating.

But for shipping real code on complex projects? It's in a league of its own.

I'll keep using it until something better comes along.


Bottom Line: 5.3 CodeX isn't just smarter—it works with you differently. You're not waiting for output anymore. You're collaborating. And that changes everything.

About Jomar Montuya

Founder & Lead Developer

With 8+ years building software from the Philippines, Jomar has served 50+ US, Australian, and UK clients. He specializes in construction SaaS, enterprise automation, and helping Western companies build high-performing Philippine development teams.

Expertise:

  • Philippine Software Development
  • Construction Tech
  • Enterprise Automation
  • Remote Team Building
  • Next.js & React
  • Full-Stack Development
