I've been running Ken AI for two years now. We send over a million cold emails a month. We've built custom AI systems, trained models on client voices, replicated Gmail's spam algorithm.
I'm not new to AI.
But last Sunday humbled me.
Here's what happened.
The Problem That Started It All
Every day I spend about 4 hours on sales and marketing. Creating LinkedIn posts. Updating leads in the CRM. Prepping for sales calls. Writing follow-up emails. Replying to cold email responses. Repurposing content across platforms.
It's the kind of work that's important but repetitive. The kind of work that makes you think "an AI should be doing this."
So I decided to build a system. A full go-to-market engine that handles all of it:
Content creation (LinkedIn, newsletters, X)
CRM updates and pipeline management
Meeting prep with prospect research
Cold email reply handling
Lead follow-ups
Content repurposing across channels
Not a simple chatbot. Not a wrapper around an API. A real system with file structure, integrations, data flows, and business logic.
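For a rough sense of the shape I was after, here's a minimal sketch of how those six modules could hang together. To be clear: every module name, function, and data field below is a hypothetical placeholder I'm using for illustration - it's not my actual spec, and it's not the code any of the four models produced.

```python
from dataclasses import dataclass, field

@dataclass
class Lead:
    name: str
    stage: str = "new"
    notes: list[str] = field(default_factory=list)

# The six modules, reduced to stubs for illustration
def draft_content(topic: str) -> dict:          # content creation
    return {"linkedin": f"Draft post on {topic}"}

def repurpose(drafts: dict) -> dict:            # cross-channel repurposing
    return {**drafts, "newsletter": drafts["linkedin"], "x": drafts["linkedin"][:280]}

def fetch_pipeline() -> list[Lead]:             # CRM read
    return [Lead("Acme Corp"), Lead("Globex", stage="replied")]

def triage_reply(lead: Lead) -> None:           # cold email reply handling
    if lead.stage == "replied":
        lead.notes.append("draft response queued for review")

def schedule_followup(lead: Lead) -> None:      # lead follow-ups
    lead.notes.append("follow-up scheduled in 3 days")

def build_brief(lead: Lead) -> str:             # meeting prep
    return f"Brief for {lead.name}: stage={lead.stage}, notes={lead.notes}"

# Orchestration: each module hands structured data to the next,
# and everything writes back through the CRM layer.
def run_daily_pipeline() -> None:
    drafts = repurpose(draft_content("cold email deliverability"))
    print(f"Queued drafts for: {', '.join(drafts)}")
    for lead in fetch_pipeline():
        triage_reply(lead)
        schedule_followup(lead)
        print(build_brief(lead))

if __name__ == "__main__":
    run_daily_pipeline()
```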
And then I had the idea that made my Sunday way more interesting.
The Experiment
What if I didn't just build it once?
What if I gave the exact same project to four different AI models, each running through their own CLI, and compared everything? The code quality. The architecture. The speed. Whether it actually runs.
I'd been curious about this for months. Everyone online argues about which model is "the best" but nobody runs real, side-by-side comparisons on complex projects. They test on coding puzzles. On LeetCode problems. On "build me a todo app."
Nobody tests on "build me a production system with 6 integrated modules, CRM connectivity, content pipelines, and sales workflows."
So I did.
The four setups:
GPT Codex 5.3 High running on Codex CLI
Claude Opus 4.6 Medium running on Claude Code
Google Gemini 3 Pro running on Gemini CLI
Kimi K2.5 running on Opencode
The Prep Work (This Matters More Than You Think)
Before I let any model touch my project, I spent 2 hours preparing.
90 minutes on the prompt alone.
I know what you're thinking. "90 minutes on a prompt?" Yes. And it's the reason this experiment actually produced useful results.
Here's why: if you give a vague prompt to four models, you're not testing the models. You're testing how well each model handles ambiguity. That's interesting, but it's not what I wanted to know. I wanted to know which model builds the best system when you give it everything it needs to succeed.
So I wrote a detailed spec. Every module. Every integration point. The file structure I expected. The data flows. The business logic. I wanted zero room for interpretation.
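To show what I mean by "zero room for interpretation," here's the level of detail I'm talking about for a single module. This is a hypothetical reconstruction written for this newsletter, expressed as a Python dict purely to keep it structured - the field names, rules, and file paths are placeholders, not a copy-paste from my real spec.

```python
# Hypothetical illustration of per-module spec detail; not my actual prompt.
MODULE_SPEC = {
    "cold_email_reply_handling": {
        "purpose": "Classify inbound replies and draft a response in my voice",
        "inputs": ["raw reply text", "original campaign context", "CRM contact record"],
        "outputs": [
            "reply category: interested / not now / wrong person / unsubscribe",
            "draft response queued for my approval",
        ],
        "integration_points": ["CRM: move the lead to the matching pipeline stage"],
        "business_rules": [
            "never auto-send; every draft waits for human review",
            "unsubscribe requests are actioned immediately and logged",
        ],
        "file_structure": ["replies/classifier.py", "replies/drafts.py"],
    },
}
```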
Then I spent 30 minutes configuring each CLI with optimal settings. Context windows, file access, permissions, the works. Every model got the same fair shot.
And then I pressed go.
Round 1: The CLI Experience
Before I even talk about the code, I need to talk about the CLIs. Because this was one of the biggest surprises of the experiment.
The CLI is the interface between you and the model. It's where you read output, navigate files, approve changes, and debug. If the CLI sucks, it doesn't matter how smart the model is. You'll lose 30% of your productivity fighting the interface.
Codex CLI - 4.5/5
This was the best developer experience of the four and it wasn't close. The keyboard shortcuts are intuitive. The UI is clean and minimal. Everything renders fast. Navigation feels native. It's the kind of tool where you forget you're using a CLI because it just gets out of your way.
If you've ever used a really well-designed terminal application - where every shortcut is where you expect it and the output is formatted exactly right - that's Codex CLI.
Opencode (Kimi's CLI) - 4/5
Almost tied with Codex. Seriously, it's that good. The interface is clean, the shortcuts make sense, and the output formatting is readable. I had a few flickering bugs and some minor formatting issues, but nothing that broke my flow. If Opencode polishes those rough edges, it could easily be the best CLI out there.
Claude Code - 3/5
Here's where it gets interesting. Claude Code is strong. The core functionality works well. But the default keyboard shortcuts are different from what most developers expect, and that friction adds up over a long session. I also hit some performance issues and a flickering bug that was distracting.
None of this is a dealbreaker. I use Claude Code daily and it's still my go-to. But in a side-by-side comparison, the UX gap is noticeable.
Gemini CLI - 1.5/5
I want to be constructive here, but Gemini CLI feels like it's still in alpha. Lots of bugs. The color scheme is... a choice. It was genuinely hard to understand what was happening on screen. I found myself squinting at output trying to figure out what was code vs. commentary vs. error messages.
Google clearly invested in the model. The CLI feels like an afterthought.
Round 2: The Models (Where It Gets Really Interesting)
This is the part I was most excited about. Same prompt, same project, four different brains building it.
Claude Opus 4.6 - 5/5
I'll just say it: Opus understood my project better than I expected any AI to.
From the first interaction, it was asking clarifying questions that showed it had actually processed the full spec. Not generic "can you clarify?" questions. Specific ones like "for the CRM integration, do you want bidirectional sync or should it be pull-only with manual push?" Questions I hadn't thought to address in my prompt.
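To make that question concrete, here's a hedged sketch of the two options it was asking about. The toy in-memory "CRM" and the field names below are mine, invented for illustration - this isn't Opus's code or any real CRM API.

```python
class FakeCRM:
    """Stand-in for a real CRM; just enough to show the two sync modes."""
    def __init__(self) -> None:
        self.leads = {"L1": {"stage": "new"}, "L2": {"stage": "replied"}}

    def fetch_leads(self) -> dict:
        return {lead_id: dict(fields) for lead_id, fields in self.leads.items()}

    def update_lead(self, lead_id: str, fields: dict) -> None:
        self.leads.setdefault(lead_id, {}).update(fields)

def pull_only_sync(crm: FakeCRM, local: dict) -> None:
    # Pull-only: the CRM stays the source of truth. Local edits pile up
    # and only reach the CRM when a human triggers a manual push.
    local.update(crm.fetch_leads())

def bidirectional_sync(crm: FakeCRM, local: dict) -> None:
    # Bidirectional: pull anything new, then push local edits straight back.
    # More convenient, but one bad AI-written field now lands in the CRM
    # automatically instead of waiting for review.
    for lead_id, fields in crm.fetch_leads().items():
        local.setdefault(lead_id, fields)
    for lead_id, fields in local.items():
        crm.update_lead(lead_id, fields)
```

It's a small difference in a sketch, but guessing wrong here means either stale CRM data or an AI silently overwriting fields - which is exactly why the question mattered.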
It was the second fastest to finish. The code was clean and well-organized. But what really set it apart was the architecture. The file structure made sense. The separation of concerns was logical. The data flows between modules were clean.
And the kicker - the whole thing ran on the first try. I hit start and the system worked. No debugging. No "oh it forgot to install that dependency." No broken imports. Just... worked.
That almost never happens.
Kimi K2.5 - 4/5
Kimi was the speed demon. It finished before I'd even checked on the other three. And the speed wasn't coming at the expense of quality - the code was solid, the structure was reasonable, and it asked some of the best clarifying questions of any model.
In fact, Kimi's questions were arguably better than Opus's. More targeted, more specific to the business logic rather than the technical implementation. It understood the "why" behind the system, not just the "what."
But here's where it fell short: the project didn't run on the first try. Some integrations were wired up wrong. A few modules had assumptions that didn't match the others. I had to go in and fix things manually before it all clicked together.
Still impressive though. The gap between Kimi and Opus is smaller than people think.
GPT Codex 5.3 High - 3/5
I'll be honest - I've never been a big OpenAI fan. I've been using Anthropic's models as my daily drivers since Claude 2. But I wanted this experiment to be fair, so I gave GPT every chance to impress me.
It didn't.
The code itself? Beautiful. Honestly, the best-looking code of the four. Clean formatting, consistent style, well-named variables. The file structure was the second best after Opus. If you were judging purely on aesthetics, GPT wins.
But aesthetics don't matter if the code doesn't work.
GPT didn't understand my requirements well. It built something adjacent to what I asked for, but not quite right. Key business logic was missing or misinterpreted. And it never asked a single clarifying question. Not one. It just started building based on its interpretation.
It was also the slowest of the four. And when it finally finished, the code didn't run. Broken dependencies, mismatched interfaces between modules, incomplete integrations. I spent 20 minutes trying to debug it, realized it would need significant rework, and moved on.
The irony is painful: best-looking code, worst actual results.
Gemini 3 Pro - 2/5
Same fundamental problems as GPT - didn't understand requirements, didn't ask questions, code didn't run. But without the saving grace of good architecture or clean code.
The file structure was messy. The code quality was inconsistent. Some modules looked like they were written by a different model than others. It felt like Gemini was speed-running to produce output without really understanding the overall system design.
I scored it above a 1 because it did get some individual modules working in isolation. The content creation piece was actually decent. But the system as a whole? Not usable.
The Uncomfortable Insights
Here's what keeps bouncing around in my head three days later.
1. The Models That Ask Questions Build Better Systems
This was the clearest pattern. Opus asked good questions. Kimi asked great questions. Both built working (or near-working) systems.
GPT asked nothing. Gemini asked nothing. Both built broken systems.
There's a lesson here that goes beyond AI. When someone starts building without asking questions, they're not confident - they're assuming. And assumptions compound. One wrong assumption in module 1 cascades into broken logic in module 4.
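Here's a toy, made-up example of what "assumptions compound" looks like in practice - not from any of the four outputs, just the general failure pattern: one module assumes timestamps travel as strings, a later module assumes they're datetime objects, and nothing breaks until the system is wired together.

```python
from datetime import datetime, timedelta

def module_1_fetch_reply() -> dict:
    # Assumption made here: timestamps travel as ISO strings.
    return {"lead": "Acme Corp", "replied_at": "2026-02-08T14:30:00"}

def module_4_schedule_followup(reply: dict) -> datetime:
    # Assumption made here: replied_at is already a datetime object.
    return reply["replied_at"] + timedelta(days=3)

# Each module looks "correct" in isolation; the crash (a TypeError) only
# shows up when they're finally wired together:
#   module_4_schedule_followup(module_1_fetch_reply())
# The fix is trivial - datetime.fromisoformat(reply["replied_at"]) - but
# only if something, or someone, surfaces the mismatch early.
```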
The best engineers I've worked with ask the most questions upfront. Turns out the best AI models do too.
2. "Runs on First Try" Matters More Than You Think (For This Kind of Project)
For a personal system like this - something I built in a Sunday afternoon to automate my own workflows - first-try success is huge. I'm not going to spend days debugging AI-generated code for a tool only I use. If it doesn't work immediately, I'm moving on to the next model.
Now, for production apps? Different story. You want thorough testing, careful architecture, code review. "Runs on first try" is table stakes, not the finish line. But for rapid prototyping and personal tooling, it's the clearest signal of whether the model actually understood what you asked for.
Opus was the only model where I pressed run and the system worked. That told me more about its comprehension than any code review could.
3. The Best CLI + The Best Model Don't Come in the Same Package
This was the most frustrating finding. Codex CLI (best experience) is paired with GPT (disappointing model). Claude Code (mediocre CLI) is paired with Opus (best model). Opencode (great CLI) is paired with Kimi (strong model).
There's no package where both the interface and the intelligence are best-in-class. You have to choose your tradeoff.
Right now, I'll take the smarter model in a worse wrapper over a dumber model in a great wrapper. Because I can deal with weird keyboard shortcuts. I can't deal with code that doesn't run.
4. Speed Is Overrated, Understanding Is Underrated
Kimi was the fastest. GPT was the slowest. Neither of those facts predicted the quality of the output.
What predicted quality was comprehension. Did the model actually understand what I was asking for? Did it grasp the relationships between modules? Did it think about edge cases?
Opus was second-fastest and built the best system. Speed didn't hurt it, but speed wasn't why it won. It won because it understood the assignment.
What This Means If You're Building With AI
I'm not going to pretend this one experiment is a definitive benchmark. It's not. This is one project, one prompt, one developer's experience. Your results might be different.
But here's what I'd recommend based on what I saw:
Test multiple models on your actual work. Not on toy problems. Not on "build me a calculator." On the real, messy, complex thing you need built. The results will surprise you.
Invest time in your prompt. That 90 minutes I spent on the spec was the highest-ROI time of my entire Sunday. A great prompt doesn't just help AI - it forces you to think clearly about what you're building.
Pay attention to clarifying questions. If a model starts building without asking anything, that's a red flag. It means it's guessing. And guessing at scale produces confidently wrong output.
Don't trust vibes. Even though I expected Opus to win (I've been an Anthropic fan since Claude 2), I was still surprised by the size of the gap. And if I'd skipped testing the others, I never would've discovered Kimi. Data beats intuition, even in AI evaluation.
What's Next
I'm now running my entire go-to-market operation through the system that Opus built. Content creation, CRM management, meeting prep, email follow-ups - all of it. I'll share results in a future newsletter once I have a few weeks of data.
And I'm planning to run this same experiment again in 3 months. Models are improving fast. The rankings could look completely different by May.
That's what makes this moment in AI so wild. The best tool today might not be the best tool in 90 days. The only way to stay ahead is to keep testing.
This is based entirely on my personal experience with one specific project. Your mileage may vary. If you've done your own multi-model testing, hit reply - I genuinely want to hear what you found.
