Write for the AI that doesn't exist yet
How I built a real-time trading game with AI using specs, an AI manager, and almost no human coding.
Last Thursday, I built a real-time trading game. A proper one: Go back-end, Next.js front-end, Canvas-based grid rendering, WebSocket real-time price feeds, PostgreSQL, Redis, Docker Compose, nginx reverse proxy. About 3,000 lines of code across 49 files.
I didn’t write a single line of code. Neither did my AI assistant.
I gave roughly 10 messages of guidance across a 173-message development thread. My AI assistant read the specifications, drove a coding AI through 13 phases of development, verified every build visually, filed bugs in plain English, and iterated until a live game was running on a public URL. The coding AI read the specs and wrote everything.
From vibe coding to spec-driven AI development
I’ve been writing about AI-driven development for almost a year now.
It started with vibe coding, prompt-to-PR, fast and loose. We cranked out a trading platform POC in three weeks. Then our CEO asked: “What exactly did the LLM develop?” Nobody could answer. That was the wake-up call.
So I built a spec-to-PR process. Hundreds of lines of specifications per module. A process to consume specs and produce code. Verification loops where the AI reviews its own output three times. Separate LLMs for verification. A modification router that classifies changes and cascades them through the right phases. The process grew to 13,000 lines of prompts; it became a codebase itself. I ended up building tools to manage the tools.
That process works. We validated it at L1 (single service) and L2 (multi-service with auth). But it’s heavy. The process itself is a dependency. Every project needs the full orchestration machinery.
What happened last week was different.
How I built a real-time trading game without writing the code
We’re building a trading platform. The entire architecture is defined in dozens of markdown files: specification documents that describe every component, every API contract, every data model, every UI behaviour.
The specifications form a dependency graph:
platform_standard.md: the contract between the platform and all games
shell_spec.md: the host app (auth, routing, shared state, Module Federation)
arcade_spec.md, wallet_spec.md, profile_spec.md: micro-frontends
activity_spec.md: stats aggregation back-end
trading.game_design.md: the design system
One product_spec.md per product, fully self-contained
And there’s a start.md file for each component. Not a spec but an onboarding document for an AI developer. It says: here are the files you need to read, here is where to write code, here is what “done” looks like, here is the build order, here is the acceptance test.
I told my AI assistant, Calum, running on OpenClaw, to build the Grid game. Here’s approximately what I said, across the entire multi-hour session:
1. "I want you to help me develop Grid game. Do you know how? First talk to me."
2. "Tell me your first 10 steps."
3. "OK, let's go ahead."
4. "What happened?" (when it went quiet)
5. "Don't fix it yourself, just report to Claude Code." (when it started touching code)
6. "I see several major issues in the game!" (after looking at a screenshot)
7. "Do you see any market chart? Did we check if we can place a contract?"
8. "Check with Claude Code and drive the development without me."
9. "Can you share a screenshot?"
10. "Good. That is it for now."
That’s it. Ten messages. The rest was Calum reading specs, spawning Claude Code, verifying Docker builds, catching WebSocket bugs, tracing coordinate mismatches, filing plain-language bug reports, and iterating until the game worked.
From developer to AI manager: the new software org chart
In my last post, I described building an LLM developer: 13,000 lines of prompts that could take specifications and produce code. That developer worked. But I was still the one managing it: reading specs, verifying output, filing bugs, deciding what to fix next, improving the process. I was the product manager, the QA lead, and the engineering manager, all while the LLM handled the coding.
This time, I built the manager.
The three-layer chain that emerged isn’t just “human → assistant → coder.” It’s a real org chart:
Me (CPO): Set direction, validated the plan, course-corrected twice (“don’t touch code” and “drive it without me”), verified output by looking at screenshots, provided infrastructure (upgraded server memory, set up ngrok). Ten messages across a multi-hour session. Pure strategic oversight.
Calum (Technical Product Manager / Engineering Manager): This is the role that used to be mine. Read all specs and understood the product deeply enough to know what “done” looks like. Spawned the coding AI with the right context. Verified every Docker build. Used Chrome’s debug port to take screenshots and test the product. Traced bugs through WebSocket logs. Translated symptoms into actionable bug reports. Managed round strategy: which bugs to batch, when to push back, when to move on. And after the build: ran retrospectives, updated the process, wrote scripts, improved the methodology. Never wrote a single line of code.
Claude Code (Developer): Read the specifications, planned the architecture, wrote all 49 files, fixed bugs when they were reported. Pure execution from specs.
That middle layer is not an “assistant.” In a traditional team, that role is closer to a Technical Product Manager or Engineering Manager: someone who owns the product requirements, manages the developer, owns quality, and improves the process. The difference is that I used to do all of that myself. Now the AI does it, and I moved up to giving strategic direction.
There’s one more thing the middle layer does that the word “assistant” completely obscures: it helps design and modify the specifications themselves. The root start.md is a spec editor: it reads all specs, understands their dependency graph, and proposes coherent changes across multiple files when requirements change. The manager doesn’t just execute against specs; it participates in shaping them. When I want to add a feature to the platform, I can discuss it with the manager, it proposes changes across the affected spec files, I approve, and then the coding AI rebuilds from the updated specs.
Over 13 phases, Claude Code dealt with:
WebSocket message format mismatches (backend wrapping payloads, frontend expecting flat messages)
Coordinate system conflicts (backend using absolute grid indices, frontend using relative-to-NOW)
Go HTTP server timeouts killing WebSocket connections (a well-documented footgun)
WritePump batching multiple JSON messages into a single WebSocket frame
Synchronous database calls blocking WebSocket responses
Multiplier calculation edge cases near the current price
In-memory balance state lost on container restarts
And QA!
Each bug was found by the manager through visual verification and log analysis, reported in plain English, and fixed by the coding AI. I found a couple of issues myself by looking at screenshots (“I see several major issues” and “Do you see any market chart?”), but Calum was already catching most of them.
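To make the WritePump item concrete: a write loop that flushes several queued messages into one frame breaks any receiver that parses each frame as a single JSON document. The following is a minimal sketch of a tolerant receiver, not the project's actual fix. It assumes the batched messages are separated by newlines, and the `Message` shape and `decodeFrame` name are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// Message is a hypothetical flat message shape used only for illustration.
type Message struct {
	Type  string  `json:"type"`
	Price float64 `json:"price"`
}

// decodeFrame decodes every JSON document found in one frame.
// json.Decoder reads a stream of values, so a frame holding one message
// or several newline-separated messages is handled the same way.
func decodeFrame(frame []byte) ([]Message, error) {
	dec := json.NewDecoder(bytes.NewReader(frame))
	var msgs []Message
	for {
		var m Message
		err := dec.Decode(&m)
		if err == io.EOF {
			return msgs, nil
		}
		if err != nil {
			return msgs, err
		}
		msgs = append(msgs, m)
	}
}

func main() {
	// Two messages batched into one frame (assumed newline separator).
	frame := []byte(`{"type":"tick","price":101.5}` + "\n" + `{"type":"tick","price":101.7}`)

	// A receiver that treats the frame as one document rejects it.
	var single Message
	fmt.Println("single-document parse failed:", json.Unmarshal(frame, &single) != nil)

	msgs, err := decodeFrame(frame)
	fmt.Println("decoded messages:", len(msgs), "error:", err)
}
```

The alternative is to fix the sender so that each message goes out in its own frame. Which side changed in the Grid game is not recorded here.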
Why this isn’t vibe coding
Vibe coding is a human sitting with an AI, steering it line by line, prompt by prompt. It’s interactive. It’s hands-on. It works for prototypes and small projects.
What happened here is fundamentally different in three ways.
First, the specifications are the product. The Grid game’s product_spec.md defines every mechanic: the grid overlay, multiplier calculations, settlement logic, WebSocket protocol, database schemas, Docker setup. Combined with platform_standard.md, it’s everything an AI developer needs. If I threw away all the code tomorrow and pointed a different AI at the same specs, I’d get the same game back.
Second, no human was in the coding loop. I didn’t tell anyone what file to create, what function to write, or what architecture to use. I said “build Grid game” and verified the output. The manager didn’t write code either; it verified and reported. The coding AI read the specs and made all implementation decisions. The human role was strategic oversight, not development management.
Third, the specs survive everything. Every bug, every rebuild, every model change, the specs stay. The code is disposable. In our session, Docker containers were rebuilt multiple times, the server was restarted, the entire backend was restructured to fix WebSocket issues. The specs didn’t change once. They didn’t need to.
This is the key difference. Vibe coding produces code that you need to understand and maintain. This approach produces specs that you understand and maintain, and code that’s generated, tested, thrown away, and regenerated as needed.
Why AI models are becoming a commodity
A few days before the Grid game session, I ran an experiment. Same specs, four different models, same task: build a trading product.
MiniMax M2.7 (~230B params) → $3.06, homepage rendered, game completely broken
Xiaomi MiMo-V2-Pro (~1T params) → $12.24, solid backend, frontend non-functional
GPT-5.4 (~2-5T params) → $7.62, fully working, clean design
Claude Opus 4.6 (~4-6T params) → $36.75, best overall quality, nearly 5× the cost of GPT-5.4
The pattern: models below the complexity threshold produce broken output that costs more to fix than the cheaper run saved. Models above it deliver diminishing returns at a premium cost. For this level of problem, GPT-5.4 was the sweet spot.
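The relative costs are easy to check. A small sketch that uses only the four dollar figures listed above, with GPT-5.4 as the baseline; the `run` type and `ratio` function are illustrative names.

```go
package main

import "fmt"

// run pairs a model name with the cost reported for its build.
type run struct {
	model string
	cost  float64
}

// ratio expresses a cost as a multiple of the baseline cost.
func ratio(cost, baseline float64) float64 { return cost / baseline }

func main() {
	const baseline = 7.62 // GPT-5.4, the fully working build
	runs := []run{
		{"MiniMax M2.7", 3.06},
		{"Xiaomi MiMo-V2-Pro", 12.24},
		{"GPT-5.4", 7.62},
		{"Claude Opus 4.6", 36.75},
	}
	for _, r := range runs {
		fmt.Printf("%-20s $%6.2f  %.2fx baseline\n", r.model, r.cost, ratio(r.cost, baseline))
	}
}
```

The cost ratio alone says nothing about quality. It only becomes a procurement signal when set against whether the build actually worked.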
I could run this experiment at all because the specs are model-agnostic. The same markdown files went to four different models. I didn’t rewrite prompts, I didn’t adjust for model quirks, I didn’t change my approach. The specs define intent. The model is the execution engine. Swap it out, compare results, pick the best value. Model selection becomes a procurement decision, not a technical one. Which factory builds the best product from these engineering drawings, at what cost?
And the specs aren’t optimised for today’s models. They’re written for whatever comes next. The same markdown files that produce a working-but-rough product today will produce a polished, robust product with a better model tomorrow, without changing a word. I’m not optimising for Opus 4.6. I’m writing for whatever ships in 2027.
Natural language as executable specification
In my earlier post, I described what I called Cognitive Programming Language: natural language as executable specification. What I’ve built with the trading platform is CPL in practice. The product_spec.md files aren’t documentation; they’re programs that execute on an AI runtime. And the root start.md is itself a CPL program: it reads all specs, understands their relationships, and modifies them coherently when requirements change. When I want to add a feature to the platform, I don’t edit code. I talk to the spec editor, it proposes changes across the affected files, I approve, and then a coding AI rebuilds from the updated specs. The engineering drawing is the product. The factory is replaceable.
How the system improves the process after every build
There’s one more piece I added to the skill after the Grid game session: a self-improvement loop.
During development, the AI manager logs every friction point, every moment the skill’s instructions fell short, every missing verification step, every pattern that worked better than what was prescribed. These go into a self_improve.md file. Write-only during the build. No distractions.
After the build is complete and the pressure is off, the manager reviews the log with me. Observations become concrete changes: updated rules in the skill, new helper scripts, better verification steps. The processed entries get archived.
This means the process that builds products from specs is itself improving with each product it builds. The first Grid game session exposed WebSocket verification gaps and screenshot timing issues. Those are now baked into the skill for the next build. The next product starts from a better process than the last one.
It’s the same principle as the specs themselves: durable, accumulating assets. The specs define what to build and get better as models improve. The skill defines how to build and gets better as the manager learns. Both compound. Both survive model changes, code rewrites, and context resets.
The manufacturing analogy extends naturally: you don’t just improve the engineering drawings, you improve the factory floor process too. Each production run teaches you something about quality control, tooling, and workflow. You write it down, refine the process, and the next run is smoother.
What the second product taught me about AI-led development
Everything I described above sounds clean. Elegant, even. One product built from specs, a self-improvement loop, the manufacturing analogy. It reads like it all worked on the first try.
It didn’t. The real test came when I built the second game.
Same specs architecture. Same three-layer chain. Same AI manager, same coding AI. Different product: GridRush, a real-time price-prediction game with a Canvas-based grid, WebSocket streaming, and a Markov chain pricing engine. Twelve hours. Thirty bugs. And the most valuable lessons came from the failures, not the successes.
Why screenshot-based QA misses critical product bugs
The first build gave me a working process: write specs, point AI at specs, verify output, iterate. The self-improvement loop was already in place. The AI manager already knew not to write code, not to debug code, not to plan architecture.
And yet, within the first few hours of the second build, the manager was doing exactly what the skill told it not to do: spending tokens tracing coordinate math through renderer.ts, manually working out Math.floor vs Math.ceil rounding behaviour, reading 900 lines of Canvas rendering code to diagnose an off-by-one hover alignment bug.
I had to correct it: “You actually depend on the Claude Code session to find and fix the bugs. Claude Code is attached to the best LLM model available and equal to you in smartness. As long as it has the context, always starting with start.md, then reading the specs and knowing the project, it should be able to find and fix the bugs for us.”
The rule was already written in the skill: “Never debug code. Describe symptoms, let Claude Code fix.” But knowing the rule and following it under pressure are different things. When the manager saw a bug that it could almost diagnose, the temptation to investigate was overwhelming. It spent more tokens investigating three bugs than Claude Code spent fixing them.
This is the manager / developer boundary in practice. In theory, the roles are clean. In reality, the manager has to actively resist doing the developer’s job, especially when they can see the code, understand the architecture, and have a theory about what’s wrong. The discipline isn’t “can’t investigate”; it’s “shouldn’t investigate, because the developer will do it better with the full codebase in context.”
The same tension exists in human teams. A technical PM who used to be a senior engineer will constantly fight the urge to debug the issue themselves instead of writing a clear bug report and trusting the developer. The skill now says it differently. Not just “don’t debug code” but “describe symptoms, not diagnoses. Your job is to find bugs through testing and describe them accurately. The developer’s job is to read code, trace root causes, and fix them. Trust the separation.”
The targeted delivery bug and what it revealed
Here’s what three bugs had in common:
Bug 1: Clicking the Buy Row said: “could not place any plays.” The row purchase button looked correct: arrows, multiplier text, layout all fine. But nobody had actually clicked it. Screenshots showed it rendered. Nobody tested if it worked.
Bug 2: Hovering over a cell highlighted the cell one row below the mouse. The manager took a screenshot of the hover state, and the vision model said: “appears to be at the correct position.” A one-pixel-row offset is invisible in a static image.
Bug 3: When placed chips scrolled into the no-buy zone, the entire grid jumped one column to the right. This happened every five seconds when the server’s time epoch advanced. But every screenshot was a single frozen moment; you’d never see a transition artefact in a still image.
Three different bugs, one pattern: screenshot-based QA cannot catch interactive, positional, or temporal bugs. And the first build’s QA was almost entirely screenshot-based.
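A positional bug like Bug 2 is cheap to check numerically even though it is invisible in a screenshot: sample pointer coordinates, compute the cell index, and compare it with the expected one. The sketch below assumes a fixed row height and uses hypothetical function names. It shows how a ceiling where a floor belongs produces a "one row below" symptom, but the actual root cause in GridRush is not stated here.

```go
package main

import (
	"fmt"
	"math"
)

// rowAt maps a pointer y coordinate to a zero-based row index.
// Floor keeps every y inside a row mapped to that row.
func rowAt(y, rowHeight float64) int {
	return int(math.Floor(y / rowHeight))
}

// rowAtCeil shows the off-by-one: any y strictly inside a row
// rounds up to the index of the row below it.
func rowAtCeil(y, rowHeight float64) int {
	return int(math.Ceil(y / rowHeight))
}

func main() {
	const rowHeight = 40.0
	for _, y := range []float64{10, 39.9, 75} {
		fmt.Printf("y=%5.1f floor row=%d ceil row=%d\n",
			y, rowAt(y, rowHeight), rowAtCeil(y, rowHeight))
	}
}
```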
The manager caught layout issues, colour problems, missing text, visual overlaps, everything a vision model is good at. But it completely missed:
Interactive bugs: Does clicking the button actually do the thing? Not “does it look like a button,” does the action succeed?
Positional bugs: Is the highlighted cell actually the cell under the cursor? Not approximately, but exactly?
Temporal bugs: Does the animation stutter at periodic boundaries? Not at this instant, but over the next 30 seconds?
The skill now has four explicit QA layers: Infrastructure (does it run?), Interaction (does it respond to every input?), Outcomes (does every end state actually work?), and Time (does it hold up over 30+ seconds?). Each layer catches a different class of bug. Skipping any one of them means entire categories go undetected.
The bug that exposed a deeper flaw
The most instructive bug came near the end: winning bets never settled.
Losing bets worked perfectly. When a bet was lost, the settlement message was broadcast to all connected clients, the UI showed “Lost $1.00,” and the balance stayed correct. Everything looked healthy from a happy-path perspective.
But winning bets, which trigger early settlement and send a targeted message to the specific client who won, never reached the front-end. The balance was never credited. The UI never updated. The bet just stayed “Active” forever.
The root cause was an account ID format mismatch. When a play was placed, the account ID got hashed into a UUID. The WebSocket hub stored clients by their original string ID. When a win was settled, the callback tried to send the settlement message using the UUID, which didn’t match any registered client. Message silently dropped. Balance credited to a ghost account.
This is a class of bug, not a one-off. Any system where:
One component stores identities in Format A
Another component looks them up in Format B
Broadcast paths work (they fan out to everyone, no lookup needed)
Targeted delivery paths fail (they need the exact key, and it doesn’t match)
…will exhibit the same pattern: “it works for losses but not wins,” or “notifications work for everyone but not for this specific user,” or “the message was sent but nobody received it.”
I now call this the “targeted delivery test.” After any system test: verify that per-user messages actually reach the user, not just that broadcasts work. The skill includes it as a named concept. It’ll apply to every future product that has per-user messaging.
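The failure pattern and the test can be reproduced in a few lines. The sketch below is a simplified stand-in, not the game's code: `Hub`, `SendTo`, and `derivedID` are hypothetical names, a slice of strings stands in for a client connection, and a SHA-256 prefix stands in for the UUID.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Hub is a simplified stand-in for a WebSocket hub. Each client is
// represented by the list of messages it has received.
type Hub struct {
	clients map[string][]string
}

func NewHub() *Hub { return &Hub{clients: map[string][]string{}} }

// Register stores a client under the ID it connected with.
func (h *Hub) Register(id string) { h.clients[id] = nil }

// Broadcast fans out to every client; no key lookup is involved.
func (h *Hub) Broadcast(msg string) {
	for id := range h.clients {
		h.clients[id] = append(h.clients[id], msg)
	}
}

// SendTo delivers to one client and reports whether the key matched.
func (h *Hub) SendTo(id, msg string) bool {
	if _, ok := h.clients[id]; !ok {
		return false // dropped; silent unless the caller checks the result
	}
	h.clients[id] = append(h.clients[id], msg)
	return true
}

// derivedID mimics a storage layer that hashes the account ID into a
// different format (Format B) than the one the hub is keyed by (Format A).
func derivedID(accountID string) string {
	sum := sha256.Sum256([]byte(accountID))
	return hex.EncodeToString(sum[:16])
}

func main() {
	hub := NewHub()
	hub.Register("acct-42")

	hub.Broadcast("lost $1.00") // broadcast path: works without any lookup

	// Targeted path with the derived key: no registered client matches.
	fmt.Println("delivered with derived ID:", hub.SendTo(derivedID("acct-42"), "won"))

	// Targeted path with the key used at registration: delivered.
	fmt.Println("delivered with original ID:", hub.SendTo("acct-42", "won"))
}
```

The targeted delivery test amounts to asserting that `SendTo` reports success for the key the settlement path actually uses, rather than inferring health from the broadcast path.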
How the process improved after the second build
After the second build, I sat down with the manager and rewrote the skill. Not “appended some notes” but restructured the core methodology based on what failed.
The QA layers are the obvious improvement. But the subtler changes matter more:
The cleanup script. During the second build, stale Claude Code processes accumulated to 39 instances using ~6GB of RAM. They OOM-killed Chrome, which killed the remote desktop, which killed the debug session. The skill now includes a cleanup script that runs before and after every coding session, with a hard rule: never more than one Claude Code instance running at a time.
The smoke test. Several QA rounds were wasted finding bugs in a broken build. The Docker image hadn’t rebuilt cleanly, and we were testing stale code. The skill now gates all QA behind a smoke test: health endpoints, Docker status, WebSocket connectivity. If the build is broken, don’t test features.
The bug report format. The first build’s bug reports mixed symptoms with diagnoses. “The WebSocket message handler on line 85 uses so.AccountID.String() which returns a UUID, but the hub is keyed by string ID.” That’s a diagnosis, not a symptom. The second build taught us that the symptom (“winning bets never settle, losing bets work fine”) is more useful to the coding AI because it doesn’t constrain the search to one hypothesis.
The journal as a context bridge. On surfaces where persistent sessions aren’t available (Slack, in our case), each coding round starts fresh. The journal file (journal.md) becomes the context bridge: the coding AI writes what it did, what it changed, what’s left, and the next round’s task points it at the journal first. This is imperfect (information loss at each boundary) but workable.
Why specs and skills are the only assets that compound
Here’s the punchline: none of these improvements is specific to GridRush. The four QA layers apply to any interactive product. The targeted delivery test applies to any system with per-user messaging. The cleanup script applies to any process that spawns subprocesses. The smoke test applies to any Docker-based build.
The next product I build starts with a better process than GridRush had. And the product after that starts better still. Each build deposits its lessons into the skill, and the skill applies to all future builds regardless of what they are.
And here’s what’s interesting: the skill improvements are themselves driven by the three-layer chain. I notice a process gap (“Why didn’t you catch that bug?”), the manager diagnoses the gap and proposes a fix, and the fix gets written into the skill. The system that builds products is improving itself using the same delegation pattern that builds the products.
Why code depreciates, but specs and skills compound
This system produces two durable assets that gain value over time, and everything else in the stack is disposable.
Asset 1: The specifications. They define what to build. They appreciate as models improve: the same markdown files that produce a rough product today will produce a polished one with a better model tomorrow. They’re model-agnostic, framework-agnostic, language-agnostic. Written once, executed by whatever AI runtime exists next year.
Asset 2: The skill. It defines how to build. It’s the operating manual for the AI manager: how to read specs, how to drive a coding AI, how to do QA, how to report bugs, when to escalate. And it improves with every product. The Grid game taught it to verify WebSocket connections. GridRush taught it four QA layers, the targeted delivery test, symptom-first reporting, and process cleanup discipline. The next product will deposit its own lessons. Each build makes the skill better, and the better skill applies to every future build.
Both compound. Both survive model changes, code rewrites, framework migrations, and context resets.
Now look at what depreciates:
The code rots. It accumulates technical debt from day one. It’s coupled to specific frameworks, specific language versions, specific deployment targets. Every dependency update is a potential breaking change. We rebuilt the Grid game’s entire backend during the session. The specs didn’t change. The code was disposable, and we disposed of it.
The agent framework depreciates too. OpenClaw, the platform that runs my AI manager, is infrastructure. It will be replaced, upgraded, or obsoleted by whatever comes next. The same is true for Claude Code, for Docker, for Go, for Next.js. These are the factory equipment, not the product and not the process knowledge.
The models are commodities. I showed this with the four-model experiment: same specs, four different models, different price points. The model is a runtime. You pick the best value today and switch when something better ships tomorrow.
The traditional software stack inverts this completely. Teams invest in code (depreciates), frameworks (depreciates), and model-specific prompt engineering (depreciates). They treat specs as documentation (an afterthought) and process knowledge as tribal (lives in people’s heads, lost when they leave).
This approach flips the investment. The specs and the skill are the primary assets. Everything else is replaceable infrastructure. And because both assets improve over time, one from better models and one from accumulated process learning, the system gets better even when you’re not actively building.
What this means for the future of software engineering
Three shifts are happening at once, and most teams are only seeing the first one.
Shift 1: AI writes code. Everyone sees this. Copilot, Cursor, Claude Code, Codex. This is table stakes now.
Shift 2: AI manages the AI that writes code. This is where I am. In my last post, I described building an LLM developer. This time, I built the manager. The 13,000-line process I described before, the orchestrator, the modification router, the verification loops, collapsed into two things: well-written specifications and a skill file that defines how the manager operates. The human’s role changed from engineering manager to CPO. Most teams haven’t made this jump yet because it requires the right specification infrastructure and, frankly, the willingness to let go of the engineering work.
Shift 3: Specs and skills become the primary investment. This is the one that changes everything. When code is disposable, and models are interchangeable, there are exactly two durable assets: the specifications (what to build) and the skill (how to build). Everything else, the code, the frameworks, the models, the agent platforms, depreciates. The competitive advantage shifts from implementation speed to specification quality and process maturity.
The roles don’t need 13,000 lines of process orchestration. They need well-written specs, a self-improving skill, and a clear separation: one agent that owns the product, one agent that builds it. Write the specs. Point an AI at them. Let the skill guide the process. Verify the output. Iterate. Each cycle makes the specs more precise and the skill more capable. The code is a side effect. The specs and the skill are the product.
Kaveh Mousavi Zamani is a Vice President of Engineering at Deriv.