I Posted About Kilo Code. Then I Actually Tested It.

You know the pattern. Someone posts about a new AI coding tool. It looks interesting. You share it, tell yourself you'll try it when you have time. You never try it.

I recently posted about Kilo Code because it looked worth sharing. Then I thought — actually test the thing. Build something real with it. See what it can and cannot do. Write what you find.

So I did.

What this is: A real test of three AI agents on one project: Claude Code, Kilo Code, and Codex

Built with: React 18 + TypeScript + Vite. localStorage only, no backend, no login

Agent sequence: Claude Code (builder) → Kilo Code (UI polish) → Codex (data safety review)

The project: Recipe Memory Box, a place to save the recipes you don't want to lose

Try it: Recipe Memory Box →

The project

I didn't want a fake test project. A todo app tells you nothing.

I built something I actually wanted: a personal recipe box that runs in the browser. A place to save the soup from your grandmother's notebook, the ramen you found in a YouTube comment, the chicken you made up one Sunday and nearly forgot.

No login. No backend. No subscription. Just a browser app that saves your recipes locally.

I called it Recipe Memory Box. I used three agents: Claude Code, Kilo Code, and Codex. Each had a different job. That mattered.

Before anyone wrote code

Before the first agent touched a file, I set up a shared instruction kit. One folder, one CLAUDE.md, product specs, a data model, a task list, and a build journal. Every agent got the same map.

This sounds obvious. My first attempt proved it wasn't. That version had separate setup files for each agent, and the agents ended up making slightly different assumptions. The fix was one shared system before any code existed.

The first useful fix was not a code feature. It was giving every agent the same instructions.

What Claude Code built

Claude Code was the builder. I gave it the spec and said: build Phase 1.

It came back with a complete working app. Vite + React + TypeScript. A warm cream-and-terracotta design. Recipe card grid, add/edit/delete, search, filters, favorites, ratings, localStorage, JSON import/export. Five sample recipes loaded on first open.

Zero TypeScript errors. Production build passed first try.

Two decisions worth stealing:

Textareas instead of dynamic lists. The ingredients and steps fields are plain textareas — type one item per line, split on newlines when saving. Not the polished v2 version. The right v1 version. Simpler code, simpler UX, easier to read.
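
For a sense of how little code the textarea approach needs, here is a minimal sketch. The helper names linesToList and listToLines are mine, made up for illustration; the app's real functions may look different.

```typescript
// Hypothetical helper: turn a textarea's raw value into a clean list.
function linesToList(raw: string): string[] {
  return raw
    .split(/\r?\n/) // handle both Unix and Windows line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0); // drop blank lines
}

// The reverse direction, for filling the textarea when editing a recipe.
function listToLines(items: string[]): string {
  return items.join("\n");
}
```

Two small functions cover the whole feature, which is the point of choosing the v1 version.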

CSS variables before components. The color palette was defined before the first component. When everything uses --color-accent, the whole app feels consistent automatically. That decision cost nothing and paid off everywhere.

What Kilo Code did

After Claude Code finished, I handed the project to Kilo Code for a UI polish pass.

This is where the test got interesting. Kilo Code is designed for focused in-editor work — one file, one problem, small safe change. That's a different job from building an app from scratch. I wanted to see if that distinction held up in practice.

It started with friction.

Kilo Code opened in the project root instead of the recipe-memory/ subfolder. CLAUDE.md came back not found. Then ls -la failed — not a valid PowerShell parameter. Then grep wasn't recognized. Then the ripgrep binary wasn't cached. Four environment problems before a single line of CSS was changed.

None of that affected the output. It adapted: used dir, read the CSS file manually, found what it needed. Then it worked through the task list cleanly.

What it changed — all in globals.css:

The favorite toggle button on recipe cards now scales slightly when activated, so clicking the heart gives a visible response instead of just a color change. One line.

The recipe form's two-column rows (Source/Difficulty, Prep/Cook, Servings/Source Detail) now collapse to single column on mobile. The form was cramped at 375px. Now it isn't. One override in the existing media query.

Every interactive element now has a visible focus ring for keyboard navigation. Filter buttons, tag chips, favorite buttons, modal close, star ratings — none of them had focus styles. All of them do now. The pattern was already in the codebase for inputs; Kilo Code extended it consistently to buttons.

These are the changes that don't get noticed until they're missing. Tabbing through an app and getting no visual feedback is friction. A form that's cramped on your phone is friction. A heart button that doesn't respond when you tap it feels broken even if it isn't.

Kilo Code fixed those things without touching data, without adding dependencies, without changing anything it wasn't supposed to change.

What Codex found

After the polish pass, I ran Codex as the reviewer.

The app looked complete. The happy path worked perfectly.

Codex found the quiet risks.

Partial imports were trusted as full recipes. The validator checked three fields then cast the object as a complete Recipe. Missing tags, timestamps, difficulty — all silently undefined in app state. Old exports, hand-edited storage, imports from other versions: none treated as untrusted.
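
To make that finding concrete, here is a rough sketch of what normalizing on import could look like. The Recipe shape, the field names, the "medium" default, and the normalizeRecipe helper are all assumptions for illustration, not the app's actual code.

```typescript
// Hypothetical Recipe shape; the real app's fields may differ.
type Difficulty = "easy" | "medium" | "hard";

interface Recipe {
  id: string;
  title: string;
  ingredients: string[];
  steps: string[];
  tags: string[];
  difficulty: Difficulty;
  createdAt: string;
  updatedAt: string;
}

function isDifficulty(value: unknown): value is Difficulty {
  return value === "easy" || value === "medium" || value === "hard";
}

// Keep only the string entries of an array; anything else becomes [].
function stringList(value: unknown): string[] {
  return Array.isArray(value)
    ? value.filter((item): item is string => typeof item === "string")
    : [];
}

// Treat imported data as untrusted: reject entries without an id or title,
// and fill every other field with a safe default instead of casting.
function normalizeRecipe(
  input: unknown,
  now: string = new Date().toISOString()
): Recipe | null {
  if (typeof input !== "object" || input === null) return null;
  const raw = input as Record<string, unknown>;

  const id = raw["id"];
  const title = raw["title"];
  if (typeof id !== "string" || id === "") return null;
  if (typeof title !== "string" || title.trim() === "") return null;

  const difficulty = raw["difficulty"];
  const createdAt = raw["createdAt"];
  const updatedAt = raw["updatedAt"];

  return {
    id,
    title: title.trim(),
    ingredients: stringList(raw["ingredients"]),
    steps: stringList(raw["steps"]),
    tags: stringList(raw["tags"]),
    difficulty: isDifficulty(difficulty) ? difficulty : "medium",
    createdAt: typeof createdAt === "string" ? createdAt : now,
    updatedAt: typeof updatedAt === "string" ? updatedAt : now,
  };
}
```

The difference from a cast is that every field in app state is either checked or defaulted, so nothing arrives as a silent undefined.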

localStorage was loaded without validation. Everything in the saved array was assumed valid. A corrupted entry had a direct path into the recipe grid.
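
A defensive load might look something like the sketch below. The StorageLike interface, the loadList name, and the idea of passing in a per-entry normalizer are assumptions for illustration; in the browser, window.localStorage satisfies the interface.

```typescript
// Minimal slice of the Web Storage API, so the function can be tested
// without a browser. window.localStorage matches this shape.
interface StorageLike {
  getItem(key: string): string | null;
}

// Load a saved array, passing each entry through a normalizer.
// Corrupt JSON, a non-array payload, or a bad entry never reaches the UI.
function loadList<T>(
  storage: StorageLike,
  key: string,
  normalize: (entry: unknown) => T | null
): T[] {
  const raw = storage.getItem(key);
  if (raw === null) return []; // nothing saved yet

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // corrupted JSON: start empty instead of crashing
  }
  if (!Array.isArray(parsed)) return [];

  const result: T[] = [];
  for (const entry of parsed) {
    const item = normalize(entry);
    if (item !== null) result.push(item); // silently skip invalid entries
  }
  return result;
}
```

In the app this would be called with the real storage and whatever normalizer the import path already uses, so both entry points share one definition of a valid recipe.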

Importing the same backup twice duplicated everything. The merge strategy avoided ID collisions by assigning new IDs. Technically correct. Practically — import your own backup twice and you now have two of every recipe you own.
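
The skip-duplicates strategy is easy to sketch. The mergeById name and the returned skipped count are my additions for illustration, and the sketch assumes each recipe carries a string id.

```typescript
interface HasId {
  id: string;
}

// Merge imported items into the existing list, skipping any id that is
// already present instead of assigning it a new one.
function mergeById<T extends HasId>(
  existing: T[],
  incoming: T[]
): { merged: T[]; skipped: number } {
  const seen = new Set(existing.map((item) => item.id));
  const merged = [...existing];
  let skipped = 0;

  for (const item of incoming) {
    if (seen.has(item.id)) {
      skipped++; // same id as something we already have: leave it out
      continue;
    }
    seen.add(item.id); // also guards against repeats inside the import file
    merged.push(item);
  }
  return { merged, skipped };
}
```

With this approach, importing your own backup a second time leaves the list unchanged, and the skipped count gives the UI something honest to report.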

Negative numbers saved without clamping. min={0} on the HTML input, nothing enforced on save.
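
Enforcing the limit on save can be a single helper. This is a sketch under assumptions: the clampNonNegativeInt name is hypothetical, and rounding to whole numbers is my guess at what fields like prep time and servings would want.

```typescript
// Coerce a form value to a whole number that is never below zero.
// Strings are parsed first, because HTML inputs hand back strings.
// Note that Number("") is 0, so an empty field is stored as 0.
function clampNonNegativeInt(value: unknown, fallback: number = 0): number {
  const n = typeof value === "string" ? Number(value) : value;
  if (typeof n !== "number" || !Number.isFinite(n)) return fallback;
  return Math.max(0, Math.round(n));
}
```

The min={0} attribute stays on the input as a hint to the user, but the saved value no longer depends on it.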

Codex fixed all four. Normalization on import. Normalization on load. Skip duplicate IDs instead of reassigning. Clamp numeric fields before save. Small patches, no new dependencies, build still passed.

The app looked finished. The reviewer found the part users would eventually hit.

What I actually think about the three-agent workflow

Most tool reviews test the tool in ideal conditions. Clean project, obvious task, everything set up perfectly. That's not how real use works.

Real use has environment friction. Real use has a project someone else started. Real use has a specific problem to fix, not a blank canvas.

Claude Code is best when you have a clear spec and need something built end to end. Give it a complete brief. Let it make the initial decisions. It moves fast and it doesn't add things you didn't ask for.

Kilo Code is best when something specific needs fixing. Not the whole app — one file, one problem. The in-editor role is real and different from the builder role. The environment friction it hit (wrong directory, PowerShell gaps) is worth knowing about before you start. Set up the workspace correctly first.

Codex is best as the second set of eyes after the build looks done. Not for finding missing features — for finding the edges where the happy path ends and user reality begins. Import/export, data validation, trust boundaries. The things that don't show up until someone uses the app in a way you didn't plan for.

The workflow that worked: Claude Code built the product, Kilo Code polished the edges, Codex found the hidden risks. Each agent had one job. None of them tried to do all three.

That's the lesson. Not that AI built everything. That the agents were most useful when they weren't all doing the same thing.

Used in this build: Claude Code, Kilo Code, Codex, React + TypeScript + Vite

