Why Coding Agents Love Layered Baklava Code

Two developers walk into a bar, I mean, into a bi-weekly retrospective:

Vibe Coder: “Dude, the first two weeks were insane. I was shipping features like a 10x machine. Claude just got it. I would describe a feature and boom, working code. I felt unstoppable. Except when I ran out of tokens, of course.”

Software Engineer: “Yeah, yeah… why do you have that face then?”

Vibe Coder: “I really don’t know what happened. At some point everything started slowing down. I mean, A LOT. Every PR is a 2,000-line diff that touches half the codebase. I broke the freaking payment flow last Tuesday by adding a notification feature, WTF! Don’t ask me how, I have no idea. The tests pass sometimes. I stopped reading the code honestly, I just follow the vibes and hope for the best. Why is AI acting so stupid lately? Must be that new model!”

Software Engineer: (chuckles softly) “Come on, dude, don’t blame AI, it’s doing wonders in my project. The new model is even better!”

Vibe Coder: “What? How come?”

Software Engineer: “Take a look here. My PRs are, what? 200-300 lines maybe? I shipped that new API integration last week, four files changed. When the bug showed up on Thursday, my AI assistant found it in two minutes because it was an obvious validation issue with one of the models, going from a handler to the service layer.”

Vibe Coder: “Model? Service layer? What are you even talking about?”

Software Engineer: “Sit down and grab some baklava. I’ll explain.”

If you feel like our friend Vibe Coder from the dialogue above, you should take a few minutes and dig into my previous post, where I wrote about the baklava architecture.

The TL;DR version: a layered application architecture gives you testability, flexibility, and the ability to reason about your code, regardless of the size of the app or who writes the code.

But here is a twist I did not anticipate. When I first adopted this pattern, it was eons before any AI assistant became available. Now that they are everywhere, one side effect of clean architecture is crystal clear: layered codebases are dramatically easier for AI agents to work with.

Not by accident, but as a structural consequence. And once you see why, you will never want to go back to a flat codebase when working with AI.

The wrapping problem

Every AI coding agent I have used, Claude, GPT, ~~Copilot~~, Cursor, all of them, is way too eager to start coding. They all share the same default instinct: achieve the goal from the prompt as fast as possible. Give one a task and it will create helper functions, classes, utility modules, wrappers, some abstractions. It’s what I call “semi-structured” code. It kinda makes sense in the context of that diff, but does it fit with the rest of the codebase?

This is not a bug in one specific model. It is a pattern that emerges from training on millions of repositories solving the same problems in wildly different ways, with varying degrees of (questionable) quality.

The models learned how to code and the teachers were random people from the internet. What could possibly go wrong?

Pet projects, code without tests, not even a faint shade of good practices. When the code is good, it usually belongs to mega OSS libraries that are hyper-flexible to cover different architectures and hundreds of use cases, which makes them poor templates for app development. Most importantly, the code used for training was usually not supporting any real business under strict SLAs, real-life constraints, or maintainability concerns.

I’m not saying that all the code publicly available is bad, but maybe 90% of it is, sorry folks. It is simply really bad!

And while we have many excellent examples of OSS libraries, well structured and covered by tests, the same is not true for complete applications. The design principles of a library and an application are very different.

Without a clear architectural skeleton, the agent has no guidance on where things should go. It invents its own organization. Every time. And every time it invents something slightly different. Monday’s task gets a UserManager. Tuesday’s gets a UserHelper. Wednesday’s introduces a UserFacade. All of them looked fine on the PR diff, but by the end of the week you have five overlapping abstractions that nobody asked for.

Sound familiar? This is what happens in human-maintained codebases too, just slower. AI agents do it at 10x speed or even faster, if you are just “following vibes”.

Structure as a constraint

Here is what changes with a layered architecture: the agent does not need to decide where things go. The structure already tells it.

When you have a handlers/ directory, a services/ directory, a repositories/ directory, and clear examples in each, the agent pattern-matches against the existing code. It sees that handlers are thin, that services contain business logic, that repositories talk to the database. It follows the convention because the convention is right there in the codebase.

app/
├── handlers/
│   └── http/
│       └── user_routes.py      ← thin, delegates to service
├── services/
│   └── user_service.py         ← business logic, DI
├── repositories/
│   └── user_repository.py      ← DB operations only
└── models/
    └── user.py                 ← data structures

Give an AI agent this structure plus a task like “add email verification to user registration” and it will, almost always:

Add a method to UserService
Maybe add a repository method if needed
Update the handler to pass the new parameter
Create or update a model if the schema changes

That is exactly what you or I would do. The structure constrains the agent into the correct behavior. No new abstractions. No invented patterns. Just following what is already there.
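To make that concrete, here is a minimal sketch of how such a change could land across the layers. Everything in it is hypothetical and only for illustration: the in-memory repository, the `verified` flag, the token format, and the plain-function handler are assumptions, not code from a real project, and a real handler would sit behind a web framework.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    # models/user.py: plain data structure (fields are illustrative)
    email: str
    verified: bool = False
    verification_token: Optional[str] = None


class InMemoryUserRepository:
    # repositories/user_repository.py: data access only (a dict stands in for the DB)
    def __init__(self) -> None:
        self._users = {}

    def save(self, user: User) -> None:
        self._users[user.email] = user

    def get_by_email(self, email: str) -> Optional[User]:
        return self._users.get(email)


class UserService:
    # services/user_service.py: business logic, dependencies injected via constructor
    def __init__(self, user_repo: InMemoryUserRepository) -> None:
        self._user_repo = user_repo

    def register(self, email: str) -> User:
        if self._user_repo.get_by_email(email) is not None:
            raise ValueError("email already registered")
        user = User(email=email, verification_token=secrets.token_urlsafe(16))
        self._user_repo.save(user)
        return user

    def verify_email(self, email: str, token: str) -> bool:
        # The new feature is one more method on the existing service.
        user = self._user_repo.get_by_email(email)
        if user is None or user.verification_token != token:
            return False
        user.verified = True
        user.verification_token = None
        self._user_repo.save(user)
        return True


def register_user_handler(service: UserService, payload: dict) -> dict:
    # handlers/http/user_routes.py: thin, validates input, delegates, formats output
    email = payload.get("email", "")
    if "@" not in email:
        return {"status": 400, "error": "invalid email"}
    user = service.register(email)
    return {"status": 201, "email": user.email, "verified": user.verified}
```

Note where the feature ends up: one new service method, no new class, and a handler that still knows nothing about tokens or storage.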

AI agents are excellent pattern followers but terrible pattern inventors. Give them a clear pattern and they will replicate it faithfully. Give them a blank canvas and they will paint something… creative.

The guidelines bootstrap

A directory structure alone is not enough. You also need a few lines of explicit guidance. A good README.md or AGENTS.md (or equivalent instructions file for whatever tool you use) that says something like:

## Architecture

- Handlers: receive input, validate, delegate to services, format output
- Services: business logic, orchestration, DI via constructor
- Repositories: data access only, no business rules
- Never import from handlers into services
- Never put DB queries in handlers

That is about 200 tokens of context. Tiny, but you can certainly include more details and refine the rules. Combined with the existing code as reference, it gives the agent enough constraint to produce code that fits. I have been doing this for months now and the results are remarkably consistent.

Even when the agent deviates (and it will, occasionally), the deviation is visible. A service importing from a handler? That stands out like a broken window in your imports. The dependency rule makes violations obvious. You catch them in seconds during review.

The reviewer catches the drift

This is where it gets really fun: specialized agents. I have been running a pattern where a few agents code and others review. The coding agents do the work. The reviewers have the same architectural guidelines and focus on structural violations.

When a coding agent introduces something that breaks the layering, say, a database call inside a handler, or a service that formats HTTP responses, the reviewer catches it immediately. Not because it is smarter, but because the rules are simple and unambiguous. Binary. Either the dependency points inward or it does not.

                                              approve
                                            ↗         ↘
Coding Agents → produces code → Review Agents        human review
      ↑                                     ↘
      │                                      reject
      │                                         ↓
      │                          "Feedback: UserService imports from
      │                           handlers.http. This violates the
      │                           dependency rule. Services must not
      │                           depend on handlers. Changes rejected."
      └──── autoprompt to fix ──────────────────┘

Think about what would happen without clear boundaries. What would the reviewer check against? “Does this code feel well-organized”? That is subjective. “Does this import violate the dependency graph”? That is objective. The reviewer can enforce it mechanically.
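As an illustration of how mechanical that check can be, here is a minimal sketch built on Python's standard `ast` module. The rule table and the `find_violations` helper are assumptions made up for this example; this is not the reviewer setup described above, only one way the dependency rule could be expressed as code.

```python
import ast

# Hypothetical rule table: for each layer, the package names it must not import.
FORBIDDEN_IMPORTS = {
    "services": ("handlers",),
    "repositories": ("handlers", "services"),
}


def find_violations(layer: str, source: str) -> list:
    """Return the forbidden modules imported by `source`, a file living in `layer`."""
    forbidden = FORBIDDEN_IMPORTS.get(layer, ())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            continue
        for module in modules:
            # Match any segment so both "handlers.http" and "app.handlers.http" are caught.
            if any(part in forbidden for part in module.split(".")):
                violations.append(module)
    return violations
```

Called as `find_violations("services", source_text)`, it returns an empty list for a clean file and the offending module names otherwise, which is the kind of yes-or-no answer a review step can act on.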

Multi-agent coding is something I have been exploring with great success lately (it definitely deserves a separate post, or a whole series). The point here is that layered architecture makes multi-agent workflows tractable. Without clear rules, the reviewer has nothing concrete to enforce and becomes just another component that drifts.

Smaller files, better attention

There is a deeper, more technical reason why layered code works better with AI. It produces smaller files.

A flat Flask application might have a single routes.py with 2,000 lines. The layered equivalent splits that into maybe fifteen files averaging 80 to 150 lines each. Same total code, but each file is focused on one thing.

Why does this matter? Because of how these models handle long contexts. Research has shown that LLMs perform significantly worse when relevant information is buried in the middle of a long context.

The paper “Lost in the Middle” (Liu et al., 2024) demonstrated that models perform best when relevant content appears at the beginning or end of the input, with accuracy dropping up to 20-30 percentage points for content positioned in the middle of the context window.

Think about that applied to code. When an AI agent reads a 2,000-line file to understand the create_user logic buried at line 847, it is working against that positional bias. In practice, the model uses information in the middle of a long context less reliably than information near the beginning or the end of it.

MindStudio’s analysis of this effect on code specifically found that extraction accuracy was around 89% when relevant code appeared in the first 20% or last 15% of the context, dropping to roughly 61% when positioned in the middle range. That is not a subtle difference. That is the difference between the agent understanding your code and hallucinating something plausible but completely wrong.

With layered architecture, the agent rarely faces this problem. It reads user_service.py (120 lines), finds create at line 15, and has full attention on the relevant logic. The file is the context. No noise, no unrelated functions competing for attention tokens.

Small, focused files are not just “nice to have.” They are structurally better for how these models process information.
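If you want to see where a codebase stands, a few lines of standard-library Python will list the files that have outgrown a chosen limit. This is only a sketch: the `oversized_files` name and the 300-line default are arbitrary assumptions, not thresholds taken from the research above.

```python
from pathlib import Path


def oversized_files(root: str, max_lines: int = 300) -> list:
    """Return (path, line_count) for every .py file under `root` longer than `max_lines`."""
    results = []
    for path in sorted(Path(root).rglob("*.py")):
        with path.open(encoding="utf-8", errors="replace") as handle:
            line_count = sum(1 for _ in handle)
        if line_count > max_lines:
            results.append((str(path), line_count))
    return results
```

Running `oversized_files("app")` on a flat project tends to point straight at the files worth splitting first.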

The testing advantage

I covered this extensively in the “Baklava Architecture” post, but it is worth repeating in the AI context: layered code is dramatically easier to test.

When you ask an AI agent to write tests for a service with injected dependencies, it produces exactly the kind of focused unit tests you want:

async def test_create_user_sends_verification_email():
    user_repo = FakeUserRepository()
    email_client = FakeEmailClient()
    service = UserService(user_repo, email_client)

    await service.create(CreateUserRequest(
        name="Geordi", email="geordi@enterprise.fed"
    ))

    assert email_client.sent[-1].to == "geordi@enterprise.fed"
    assert "verify" in email_client.sent[-1].subject.lower()

No HTTP server. No database. No Docker containers. The test is fast, deterministic, and focused on one behavior. AI agents are excellent at generating these because the pattern is dead simple: inject fakes, call method, assert result. Even a junior model can get this right.
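The test above leans on a few pieces it does not show. Here is a minimal sketch of what they could look like. The method names `add` and `send`, the `SentEmail` record, and the subject line are assumptions chosen so the assertions hold; the real service and fakes may differ.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class CreateUserRequest:
    name: str
    email: str


@dataclass
class SentEmail:
    to: str
    subject: str


class FakeUserRepository:
    """In-memory stand-in for the real repository (hypothetical interface)."""

    def __init__(self) -> None:
        self.saved = []

    async def add(self, request: CreateUserRequest) -> None:
        self.saved.append(request)


class FakeEmailClient:
    """Records outgoing emails instead of sending them."""

    def __init__(self) -> None:
        self.sent = []

    async def send(self, to: str, subject: str) -> None:
        self.sent.append(SentEmail(to=to, subject=subject))


class UserService:
    """Business logic only; collaborators arrive through the constructor."""

    def __init__(self, user_repo, email_client) -> None:
        self._user_repo = user_repo
        self._email_client = email_client

    async def create(self, request: CreateUserRequest) -> None:
        await self._user_repo.add(request)
        await self._email_client.send(request.email, "Please verify your email")


async def run_example() -> FakeEmailClient:
    # Same steps as the test: inject fakes, call the method, inspect the fake.
    email_client = FakeEmailClient()
    service = UserService(FakeUserRepository(), email_client)
    await service.create(CreateUserRequest(name="Geordi", email="geordi@enterprise.fed"))
    return email_client


email_client = asyncio.run(run_example())
print(email_client.sent[-1].subject)
```

The fakes are plain classes with lists inside, so there is no mocking library to configure and nothing for the agent to get subtly wrong.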

Now compare that to testing a fat route handler that does everything. The agent has to spin up a test client, somehow mock the database, deal with authentication middleware, handle response parsing. The test becomes complex, brittle, and often just wrong. Who is going to debug it? You? How will you even notice it is wrong? We are talking about test cases sometimes longer and more complicated than the implementation itself!

The layered version gives you better coverage with simpler tests. And since the tests are simpler, the AI agent writes them correctly more often. It is a positive feedback loop: good structure produces good tests, good tests catch bad code, bad code gets rejected, structure stays clean.

The size paradox

Here is something counterintuitive. A layered codebase has more files and more directory structure than a flat one. At first glance, it looks bigger. More boilerplate. More ceremony. More scrolling in the file tree.

But the total lines of code? Actually fewer.

Because layering forces you to think about what each piece does. No duplication across handlers because the logic lives in the service. No copy-pasted queries because the repository encapsulates them. No utility functions scattered in twelve different files because they have one canonical home.

Structure makes redundancy visible. When you see two services doing the same thing, you notice. In a flat codebase, the same duplication hides in 2,000-line files where nobody scrolls past the first hundred lines. It is invisible until it causes a bug. And then it causes two, because you fixed one copy and forgot the other.

More directories, fewer lines. More files, less code per file. More structure, less total complexity. It only looks like a paradox if you measure complexity by counting files in the tree, which is in the same category as measuring productivity by LOC.

“But that’s too much boilerplate!”

No, it is not. A monolithic file is not simpler, it is just a shorter and denser pile of mess in a single place. The complexity is still there, compressed into fewer files where it is harder to find and harder to reason about. A 120-line service with one responsibility is genuinely simpler than a 2,000-line router where that same logic hides at line 847 between unrelated functions. “Too many indirections”, “too many files”, same spirit. These objections trade consistency for the immediate dopamine reward of not having to think. You are not being fast, you are not being productive. You are being lazy and hoping for the best.

Nobody reads a 2,000-line file top to bottom. You search, you scroll, you lose your place. With layers, you open user_service.py, find what you need in seconds. Done.

The one objection that deserves a real answer is “slows down prototyping.” And it is almost true. Your first day is a bit slower. You create more files. You think about where things go before you write them. That takes ten extra minutes.

But then week two happens. And week three. And you never spend thirty minutes debugging a test failure caused by some unrelated function sharing the same module. You never spend an hour tracing a bug through a 500-line handler that validates, queries, decides, formats, and logs all in one breath. Your initial velocity costs you maybe 5%. Your sustained velocity is 10x faster, because you never hit the wall that our friend Vibe Coder crashed into at week three.

The real overhead is in not layering: the debugging sessions, the duplicated logic, the tests that need full infrastructure, the fear of touching a file because everything depends on everything else.

Your codebase is your best prompt

People obsess over prompt engineering. They craft these elaborate system prompts, tweak the temperature, try different models. And sure, that stuff matters a little. But the single most influential thing your AI agent reads is not your prompt. It is your code.

That is the real takeaway. Your AGENTS.md is a pamphlet. Your codebase is the textbook. The agent will mimic what it sees, not what you told it to do in 200 tokens of instructions. If what it sees is layered, focused, and consistent, that is what it produces. If what it sees is a swamp, well, you get more swamp.

Will it be perfect? No. Nothing is perfect with probabilistic systems. But it will be consistent enough that you spend your time directing instead of correcting.

Vibe Coder: “So you’re saying the reason my AI is acting dumber is not the model, it’s my codebase?”

Software Engineer: “Yes, 100%! What’s the surprise? You are basically swimming in a spaghetti pool, man. It was bad in codebases maintained by humans, it will be the same, actually worse, with AI. The model you use is the same one I’m using. Same capabilities, same intelligence. But mine has guardrails. Yours has a blank canvas and a loaded paintball gun.”

Vibe Coder: “And the guardrails are just… directories? And a few rules in a file?”

Software Engineer: “Directories, a few rules, yes, but also a ruthless pair of engineer eyes that don’t accept slop. And, most important, I provide examples. The agent reads your existing code to figure out how to write new code. If your existing code is a mess, the new code will be a mess too. Garbage in, garbage out. It has always been like that.”

Vibe Coder: “So what do I do now? My app is already…”

Software Engineer: “A disaster? A complete mess? Yeah. But look, maybe it is not too late. Refactoring messy applications is something I did a few times in the past and can also be nicely done with AI assistance. You just need a clear target, semi-mechanical processes and a lot of patience. But you definitely need to stop ‘following vibes’ and get your sh*t together. Give a proper blueprint to your agent and…”

Vibe Coder: (stares at baklava) “Wait, are these pistachios?!”

Software Engineer: (facepalm)
