Software Factory Is the Goal. Here's How to Build One You Can Trust.

You shipped faster for a quarter. Then it flattened.

Here is a pattern I see in almost every engineering org that adopted AI coding tools in the last two years.

The team decides on working fully with Cursor, Codex, or Claude Code. AI analyze the code base and generate some rules, and it seems every PR now is landing faster and high quality, agents do decent work in majority of the tasks, and most of the decisions made by the AI looks right. Pull requests get a little bigger and land quicker, new bugs getting in the code base, there’s a mess, and then u need to hold down the horses, take the time to really analyze every line of code generated, and make sure very well that whats about to be merged - really does what it should, without any nasty side effects and regressions. Six months in, the team is shipping at roughly the same pace it was before. The tool became an expensive habit, not a transformation to your R&D

That plateau is real, and it is not a tooling problem. It is a level problem. The team adopted AI as a faster way to do the same work, in the same workflow, with the same person typing in the same editor. The bottleneck moved, but it did not disappear. It moved from "how fast can I type" to "how fast can I review, decide, and integrate", and nobody redesigned the work around that new bottleneck.

The teams that break past the plateau do something different. They stop treating AI as a talented software engineer that can figure out things on its own, and start treating it as production capacity that needs a system around it. That system is what we call a software factory.

A software factory is a system where AI agents produce software within boundaries the engineering team designs. It is not a building with no people in it. It is not a synonym for full automation. It is an operating model: humans set direction, design the guardrails, and review what matters, while agents do the production work. The important part is the boundaries. A factory without boundaries is just a faster way to generate code you do not trust.

The factory is the destination. I want to be clear about that up front, because the rest of this piece argues for it. The only open question, which I will come back to, is whether you build it with the lights on, with humans in the highest-value seat, or chase the dark factory that runs with nobody on the floor.

The five levels of AI engineering

The ladder is adapted from Dan Shapiro's "The Five Levels: from Spicy Autocomplete to the Dark Factory". Here is what each step looks like when a real team is standing on it.

Level 0 is manual. You write every line yourself. There is nothing wrong with this for the right problem, and most teams still drop back to L0 for the genuinely hard or genuinely novel work. It is a floor, not a failure.

Level 1 is task offloading. You hand the AI discrete, well-shaped jobs. Write this function. Explain this stack trace. Convert this snippet to TypeScript. The AI is a faster reference, and the value is real but bounded. Nothing about how you work has changed; you just got a way better search box.

Level 2 is active pairing. This is where most teams live, and where most teams stall. The model write code , you approve each change, in real time. It feels collaborative and it is genuinely faster than L1. The catch is that you are still in the loop on every micro step. Your reading speed, and your attention are still the constraint. You have made the author faster without changing the fact that there is exactly one author per task. That is the plateau, and the reason it feels like a ceiling is that, inside the L2 workflow, it is one.

Level 3 is human-in-the-loop manager. The shift here is structural, not incremental. The agent does the work end to end and produces a complete change. You stop reviewing keystrokes and start reviewing diffs. Your job moves from author to editor. One person can now have three or four changes in flight, because the unit of attention is the pull request, not the line. This is the first rung that actually breaks the plateau, and it is the one most teams have not climbed, because it requires trusting output you did not watch get produced. The real gap, is not just trusting the output or not, its the systematic approach that enables the agent to produce more quality, strucutred and expected outputs.

Level 4 is autonomous with oversight. You stop writing code and start writing specs. You describe the outcome, the constraints, and the acceptance criteria. The agent plans the work, builds it, runs its own checks, and comes back with something that already passed the gates you defined. You approve outcomes, not implementations. Getting here is mostly about the quality of your specs and the quality of your automated validation.

Level 5 is the dark factory. Agents ship to production with no human in the loop. The software factory is absolutely the destination. The dark factory is not in my opinion (as i write this post - May 2026). For almost every team, and certainly for every team building a product that matters, the goal is a lights-on factory, where humans stay in the loop and review the majority of output, improve the factory context, the best practices and the anti patterns and make sure they ship high quality product.

That last point matters more than it looks. The hype cycle wants the five levels to be a staircase to full automation, with L5 as the prize. In practice, the highest-leverage place to operate is L3 to L4 with the lights firmly on. The teams getting the most out of AI are not the ones who removed the humans. They are the ones who moved the humans to the highest-value seat in the building and built a system good enough to trust the rest.

Which level is your team at?

Before you can climb, you have to be honest about which rung you are standing on. Read these and pick the one that sounds most like a normal Tuesday for your team, or take the two-minute self-assessment and let it place you.

You write nearly everything by hand, and AI suggestions feel more distracting than helpful. That is Level 0 to 1.
You ask the AI for specific functions, explanations, or snippets, then wire them in yourself. That is Level 1.
You code alongside the AI in real time, accepting and editing its suggestions line by line, and it feels faster but you are still typing through every change. That is Level 2, and you are on the plateau.
You routinely hand an agent a whole task, walk away, and review the finished diff like you would review a teammate's pull request. That is Level 3.
You write specifications and acceptance criteria, and agents deliver changes that already passed your automated checks before you look. That is Level 4.
Changes reach production without a human reviewing them. That is Level 5, and if you are here by accident rather than by design, that is worth a conversation.

Most teams reading this will land on number 3. If that is you, the rest of this piece is about the gap between where you are and number 4, and why that gap is made of context and validation, not raw model capability.

What the climb actually takes

Here is the part that gets skipped. Moving from Level 2 to Level 4 is not a matter of buying a better model or writing better prompts. It is a matter of building the machinery around the model. The team at re-cinq frames this well, and it lines up almost exactly with what we have had to build in practice. A working factory needs six capabilities, and a model on its own gives you none of them.

Orchestration Defines who performs each action and in what specific order. This is the overarching system director.
Isolated Environment Uses sandboxes to create distinct, isolated computing environments.
Context & Memory Stores what the agent knows. This is the agent's knowledge base.
Governance Implements guardrails to control agent behavior and actions.
Validation Checks if the agent's actions or outputs are actually correct.
Learning This foundational stage enables the entire system to get better over time through feedback loops and improvement.

Orchestration is deciding who does what and in what order. A real task is not one prompt. It is plan, implement, test, review, revise. Orchestration is the layer that routes a piece of work through those steps, decides when to parallelize, and knows when to stop and ask a human. Without it, you are back to one human babysitting one prompt, which is Level 2 with extra steps.

Isolated environments are the sandboxes where agents run. An agent that can touch your real database, your real deploy pipeline, or your real customer data is a liability, not a capability. The factory gives each agent a clean, disposable place to work, where the worst case is a thrown-away branch and not a production incident. This is the single most common thing missing when a team tells me their agent "did something scary".

Context and memory is what the agent knows about your system. This is the one that separates a demo from a factory. A model with no context will write plausible code that does not fit your architecture, ignores your conventions, and reinvents a utility you already have. Giving the agent the right context, through retrieval over your codebase and docs (RAG), through persistent memory of past decisions, through MCP servers that connect it to your real tools, is most of the work.

Governance is the set of guardrails: what an agent is allowed to do, what requires human sign-off, what is permanently off-limits. Branch protections, required reviews, scoped permissions, audit logs. Governance is what lets you turn the autonomy up without lying awake about it.

Validation is how you know the output is correct without reading every line yourself. Tests, type checks, linting, and, above all, evals. At Level 2 you validate by reading. At Level 4 you cannot read everything, so the system has to validate for you. A factory is only as trustworthy as its validation layer.

Learning is the loop that makes the system better over time. Every failed change, every human correction, every reverted PR is a signal. A factory captures those signals and feeds them back into context, prompts, and guardrails, so the same mistake does not happen twice. - Without this, you have a static tool. - With it, you have a system that compounds.

A few patterns i recognize:

The fintech teams tend to be firmly at Level 2: strong engineers, heavy AI agents usage, and a velocity curve that goes flat after some time. The fix is almost never a better model. It is orchestration plus isolated environments. Once agents can run end to end in sandboxes and open reviewable PRs, the same engineers move to Level 3, and the number of changes one person can shepherd at once climbs sharply.
Teams keep getting "almost right" code that does not match their internal patterns. The entire problem is context. Invest in retrieval over the codebase and a memory layer for architectural decisions.
High-stakes product teams, cannot safely raise autonomy because a bad change has real consequences. The unlock is governance and validation: scoped permissions, mandatory checks, and a holdback eval suite. Only once that machinery exists can they let agents operate at Level 4 on a defined slice of the codebase.

The bottleneck is context and evals, not the model

If you take one thing from this piece, take this: the constraint on climbing the levels is almost never the intelligence of the model. It is the quality of the context you give it and the quality of the validation you check it against.

The context half is intuitive once you have felt it. A frontier model with no knowledge of your system is a brilliant engineer on their first hour at the company. It will write clean, confident code that is wrong for you, because it does not know your conventions, your past decisions, or the three internal libraries it should have used. Closing that gap is what RAG over your codebase, persistent memory, and MCP connections to your tools are for. This infrastructure work is where the actual leverage lives.

The validation half is the one teams underinvest in, and it is the one that gates Level 4 entirely. At Level 4 you are not reading the code. So the question becomes blunt: how do you know it is right?

Part of the answer is the ordinary discipline you already know. The agent builds against a real spec and a real test suite, the same way a human would, and it absolutely should see those. But tests the author can see are tests the author can satisfy narrowly, and an agent is a relentless optimizer for whatever you put in front of it. So you add a second tier: a holdback eval suite, a set of known-good cases the agent never sees while it works, that every change is scored against before it can merge. This is the same logic as a held-out test set in machine learning, where you keep some data back precisely so you can tell whether the system generalized or just fit the examples it was handed. Vercel has a clear introduction to evals, and it makes the right point that evals complement your existing tests rather than replace them. The core idea is simple - You cannot trust output you did not watch unless you have an independent, automated way to catch it when it is wrong.

Here is the shape of it. This is deliberately minimal, just enough to make the idea concrete.

1// A holdback eval: cases the agent never sees during the task,

2// scored automatically so a human does not have to read every diff.

4const holdbackSuite = [

5 {

6 name: 'rejects expired discount codes',

7 input: { code: 'SUMMER2023', cartTotal: 100 },

8 expect: (result) => result.applied === false && result.error === 'EXPIRED',

9 },

10 {

11 name: 'stacks one promo but not two',

12 input: { codes: ['SAVE10', 'SAVE20'], cartTotal: 100 },

13 expect: (result) => result.appliedCodes.length === 1,

14 },

15 // ...dozens more, covering the edge cases that actually break in prod

16];

18async function scoreChange(agentBuild) {

19 const results = await Promise.all(

20 holdbackSuite.map(async (testCase) => {

21 const output = await agentBuild.run(testCase.input);

22 return { name: testCase.name, passed: testCase.expect(output) };

23 })

24 );

25 const failures = results.filter((r) => !r.passed);

26 // Gate: an agent change cannot merge unless the holdback suite is green.

27 return { canMerge: failures.length === 0, failures };

28}

Notice what this is doing. The agent never sees the holdback cases while it works, so it cannot tune its output to pass them the way it can with the tests you handed it to build against. The suite is scored automatically, so a human does not have to read every line to trust the change. And the gate is binary: green merges, red does not. That single mechanism is what lets you raise autonomy without raising risk. It is also, not coincidentally, the thing most teams have not built, which is exactly why they cannot get past Level 3.

Two more honest beats, because they shape how you should build.

Boring technology wins, because agents need legibility. The more conventional and predictable your stack, the better agents perform, because the patterns are well represented and the surprises are few. A clever, bespoke architecture that a senior human can hold in their head is often a liability in a factory, because the agent cannot. There is a second reason that matters more for a lights-on factory: legibility is what keeps human review cheap. If a reviewer can glance at a diff and immediately understand it, the human in the loop stays fast enough to keep up with the line. When you are designing for a software factory, favor the legible choice over the clever one. This is the opposite of the instinct most strong engineers have, and it is correct anyway.

Watch the junior pipeline. This is the other half of keeping humans central, and it is easy to miss. If agents absorb all the entry-level work, you risk hollowing out the path that turns juniors into seniors. The seniors who design and supervise these factories learned their judgment by doing the work the factory now does. That is a real organizational risk, not a hypothetical, and the teams that handle it well are deliberate about keeping humans on enough of the real work to keep growing them. A factory that eats its own talent pipeline is not a win.

Where this leads

The climb from Level 2 to a working lights-on factory is not a model upgrade. It is an operating-model upgrade. It runs on context, validation, and the discipline to keep humans in the highest-value seat instead of removing them from the building.

That is also exactly the work we do at Enpitech, and it maps cleanly onto two of our pillars. Train is how we get your existing team off the Level 2 plateau: hands-on, in your codebase, building the orchestration and review habits that move authors into editors. Adopt AI is how we build the machinery with you: the isolated environments, the context and memory layer, the governance, and the eval suites that make Level 4 safe to operate. We do this inside our own product work and inside several customer applications, which is why this piece is written from the floor rather than the whiteboard.

If you read the five levels and recognized your team on the plateau, the next step is to figure out exactly which rung you are on. Our self-assessment walks you through it in a couple of minutes, and from there our AI Discovery Sprint maps the specific gap to the next rung, with the lights firmly on.

Software Factory Is the Goal. Here's How to Build One You Can Trust.

You shipped faster for a quarter. Then it flattened.

The five levels of AI engineering

Which level is your team at?

What the climb actually takes

The bottleneck is context and evals, not the model

Where this leads

Read next

Design System Is Law. Here's How We Enforce It With design.md

MCP Servers Aren't Just for Developers. They're the New Storefront.

Building Your AI Second Brain: