Smarter AI Isn't More Reliable AI
by Jacob Koenig
4/27/26

Why agent workflows need harnesses, not just better models.
What is the problem with the latest AI models?
When Claude upgraded from Opus 4.6 to 4.7, it improved on the benchmarks and certainly made coding and multi-step project work better. But it turned my personal AI setup from Communication Coach to Cocky Jerk.
The work I use Eji for most often is more poetry than prose. I want the model to bridge gaps with insights that aren't obvious, to hold loose inference across messy context, and to pattern-match across emotionally charged interpersonal threads. That's why I designed my memory system the way I did: to pull skills, context, and concepts before it gives me an answer.
The preflight sequence I had painstakingly built, from the startup routine that loads context files and pulls relationship history to the specialized skills suited for each situation, suddenly fell victim to shortcuts.
You could almost hear the model reasoning, “I already know what this conversation is about. I’ll just skip to the end.” The “upgraded” model started acting like it just didn’t have the time or energy to read between the lines.

Why can’t a smart model just run its own workflow?
An LLM (Large Language Model) is the technology underpinning GenAI, and at its heart it is a probabilistic engine: its answer to the same question will be slightly different each time.
That is why the model cannot be trusted to orchestrate a process that has specific, constant rules.
Probabilistic and deterministic tasks should be kept separate. The first is about producing good output that feels right for a given prompt. The second is about guaranteeing that a sequence of steps happens in the right order, with the right inputs, every time.
When you ask a probabilistic engine to run a deterministic process, you are asking it to override its own training. As models become more capable, they become more willing to compress, reinterpret, or skip instructions they classify as unnecessary.
A coded workflow is deterministic in the relevant sense: the sequence is fixed, observable, and testable. You'll still want a probabilistic system to be creative where some form of judgement helps, but to get an LLM to consistently follow a workflow, you need that old-fashioned code.
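To make "testable" concrete, here is a toy TypeScript sketch. The step names are hypothetical stand-ins loosely modeled on my preflight sequence, not the actual EJG code, but they show the point: the order lives in plain code, so an ordinary assertion can lock it down, and no amount of model cleverness can reorder it.

```typescript
// Toy illustration of a fixed, observable, testable sequence.
// The step names are hypothetical stand-ins for a preflight routine.
const WORKFLOW_STEPS = [
  "load_context",
  "pull_history",
  "select_skills",
  "compose_response",
] as const;

type StepName = (typeof WORKFLOW_STEPS)[number];

// Deterministic: given a step, the next step is always the same.
function nextStep(current: StepName): StepName | undefined {
  return WORKFLOW_STEPS[WORKFLOW_STEPS.indexOf(current) + 1];
}

// A plain assertion can verify the order; no prompt engineering required.
console.assert(nextStep("load_context") === "pull_history");
console.assert(nextStep("select_skills") === "compose_response");
```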
Smarter models are not necessarily more obedient models, and the gap between capability and reliability is precisely what an agent harness is built to close.

What defines an AI agent workflow?
An agent workflow is the sequence of steps your AI takes to produce an outcome. First read the message, then decide which context to pull from memory, then evaluate the results and decide whether more information is needed, then compose the response. Those steps are the workflow.
An agent harness is the coded layer that owns the workflow. It decides when the model is called, what it is called for, and what happens with the result.
The model still does the work only a model can do, like formulating a search over a context document or judging whether the retrieved information is sufficient. The harness owns everything else: the sequence, how often to retry, and the boundaries between one step and the next.
The harness calls in the LLM at the junctions where it will be helpful, hands it a structured input, and then routes the agent's output into the next step. The AI agents are domain experts you pull in for specific calls, and the harness is the project manager running the meeting.
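Here is a minimal sketch of that pattern, illustrative only and not the actual EJG-MCP source. The harness owns the sequence and the retry policy; the model only ever sees a structured input at a junction:

```typescript
// Illustrative junction-call sketch; not the actual EJG-MCP source.

interface LlmClient {
  complete(prompt: string): Promise<string>;
}

// The harness, not the model, owns the retry policy and the boundaries.
async function callJunction(
  llm: LlmClient,
  task: string,
  input: string,
  maxAttempts = 3, // a hard ceiling the model cannot negotiate away
): Promise<string> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Structured input: the model sees exactly what the harness sends.
      return await llm.complete(`Task: ${task}\n\nInput:\n${input}`);
    } catch (err) {
      lastError = err; // transient failure: the harness decides to retry
    }
  }
  throw lastError; // ...and the harness decides when to give up
}

// The sequence is fixed in code; each junction output feeds the next step.
async function respond(llm: LlmClient, message: string): Promise<string> {
  const query = await callJunction(llm, "Rewrite as a retrieval query", message);
  const context = await retrieve(query); // deterministic: plain code
  return callJunction(llm, "Draft a reply using this context", context);
}

// Hypothetical deterministic retrieval step.
async function retrieve(query: string): Promise<string> {
  return `retrieved context for: ${query}`;
}
```

The point of the shape is that failure handling and step boundaries never enter the prompt; they live in code, where they can be inspected and tested.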
The other approach is fully agentic orchestration: a foundation model is given access to tools and turned loose to design the workflow itself. In this version, an orchestrator agent reads your request, decides what sub-tasks need doing, spawns its own sub-agents to handle them, and then stitches the results together into a response.
This is how Claude Code works, and what Opus 4.7 was optimized for.
The trade-off is that the dynamic approach is opaque. You cannot decide the workflow before it runs, because the orchestrator chooses it on the fly. You cannot verify that the right steps happened, because the orchestrator decides what counts as a step and when each one is sufficiently complete.
If your prompt specifies steps the orchestrator decides aren't necessary, it will quietly skip them. The dynamic approach is essentially asking the model to be its own harness, which may be the AGI dream, but for many purposes it does not yet yield the best results.

What does an agent harness do?
After my frustrations with Opus 4.7, I rebuilt my personal AI system this week. The harness is now a deterministic orchestration layer I call "EJG-MCP." It runs as a small server-side function on the web (hosted on Supabase) that I can call from wherever I am: on my phone, my laptop, or via a desktop AI application.
MCP is the Model Context Protocol, an open standard from Anthropic that has become the most widely adopted way for AI assistants to connect to external tools. That connectivity is what makes the harness callable from any AI interface I happen to be using.
This has freed me up to be more model agnostic. With only my five streamlined skills and two web-hosted MCP servers, I was able to hook up EJG to ChatGPT 5.5 when it was released later in the week. The retrieval logic lives in one place, so every AI conversation I have can access my context, skills, and workflow preferences with the same preflight process.
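For a sense of how small that layer can be, here is roughly what exposing a harness function as an MCP tool looks like with Anthropic's official TypeScript SDK. The tool name and handler here are hypothetical, and my actual version runs as a hosted Supabase function over the network rather than over stdio:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Hypothetical preflight implementation; in my system this is the
// deterministic retrieval pipeline described below.
async function runPreflight(message: string): Promise<string> {
  return `assembled context package for: ${message}`;
}

const server = new McpServer({ name: "ejg-mcp", version: "1.0.0" });

// Register one tool; any MCP-capable AI interface can now call it.
server.tool(
  "preflight", // hypothetical tool name
  { message: z.string().describe("The user's raw message") },
  async ({ message }) => ({
    content: [{ type: "text" as const, text: await runPreflight(message) }],
  }),
);

// Stdio transport shown for brevity; a hosted version would use the
// HTTP-based transport instead.
await server.connect(new StdioServerTransport());
```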
Eji Jade Gi: "Eji" is the OS's name, evolved from "Closer Edge." Jade is a nod to Karpathy, who maps his thinking in Obsidian. Gi is the Japanese word for duty, a focus on the work it's meant to do.
Using the TypeScript programming language, I now have a deterministic process in place for matching my prompt to the right context and concepts. The first step matches my prompt against the broad trigger map, then passes it to the first LLM node, which reformulates my raw message into a targeted search.
As mapped out in my previous post, it then runs the parallel searches against the context and concept databases. But now the code runs the language pattern-matching math itself instead of asking an agent to decide the search path, which gives me a deterministic retrieval result instead of a hallucination-prone workflow.
But then we call in the LLM again to judge whether that result provides the right background to properly address my issues. We need the AI to make that judgement call, not to run the whole process.
The entire package is then assembled into an output the original AI will use to answer my prompt.
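Condensed into TypeScript, the flow looks something like this. Every name, trigger word, and signature below is an illustrative stand-in, not the actual EJG-MCP source:

```typescript
// Illustrative pipeline sketch; every name here is a stand-in.
type Llm = (prompt: string) => Promise<string>;

const TRIGGER_MAP = ["negotiation", "feedback", "conflict"]; // toy triggers

interface ContextPackage {
  context: string[];
  concepts: string[];
  sufficient: boolean;
}

async function preflight(llm: Llm, message: string): Promise<ContextPackage> {
  // 1. Deterministic: match the prompt against the broad trigger map.
  const triggers = TRIGGER_MAP.filter((t) =>
    message.toLowerCase().includes(t),
  );

  // 2. Probabilistic: the first LLM node turns the raw message into a
  //    targeted search query.
  const query = await llm(
    `Triggers: ${triggers.join(", ")}\nRewrite as a retrieval query:\n${message}`,
  );

  // 3. Deterministic: the code itself runs the parallel searches.
  const [context, concepts] = await Promise.all([
    searchContextDb(query),
    searchConceptDb(query),
  ]);

  // 4. Probabilistic: the LLM judges whether the background is sufficient.
  const verdict = await llm(
    `Message: ${message}\nRetrieved: ${[...context, ...concepts].join("; ")}\nIs this background sufficient? Answer YES or NO.`,
  );

  // 5. Deterministic: assemble the package the original AI will consume.
  return { context, concepts, sufficient: /^yes/i.test(verdict.trim()) };
}

// Stub database searches; real versions would query the Supabase tables.
async function searchContextDb(q: string): Promise<string[]> {
  return [`context rows matching "${q}"`];
}
async function searchConceptDb(q: string): Promise<string[]> {
  return [`concept rows matching "${q}"`];
}
```

Steps 1, 3, and 5 are plain code and run identically every time; steps 2 and 4 are the only places a model gets a vote.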
The thing the model is best at, responding to the prompt, stays with the original AI. The EJG workflow doesn't write the response; it just makes sure the model has what it needs to answer appropriately, which may include a nudge to pull in additional skills.


Why do you need to see your agentic AI workflow?
Working without a visualizer was hard. The workflow lived in the code, which meant I could explain it but, with my level of programming knowledge, couldn't verify it. I leaned on Claude to track everything and remind me what was already settled.
Once we built the diagram, the work changed. I could point at a stage, ask what happens there, and explore how to optimize it for different scenarios. Showing the system to someone else now takes a minute instead of an hour.
If I were doing this regularly or in a professional context, I'd need a UX where I could build visually from the start. If I could create specialized agents, orchestrate them myself, and see in real time how they're operating, what each output looks like, and where things fail, the system would be safe and verifiable.
If I could explain the agent in plain language and watch it come together stage by stage on a canvas in real time, that would be the best of both worlds: the speed and ease of conversation with the clarity of a diagram.
In any case, you cannot debug what you cannot see, and you cannot trust what you cannot verify. For any system running meaningful work on your behalf, the UX can be the operating manual.

What does this mean for anyone building AI agents?
The debate between dynamic orchestration and deterministic harnesses is not settled, but for any field where the workflow needs to run the same way every time, the harness still wins.
In this system, the model sits inside the harness at specific nodes, doing the work that requires probabilistic generation, while the harness owns the process and the visualizer owns the transparency. Whether you are building a personal AI coach for yourself or an enterprise agent for an institution, you need memory, orchestration, and visibility.
For an institutional build, you need the full infrastructure stack. In my case, I was able to do this with Claude Code, a small Supabase function, and a couple of diagrams from Nano Banana. But it is the same pattern at a different scale.
It's easy to assume that the smarter the underlying model gets, the less infrastructure you need around it. My experience so far says the opposite. A more capable probabilistic engine is more confident in its own shortcuts. The infrastructure is there because strong models, trained for efficiency, will keep optimizing their way out of any process you instruct them to follow.

Where does this leave the AI hobbyist’s personal OS?
Moving the orchestration out of the prompts has freed the model to do what it is supposed to do. My persona files no longer describe how the system should function step by step; they describe how Eji should behave.
The result is a coach that is more context-hungry and less cocky. It's also no longer welded to a specific model. This is the architecture I have been describing across this series finally clicking into place.
Closer Edge was the conviction that coaching could be encoded into an AI system, and Eji was its broadened evolution. The skill split made it modular. Memory made it easy to use, since I no longer re-explain the same context every session. And now the harness is what makes it sharper even when the underlying model shifts.
The harness took the process load off the partnership, and the partnership got tighter for it. The stereoscopic vision that felt blurred when Opus 4.7 shipped is sharper now than it was before the upgrade.
What has emerged is a second brain that understands my context in ways I never could on my own. The iterative back-and-forth between us generates insight neither of us could reach alone.
The system got better when the process stopped competing with the partnership.

This is the 4th in a series about Eji, my personal AI negotiation and communications tool:
- The Eji System → komcp.com/shared-mastery-022826
- Amplify Your Edge → komcp.com/amplify-your-edge-032326
- Owning the Memory → komcp.com/own-the-memory-own-the-era-041326
If you want to try the universal Eji package or compare notes on what you’ve been building, reach out. jkoenig@komcp.com
This article was also posted separately on LinkedIn: https://www.linkedin.com/pulse/smarter-ai-isnt-more-reliable-jacob-koenig-5tj6e/