I wrote a couple of months ago about memory being infrastructure and building your own tools. Since then I’ve been deep in the implementation. This post is about the patterns that actually worked.
The short version: the LLM is one component. The architecture around it is where the real leverage lives. And “architecture” isn’t a hand-wave. It’s a set of specific, solvable engineering problems that most people skip because they’re busy optimizing prompts.
The Decision Architecture
Everyone talks about prompt engineering. Almost nobody talks about decision architecture: the layer that determines what the model sees, what it remembers, what it ignores, and how it recovers when things go wrong.
This is the real product. The model is the engine. The harness is the car.
Some of the decisions that matter:
- What gets captured versus ignored, and the budget for making that call
- How context degrades over time, and what you do about it
- How confidence is scored based on source
- How contradictions are detected and resolved without human intervention
- How the system recovers when a service goes down
None of these require a better model. They require better infrastructure around the model.
Pattern 1: Contradiction Detection
Here’s a problem nobody warns you about. You tell the system “we use Redis for caching” in January. In March you tell it “we switched to Memcached.” A naive memory system keeps both as equally valid. Now your agent holds two contradictory facts and has no way to know which one is current.
The fix is a supersession system. When a new observation shares a key with an existing one, the old record gets marked as superseded automatically. A graph link preserves the history. Retrieval only surfaces the current truth.
This is deterministic. No LLM involved. Memory keys (like `project:kind:slug`) make the matching mechanical. The system doesn’t guess. It knows.
Pattern 2: Not Everything Is Worth Remembering
Raw capture from an AI coding session is incredibly noisy. Stack traces, greetings, file reads, glob results. If you store everything, your retrieval gets polluted with garbage.
My first version stored everything. Every tool call, every file read, every throwaway message. Within a week the database was full of noise and retrieval was useless. The system confidently surfaced a three-day-old stack trace instead of the architecture decision I actually needed. More memory made the agent worse, not better.
I landed on a multi-tier filter pipeline. Fast regex checks first, free and sub-millisecond. Then similarity checks against recent captures to catch near-duplicates. Then deeper classification only when the cheap tiers can’t decide.
Around 60% of raw events get filtered before they touch the database.
The key insight: budget your expensive operations. The deep classification tier has a hard cap per session. If you blow through it, you fall back to cheaper heuristics. This prevents a noisy session from burning through your LLM budget on classification instead of actual work.
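The tiers can be sketched roughly like this. The patterns, thresholds, and budget numbers are invented for illustration, and `deep_classify` stands in for whatever expensive (e.g. LLM-backed) classifier you plug in:

```python
import re
from difflib import SequenceMatcher

# Tier 1: free, sub-millisecond regex checks for obvious noise.
NOISE_PATTERNS = [
    re.compile(r"^\s*Traceback \(most recent call last\)"),        # stack traces
    re.compile(r"^(hi|hello|thanks|thank you)\b", re.IGNORECASE),  # greetings
]

class FilterPipeline:
    def __init__(self, classify_budget: int = 20):
        self.recent: list[str] = []              # sliding window of recent captures
        self.classify_budget = classify_budget   # hard cap on expensive calls per session

    def should_store(self, event: str) -> bool:
        if any(p.search(event) for p in NOISE_PATTERNS):
            return False
        # Tier 2: near-duplicate check against recent captures.
        for prior in self.recent[-50:]:
            if SequenceMatcher(None, event, prior).ratio() > 0.9:
                return False
        # Tier 3: deep classification, only while the session budget lasts.
        if self.classify_budget > 0:
            self.classify_budget -= 1
            keep = self.deep_classify(event)
        else:
            keep = len(event) > 80               # cheap fallback heuristic
        if keep:
            self.recent.append(event)
        return keep

    def deep_classify(self, event: str) -> bool:
        # Placeholder for the expensive classifier; here just a length check.
        return len(event) > 40
```

The ordering is the point: each tier only sees what the cheaper tiers couldn’t reject, and the expensive tier degrades to a heuristic instead of failing when the budget runs out.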
Pattern 3: Multi-Backend Retrieval
Single-backend search always has blind spots. Vectors miss exact phrases. Full-text misses semantic similarity. Keyword search misses both.
What works: run multiple backends in parallel and fuse the results using Reciprocal Rank Fusion with adaptive weights. A question gets higher semantic weight. A quoted phrase gets higher text weight. Short keywords get balanced weights.
Then post-fusion boosts for importance, confidence, time decay, and context relevance.
The weighting matters more than people think. A static 50/50 split between semantic and text search performs worse than adapting per query type. The adaptation logic is simple. The improvement is significant.
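Weighted RRF is only a few lines. This is a generic sketch (the weight values and `k = 60` constant are conventional defaults, not tuned numbers from my system):

```python
def adaptive_weights(query: str) -> dict[str, float]:
    # Pick backend weights from the query's shape; thresholds are illustrative.
    if '"' in query:
        return {"text": 0.7, "semantic": 0.3}   # quoted phrase -> favor exact match
    if query.rstrip().endswith("?"):
        return {"text": 0.3, "semantic": 0.7}   # question -> favor semantic search
    return {"text": 0.5, "semantic": 0.5}       # short keywords -> balanced

def rrf_fuse(results: dict[str, list[str]],
             weights: dict[str, float], k: int = 60) -> list[str]:
    # Weighted Reciprocal Rank Fusion: score = sum over backends of w / (k + rank).
    scores: dict[str, float] = {}
    for backend, ranked in results.items():
        w = weights.get(backend, 0.0)
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Each backend returns its own ranked list; fusion reconciles them.
results = {
    "semantic": ["doc_a", "doc_b", "doc_c"],
    "text":     ["doc_c", "doc_a", "doc_d"],
}
fused = rrf_fuse(results, adaptive_weights("how does cache invalidation work?"))
```

Note that RRF only needs ranks, not raw scores, which is exactly why it’s a good fusion layer: vector distances and BM25 scores live on incompatible scales, and rank is the one thing every backend agrees on.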
Pattern 4: Context Is a Gradient
This one took me a while to figure out. I started with a binary approach: either load the full context or don’t. The agent would either know everything about a project (and blow the context window) or know nothing about it. There was no middle ground, and it made the system feel brittle every time I switched topics.
The fix: instead of “the agent has context or it doesn’t,” treat context as a zoom level.
Some things need full detail. Some things need a one-liner. Some things just need a name so the agent knows they exist. I use a viewport model where the focal topic gets full detail and everything else fades progressively. When you switch topics, the viewport reorients.
The agent always has peripheral awareness without blowing the context window. It knows that a project exists and roughly what it’s about, even if the current focus is somewhere else. That peripheral awareness is the difference between an agent that feels like it knows your world and one that only knows the current conversation.
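A stripped-down sketch of the viewport idea, with zoom level decided by distance from the focal topic (the topic names and the simple neighbor map are invented for the example; a real system would derive distance from a graph or embedding space):

```python
def render_viewport(topics: dict[str, dict[str, str]], focus: str,
                    neighbors: dict[str, set[str]]) -> list[str]:
    # Zoom levels: focus -> full detail, neighbor -> one-line summary,
    # everything else -> name only, so the agent knows it exists.
    lines = []
    for name, info in topics.items():
        if name == focus:
            lines.append(f"{name}: {info['detail']}")
        elif name in neighbors.get(focus, set()):
            lines.append(f"{name}: {info['summary']}")
        else:
            lines.append(name)   # peripheral awareness at near-zero token cost
    return lines

topics = {
    "memory-service": {"detail": "full notes: schema, retrieval config, open decisions",
                       "summary": "the memory backend"},
    "homelab": {"detail": "full notes for homelab", "summary": "self-hosted infra"},
    "blog": {"detail": "full notes for blog", "summary": "personal writing"},
}
neighbors = {"memory-service": {"homelab"}}
ctx = render_viewport(topics, "memory-service", neighbors)
```

Switching focus is just calling the renderer with a different `focus`; nothing is loaded or unloaded, the same data is re-projected at different zoom levels.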
Pattern 5: Workspace Lifecycle
Not all memories have the same shelf life. A core fact about your identity should never expire. A working decision about a current sprint should stay active for a few weeks. A session-scoped note should archive itself after a few days.
I use a workspace system: durable, active, ephemeral, archived. Each category has different retention rules and different default importance scores. The system manages lifecycle automatically. You don’t manually clean up old memories. They demote and archive based on access patterns and age.
This solves the “memory pollution” problem where three-month-old working notes rank alongside current facts. Time decay alone doesn’t solve it. You need categorical lifecycle management.
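The lifecycle sweep can be sketched as a policy table plus a periodic pass (the retention windows and importance scores below are placeholder numbers, not my actual tuning):

```python
from datetime import datetime, timedelta, timezone

# Per-workspace retention policy; None means the record never expires.
POLICY = {
    "durable":   {"max_age": None,               "importance": 1.0},
    "active":    {"max_age": timedelta(weeks=3), "importance": 0.7},
    "ephemeral": {"max_age": timedelta(days=3),  "importance": 0.3},
}

def sweep(records: list[dict], now: datetime) -> None:
    # Demote expired records to the archived workspace instead of deleting them.
    for rec in records:
        max_age = POLICY[rec["workspace"]]["max_age"]
        if max_age is not None and now - rec["created_at"] > max_age:
            rec["workspace"] = "archived"

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
records = [
    {"workspace": "durable",   "created_at": now - timedelta(days=365)},
    {"workspace": "ephemeral", "created_at": now - timedelta(days=10)},
    {"workspace": "active",    "created_at": now - timedelta(days=5)},
]
sweep(records, now)
```

Archived records stay queryable but stop competing with current facts at retrieval time, which is what time decay alone can’t give you.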
Pattern 6: Build Model-Agnostic
This is the one I feel strongest about. Build the memory layer as a service, not a plugin. It should work with any agent that can make API calls. Same memory, same retrieval, same context rendering, regardless of which model is driving the session.
If you couple your memory to a specific model’s plugin system, you’re locked in. When a better model drops, you can’t switch without rebuilding your entire memory infrastructure.
I run the same system across multiple AI coding agents. The model slots in and out. Everything else stays. The harness is what compounds over time. The model is just the current best option for the engine.
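What “a service, not a plugin” means in practice is that the memory layer presents one stable interface and each agent gets a thin adapter. A sketch using a structural interface (the method names here are hypothetical, and the in-memory implementation stands in for a real service behind HTTP):

```python
from typing import Protocol

class MemoryService(Protocol):
    # The contract every agent adapter targets. Nothing here is
    # model-specific; any agent that can make API calls can drive it.
    def capture(self, session_id: str, event: str) -> None: ...
    def retrieve(self, query: str, limit: int = 10) -> list[str]: ...

class InMemoryService:
    # Trivial reference implementation for testing the contract.
    def __init__(self) -> None:
        self.events: list[str] = []

    def capture(self, session_id: str, event: str) -> None:
        self.events.append(event)

    def retrieve(self, query: str, limit: int = 10) -> list[str]:
        return [e for e in self.events if query.lower() in e.lower()][:limit]

svc: MemoryService = InMemoryService()
svc.capture("session-1", "we switched to Memcached")
hits = svc.retrieve("memcached")
```

Swapping the model means swapping the adapter, a few dozen lines, while the capture, retrieval, and rendering behind the interface stay untouched.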
The Bigger Point
Most people building with AI are optimizing prompts. That’s fine for simple tasks. But for anything that needs to work reliably over weeks and months, across thousands of interactions, with messy real-world data, the prompt is the least interesting part.
The interesting part is the system the model operates inside. The capture pipeline. The retrieval fusion. The context rendering. The contradiction detection. The confidence scoring. The offline resilience. The lifecycle management.
That’s the product. The model is a dependency.
Build the harness. The model will keep getting better on its own. The harness is what only you can build.
If you want the philosophical version of why this matters, I wrote about that in Why I Build. This post is the practical sequel.