
Your AI Agent's Memory Is Probably Broken. Here Is Why.

By Rohit Ghumare • May 9, 2026 • 15 min read

I have an agent that ingests my GitHub stars every morning and turns each repo description into a memory. It is supposed to give the agent context about what I have been reading. After a month of running it, the memory store had 2,400 entries. I had starred about 60 repos in that time.

Forty entries per repo. Same content, different IDs.

The memory layer was doing its job exactly as designed. Every time the cron ran, it re-imported every star, generated a fresh UUID for each one, and wrote it. Every run, another copy of every repo. The retrieval layer kept fishing out duplicate copies of the same fact, occasionally with stale variations. The agent started treating my memory store as untrustworthy and stopped consulting it.

This is the most common bug I see in production agent memory layers. It has a name and a fix. The bug is non-content-addressed identity. The fix is fingerprint IDs and reinforcement-on-write.

This post walks through the failure mode in detail, then through the three popular memory layers — Letta, Mem0, and the open-source agentmemory — and how each one does or does not solve it. If you maintain an agent in production, the fix takes about an hour and removes a class of bugs you may not have noticed.

The shape of the bug

Agent memory is stored as records. A record has an ID, some content, and metadata. Most memory APIs look like this:

memory.remember({
  content: "Rohit starred openclawai/openclaw on 2026-04-30",
  tags: ["github", "stars"],
})

Behind the scenes, the memory layer generates an ID. Almost universally, that ID is a UUID or a database autoincrement — a value that has nothing to do with the content. Two calls with identical content produce two records with different IDs. The system has no way to tell they are the same fact.

This is fine if you only write each fact once. It is broken if your input source is replayable. Cron jobs are replayable. Webhook re-deliveries are replayable. Restarts are replayable. Anything that can run twice on the same input will produce duplicates.

And almost everything in a real agent stack is replayable. Re-imports happen because:

  • A scheduled job re-fetches the same source every day.
  • The agent crashes, restarts, and re-processes the queue.
  • You change the prompt, want to re-run on existing data, and re-emit memory writes.
  • You backfill from an export.

By the time the system has been running for a month, the duplicate count is often an order of magnitude or more above the unique-fact count, and retrieval quality is wrecked.

Why retrieval breaks first

You might think duplicates are a storage problem, not a quality problem. Disk is cheap. Just leave them there.

The reason this argument fails is that retrieval ranks by similarity, not uniqueness. When you ask “what has Rohit been reading lately,” the search returns the top-k records by embedding distance. If forty of those records say the same thing, your top-10 result list is two unique facts and eight copies. The agent reads ten records and walks away with two facts. Recall just collapsed by 80%.

Worse, embeddings are not perfectly stable across model versions. The forty copies of “Rohit starred openclaw” will have forty subtly different embeddings if they were written across an embedding-model upgrade. Some will rank higher than the actual fact you want. Now the duplicates are not just noise; they are pushing the real facts out of the result set.

Memory staleness compounds this. Mem0's research notes that detecting when a high-relevance memory has gone stale is an open problem. With duplicates, even the “current” version of a fact has to compete with thirty old versions for retrieval rank. The decay layer can only do so much.

The fix: content-addressed IDs

The fix is small. It is one of the oldest tricks in distributed systems, and it is the same trick git uses.

For any fact derived from an external source you can re-fetch, the ID is a fingerprint of the content. Same content, same ID. Two writes with the same fingerprint are the same memory by definition; the second write is not a duplicate, it is a touch on the existing record.

In the simplest implementation, the fingerprint is a SHA-256 of the canonical fact string:

import { createHash } from "node:crypto"

function fingerprintId(facts: { source: string; key: string; value: string }) {
  const canonical = `${facts.source}::${facts.key}::${facts.value}`
  return "fp_" + createHash("sha256").update(canonical).digest("hex").slice(0, 24)
}

For a richer implementation you also strip whitespace, lowercase identifiers, sort tags, and version the schema so a future change does not silently fork all your IDs.
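
As a sketch, that richer canonicalization might look like the following; the normalization rules and the v1 schema prefix are my conventions for illustration, not a standard.

import { createHash } from "node:crypto"

const FINGERPRINT_SCHEMA = "v1" // bump deliberately when the canonical format changes

function canonicalFingerprint(facts: { source: string; key: string; value: string; tags?: string[] }) {
  const norm = (s: string) => s.trim().toLowerCase()
  const tags = (facts.tags ?? []).map(norm).sort().join(",")
  const canonical = [FINGERPRINT_SCHEMA, norm(facts.source), norm(facts.key), norm(facts.value), tags].join("::")
  return "fp_" + createHash("sha256").update(canonical).digest("hex").slice(0, 24)
}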

On write, the API is now upsert instead of insert:

memory.upsert({
  id: fingerprintId({
    source: "github_stars",
    key: "openclawai/openclaw",
    value: "starred",
  }),
  content: "Rohit starred openclawai/openclaw on 2026-04-30",
  tags: ["github", "stars"],
})

Two calls collapse to one record. Forty calls collapse to one record. The cron job is now idempotent for free.

Reinforcement: turning a bug into a feature

Here is the part that surprised me. Once you have content-addressed IDs and your writes are upserts, you can do something the original system could not: count the writes.

Every time a re-import touches an existing fact, the upsert is a signal that the fact is still true in the source. That is information you would not otherwise have. The stored record can carry a last_seen timestamp, a seen_count, and a first_seen. Now you can:

  • Decay records the source has stopped emitting (the unstar case).
  • Boost retrieval rank by reinforcement, not just recency.
  • Detect freshness drift — a fact whose last_seen is two months stale is a candidate for review.

The upsert path is shaped like this:

interface MemoryWrite { id: string; content: string }
interface MemoryRecord extends MemoryWrite { first_seen: number; last_seen: number; seen_count: number }

// An in-memory Map stands in for whatever KV store backs your memory layer.
const store = new Map<string, MemoryRecord>()
const now = () => Date.now()

function upsert(record: MemoryWrite): MemoryRecord {
  const existing = store.get(record.id)
  if (existing) {
    // A repeat write with the same fingerprint is reinforcement, not a duplicate.
    existing.last_seen = now()
    existing.seen_count += 1
    existing.content = record.content // refresh to the latest wording
    store.set(existing.id, existing)
    return existing
  }
  const fresh: MemoryRecord = { ...record, first_seen: now(), last_seen: now(), seen_count: 1 }
  store.set(fresh.id, fresh)
  return fresh
}

Twenty lines of code take you from “duplicates everywhere” to “reinforcement signal as a first-class field.” The downstream retrieval layer can use seen_count in the rerank step.
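
As one illustration of that rerank step, here is a sketch of a scoring function that blends similarity with reinforcement and freshness; the weights and the decay horizon are assumptions to tune, not recommendations.

interface ScoredMemory { similarity: number; seen_count: number; last_seen: number }

function rerankScore(m: ScoredMemory, nowMs = Date.now()): number {
  const reinforcement = Math.log1p(m.seen_count) // diminishing returns on repeated sightings
  const ageDays = (nowMs - m.last_seen) / 86_400_000
  const freshness = Math.exp(-ageDays / 30) // gentle decay on roughly a month's horizon
  return m.similarity + 0.1 * reinforcement + 0.1 * freshness
}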

How the popular layers handle this

I went back through three memory layers I have used in the last six months and looked at how each one handles auto-derived records.

Letta (formerly MemGPT)

Letta is the most explicit about memory as a first-class component. Its model has core memory blocks and archival memory, with the agent using tools to manage what is paged in and out. Records have UUIDs by default.

Letta's “memory blocks” are designed to be edited in place — the agent uses core_memory_replace and core_memory_append tools, which give you something like idempotency for the in-context portion of memory. But for archival memory, which is the part you write to from cron jobs, the IDs are not content-derived. You have to layer the fingerprint pattern on top.

The fix in Letta land is to compute the fingerprint client-side and store it as a key in the metadata. The retrieval layer can then dedupe before returning. It works, but it is not the default and the docs do not push you toward it.
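
Here is a minimal sketch of that dedupe-before-return step. The ArchivalHit shape and the metadata.fingerprint field are my stand-ins for however you stash the client-side fingerprint, not Letta's own API.

interface ArchivalHit { content: string; metadata: { fingerprint?: string } }

function dedupeByFingerprint(hits: ArchivalHit[]): ArchivalHit[] {
  const seen = new Set<string>()
  return hits.filter((hit) => {
    const fp = hit.metadata.fingerprint ?? hit.content // fall back to raw content if untagged
    if (seen.has(fp)) return false
    seen.add(fp)
    return true
  })
}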

Mem0

Mem0's architecture is built around a fact-extraction pipeline that explicitly tries to deduplicate at extract time. From the Mem0 paper: the pipeline accepts a 6-percentage-point accuracy trade in exchange for 91% lower p95 latency and 90% fewer tokens, partly by aggressive consolidation.

That gives you deduplication for free in the “agent observed two facts about the user across two conversations” case. It does not give you deduplication for the “cron re-imported the same source” case, because the extractor does not know that yesterday's import is the same source as today's.

The fix in Mem0 land is to use the metadata.user_id + a deterministic key for source-keyed records, and to delete-then-write rather than blind insert. The Mem0 SDK exposes this; it just is not the path the docs walk you down.
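
A sketch of that delete-then-write path is below. The SourceKeyedClient interface, its method names, and the source_key metadata field are assumptions standing in for whatever your Mem0 client (or any other SDK) actually exposes, not the documented API.

interface SourceKeyedClient {
  findByKey(userId: string, sourceKey: string): Promise<{ id: string } | null>
  delete(id: string): Promise<void>
  add(userId: string, content: string, metadata: { source_key: string }): Promise<void>
}

async function writeSourceFact(client: SourceKeyedClient, userId: string, sourceKey: string, content: string) {
  const existing = await client.findByKey(userId, sourceKey)
  if (existing) await client.delete(existing.id) // replace the source-keyed record instead of stacking copies
  await client.add(userId, content, { source_key: sourceKey })
}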

agentmemory

agentmemory is the open-source memory layer I have been working on. It runs on the iii engine I wrote about in the first post in this series. It does content-addressed IDs as the default for any auto-derived record, with the upsert-and-reinforce path baked into the API.

import { remember } from "@agentmemory/sdk"

await remember({
  fingerprintId: {
    source: "github_stars",
    key: "openclawai/openclaw",
    value: "starred",
  },
  content: "Rohit starred openclawai/openclaw on 2026-04-30",
  tags: ["github", "stars"],
})

Pass fingerprintId instead of letting the SDK generate one, and the write becomes idempotent. The store carries first_seen, last_seen, and seen_count automatically. The retrieval layer's rerank already considers seen_count.

I am not pitching the package; I am pitching the pattern. If you do not want another dependency, the same pattern is forty lines on top of any KV store. The point is that auto-derived records are a different category of memory from agent-observed ones, and they want a different write path.

Where the pattern does not apply

Content-addressed IDs are not the right answer for everything.

Conversational observations. When the agent infers something from a chat (“the user prefers terse responses”), the fingerprint is harder to compute because the same observation can be expressed many ways. You want fact extraction with semantic clustering — what Mem0 does — not content-addressed IDs. Use both. They solve different parts of the problem.

Time-series facts. A weather observation at 14:00 today and at 14:00 tomorrow are different records, even if the temperature happens to be the same. Include time in the fingerprint or do not use a fingerprint at all.
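
If you do keep a fingerprint for time-series facts, one option is to bake a time bucket into the key, reusing the fingerprintId helper from earlier; the source name and hour-level bucket here are made up for illustration.

const id = fingerprintId({
  source: "weather_station_7",            // hypothetical source
  key: "temperature@2026-05-09T14:00Z",   // hour-level bucket baked into the key
  value: "21.4C",
})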

Ground-truth corrections. If the agent learns “the previous fact was wrong, the new one is correct,” you want a write that tombstones the old record, not an upsert that overwrites it. Track corrections explicitly.

The thing I keep coming back to

I think this matters more than the framing suggests, because the duplicate-on-replay bug is the kind of thing that does not show up in any benchmark. The benchmarks all use a fixed corpus. They measure recall on a snapshot. Production is a stream, and a stream that is a few months old is a different organism from a fresh corpus.

The memory layer that performs well on LongMemEval today might fail in production after sixty cron runs because the writes were not content-addressed and the retrieval layer is now surfacing duplicates. The benchmark cannot see this.

The fix is small enough that you should just do it. Compute fingerprints for any record that comes from a source you might re-import. Use upserts. Track reinforcement. Watch the duplicate ratio in your store and make it part of your dashboards.

Memory is the part of an agent system that compounds. A chat handler can be replaced. A retrieval index can be rebuilt. A memory store with a year of corrupted writes is the one thing you cannot cheaply repair after the fact. Get the write path right.

What I would do tomorrow

If I were starting an agent project today and the memory layer was a TODO, I would do this:

  1. Use whatever memory layer you like for conversational observations. Mem0, Letta, vector store with a thin wrapper — they all work fine for the “agent inferred this from chat” case.
  2. Wrap it with a second write path for auto-derived records. That path takes a fingerprint ID and does upsert-and-reinforce. Forty lines of code.
  3. Add a seen_count tiebreaker to your retrieval rerank. Free signal once you have it.
  4. Track duplicate ratio (records sharing the same fingerprint) as a metric. It should be near zero. If it is climbing, your write path has regressed.
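
A sketch of that duplicate-ratio check is below, assuming each stored record carries its fingerprint as a field, as it would if you layered the pattern onto a store that generates its own IDs.

function duplicateRatio(records: { fingerprint: string }[]): number {
  const counts = new Map<string, number>()
  for (const r of records) counts.set(r.fingerprint, (counts.get(r.fingerprint) ?? 0) + 1)
  const duplicates = [...counts.values()].reduce((sum, n) => sum + (n - 1), 0)
  return records.length === 0 ? 0 : duplicates / records.length
}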

That is it. The whole pattern fits in an afternoon. The recall improvement is large enough that you will notice it in agent quality the same week.

I have one more post coming in this series. The next one looks at how an agent should choose between writing to memory and writing to durable state — the line between “remember this” and “persist this” turns out to be subtler than I thought, and I have changed my mind on it twice in the last two months. If you want the followup, the blog index has the rest.
