$ less prompt-caching-47k.md 2026-05-28 · ~9 min · Antwerp

prompt caching a 47K-token medical doc with Claude Sonnet

A reference manual that doesn't change, attached to a conversation that does, billed once.

The product is an AI recovery companion. The reference material is a 47,000-token medical handbook. Every conversation needs the handbook present — the model has to know what it's allowed to say, what to refuse, and which of the seventeen overlapping definitions of relapse applies to which clinical pathway. Without caching, each turn would re-bill the whole handbook. That math doesn't work for a product priced in dollars per month.

Anthropic introduced prompt caching for exactly this case: a large, mostly-stable prefix you want to keep hot across many short turns. The price you pay is a one-time write — usually about 1.25× the normal token rate — and then reads are about a tenth of the cost. The catch, and there is always a catch, is that the cache key is the exact bytes of the prefix. Move a comma and you pay for the write again.

what we actually cache

The handbook lives at the top of the system prompt as a single, immutable block. Below it: the conversational system instructions, which change rarely. Below that: the conversation itself, which changes every turn. We mark the cache breakpoint after the handbook so only the handbook's bytes are written once, and the cache hit fires on every subsequent turn for the same session.

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": HANDBOOK_47K,
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": SYSTEM_INSTRUCTIONS
            }
        ]
    },
    *conversation
]

The first turn pays the write — about 58k input tokens at the inflated rate. Every turn after that, for the next five minutes, pays the cache read rate on those 58k tokens and the normal rate on whatever's new. At our average conversation length (about 14 turns over six minutes), the savings compound from "noticeable" to "load-bearing."

what can quietly invalidate it

The list is short and brutal:

Any byte change in the cached prefix — including invisible ones like Word's smart-quote autocorrect leaking into a content edit, or a YAML round-trip rewriting : to :.
The five-minute idle window passing. Quietly reset.
A different model version. claude-sonnet-4-5-20251022 and claude-sonnet-4-5-20260201 are different cache namespaces.
System-message ordering changes downstream of the cache breakpoint don't break the cache, but if you move the breakpoint, you've moved the key.

We bake a SHA-256 of the cached prefix into our build artifact and assert it on boot. If it doesn't match the value committed in the last release, the deploy refuses to come up. This caught two near-misses in the first month — both were a content editor "improving" a sentence that lived inside the cached block.

the one mistake that resets everything

The cache control marker has to be on the last block you want cached, not the first.

This is in the docs. I read the docs. I still got it wrong for two days. If you put cache_control on the first block of a multi-block message, you cache only that block; everything after it is uncached prefix and pays full freight forever. The cache hit rate tells you immediately — we shipped a small refactor that introduced a second block, the bill quadrupled overnight, and the dashboards quietly began telling a story I was too proud of to notice.

what i'd do differently

Treat the cached prefix as a build artifact. Version it, hash it, refuse to deploy without the hash. Log cache_creation_input_tokens and cache_read_input_tokens from every response and alarm on the ratio. The model is fast; the bill is slow; the gap between "this works" and "this works economically" is where products quietly die.

# compiled from production logs · anmol@anmol.be