Subconscious Cache: Reliably Capture Your Agent Context

Every time an agent compacts its context, today's inference systems throw away work they already computed and re-encode it from scratch, exactly when the trace is longest and the user is least patient. Subconscious Cache is the solution. Built into our runtimes, it makes our models faster and more accurate on reasoning tasks, all in an OpenAI Completions and Anthropic Messages API compatible format.

TL;DR

General-purpose LLM inference systems waste compute on agent workloads. Every time an agent compacts its context, the prefix cache is invalidated and large spans of already-computed tokens get re-encoded, exactly when traces are longest.
Subconscious Cache fixes this by reusing a cached suffix as well as a cached prefix. When an intermediate message span is pruned, the information those tokens contributed survives implicitly in the retained suffix states rather than being thrown away and re-encoded. In this way, pruned information remains in the models "Subconscious" reasoning.
Concrete wins running Qwen3.6-27B with TIMRUN and the Subconscious Cache against SGLang: 50% lower latency per agent task (the same compute now serves 2x the concurrent users), up to 3x system throughput on long-horizon tasks, and 100-step agent tasks held under peak context. On 4x H100, the same system beats vLLM on both latency and throughput by a significant margin.
Auto Compaction lets our TIM models prune their own reasoning history at inference time. This no longer requires server-side tool calls or a recursive JSON format. TIM now compacts automatically inside the familiar chat completions and messages formats.
Results below are mostly on small language models. Frontier open models such as Kimi K2.6 and Nemotron are coming in the following weeks.

Background

The context problem for AI agents is far from solved, and it bottleneck both capability and efficiency.

On the capability side, modern agents rely on multi-hop tool calls, long reasoning traces, and self-healing from mistakes. Each of these inflates the message list. Even frontier LLMs with 1M-token windows begin to degrade as the window fills up, a phenomenon often called context rot. The standard remedy, context compaction, is itself lossy: it strips reasoning memory, intermediate instructions, and long-term constraints that the agent still needs. Open-source models feel this pain earlier and more frequently as an agent trace accumulates. State-of-the-art open alternatives such as Qwen and Nemotron cap out around 256k tokens, so agents built on them must compact more aggressively, and pay a steeper accuracy cost when they do.

On the efficiency side, the prefix caching mechanisms that modern inference systems rely on are helpful for building agents [Manus AI 2025, Cognition 2025], but still not efficient enough. Frequent context engineering constantly invalidates the cached prefix, forcing the system to re-encode large spans of tokens that were, in effect, already computed moments ago. Throughput collapses and latency spikes exactly when the agent trace is longest and the user is least patient.

In this article we introduce our approach to both problems: Subconscious Cache and Auto Compaction. Subconscious Cache extends prefix matching to also reuse cached suffixes, so that pruning a span of intermediate tokens no longer throws away the latent information they contributed and no longer forces a re-encode of everything that followed. Auto Compaction, building on our earlier work, lets the model itself decide what to prune at inference time, and relies on Subconscious Cache to preserve memory across those prunes. Together, the two mechanisms deliver better agent reasoning and substantially better inference efficiency.

Agents are bottlenecked by prefix cache

To partially prevent redundant computation, modern LLM inference systems preserve a prefix cache. When a new request comes in, it tries to match tokens via prefix lookup in the cache and reuse their KV pages and recurrent states (for hybrid models). This approach significantly reduces the redundant input encoding (prefilling), and thus improves the throughput for model servers and reduces latency for users and developers. Because re-computation is avoided, the frontier model providers only charge a fraction for cached tokens. To take advantage, frontier agent builders have been focusing on improving token hit rate for better efficiency (Manus article).

However, the prefix cache alone cannot effectively capture the entire context of an agentic task. As shown below, the long agent reasoning traces experience a pruning processing. When a new user message or tool response is appended to the message list, some old content is removed. This pattern happens in many use cases.

Assuming we are dealing with an agent loop. At turn k, the input consists of three sections of tokens: A, B, C, and the LLM generates the Output_k. In the next step, our agent harness appends new input D, but prunes tokens in section B during context engineering. This forms new input sequence for the LLM: A, C, Output_k, D. Such behavior is very common in modern AI application including chatbots and agents.

Chat: for long conversations, only keep the last few rounds. The removed section B can be a few user-assistant message turns.
Multi-modal reasoning: only keep the latest few multi-modal inputs. Each turn, section B can stand for an old image getting pruned from its message.
Coding: different coding agents processes context differently, but they all need to prune context to maintain a healthy context window and also try to maintain as much useful information as possible.

In LLM inference systems with a standard prefix cache, when section B is pruned, the cached hidden states of the following section C, Output_k becomes obsolete because there is no cached prefix A, C, output_k. As a result, only section A can be reused as matched prefix, but C, Output_k needs to be re-computed as new inputs together with section D, introducing two deficiencies.

Standard prefix caching. After section B is removed, section C is no longer a cache hit and must be recomputed.

Firstly, redundant computation decreases inference efficiency. For example, with a computer use agent, where the harness provides the latest 10 screenshots to the LLM and prunes out the older ones for context efficiency. With prefix caching only, each time one image is pruned, all following tokens (10 images, reasoning steps, user / tool inputs) need to be re-encoded.

Secondly but more concerning, the re-computed A C’ Output D trace no longer preserve the information from section B. Dropping contextual information often has one or more negative side-effects on the model, it:

Loses long-term memory
Stops following user instructions in a multi-turn interaction. E.g., the model forgets a user constraint set in a pruned chunk
Is more prone to hallucination

Subconscious Cache: Bridging prefix and suffix

Agent inference systems need to be aware of context engineering. We capture “engineered” context by detecting both prefix and suffix and reuse their cached KV and recurrent states. By doing this, the system preserves the information of pruned tokens “subconsciously” - they are no longer in the message list, but the states of suffix tokens carry their knowledge implicitly.

When section B is automatically compacted, Subconscious Cache renders a full prefix and suffix cache hit.

Standard prefix matching only recovers the longest cached prefix A of an incoming request: input = A · C · Output_k · D. Subconscious matching extends this by also reusing a cached suffix C · Output_k.

A Subconscious Cache hit fires with the following criteria (referred to as subconscious rules in following sections):

1. The cached chain can be precisely split into three sections A, B, C

2. The new input chain can be precisely split into three sections A, C, D, where A and C matches sections A and C in the cache, and len(C) > threshold

If these two conditions are met, our system recognizes that B is pruned and D is the new input. We reuse A, C, Output_k to encode D instead of re-encoding C, Output_k, D based on A only. This preserves more memory in long-horizon agent reasoning, reduces latency, and significantly improves cache hit rate.

How to manually trigger Subconscious Cache in practice:

1. Append the generated response and new input message as it is to the tail of the previous message list

2. Prune a continuous chunk of tokens from the message list. For example, one or more assistant-user / assistant-tool messages pairs, an image in a history message, a substring of a tool response

3. Send the new message list to the LLM api, you’ll see increased cache hit numbers.

Computer Use + Subconscious Cache

Computer use is a typical use case for Subconscious Cache, as is other multi-hop, multi-modal agent tasks. Because of the long-horizon nature of the task, a model needs to process tons of screenshots in each reasoning task, and in most cases we can only keep the latest images in the message lists. When previous images get pruned, we lose long-horizon visual memory, and every remaining image gets re-encoded each time because they cannot hit the prefix cache. Removing upstream images from the message list changes the prefix, causing a cache miss.

Subconscious Cache solves this problem. Even when previous images are pruned, the inference system will hit both the prefix and suffix cache since the following inputs satisfy the subconscious rules. This allows us to better maintain long-horizon visual memory (pruned images are encoded in suffix tokens), significantly increase cache hit rate, and improve reasoning efficiency.

In our experiments, we found this system can significantly improve the performance of computer use agents. Taking Qwen3.6-27B as the backbone model, we observed the following success rates from the OS World computer use benchmark.

Subconscious cache improves computer use agents by memorizing pruned visual inputs

The results show that Subconscious Cache significantly improves long-horizon multi-modal reasoning on computer use tasks by preserving historical visual signals. We will reveal more information about RedLine, our training system, built by Dr. Wei Fang in our next release.

Auto Compaction: Runtime Context Engineering with Subconscious Cache

Long-horizon agents push the language model context window ever longer, with costs on two fronts. Frontier LLMs do reason better with long traces, but once the 1M-token window fills, the standard /compact command - which summarizes prior context to shorten the input - becomes destructive: the agent forgets critical memory, instructions, and intermediate goals. Smaller models fail earlier; they lose track of the reasoning trace well before the window is full, producing the now-familiar long-context hallucination and context-rot failure modes.

Throughput suffers in parallel. Every extra token in the agent history is a token the inference system must attend to, allocate KV cache for, and decode against. As traces grow, concurrency on cloud GPUs drops sharply and consumer GPUs hit OOM long before the model's nominal context limit. /compact does not address either side of this problem - by the time the user (or the agent loop) decides to compact, the request has already paid the prefill and KV cost of a fully expanded context, and the summary that survives is lossy.

Building on Luo et al., 2025, we developed TIMRUN, an inference system that compacts the input message list continuously at inference time. Every time TIMRUN receives new messages from the agent loop, it scans for prunable subtasks in the running message list. When one is found, TIMRUN removes that span of tokens and proceeds on the pruned input. With aggressive pruning, the peak context of a 100-step agent task can be held below 32k tokens. At that time, our model had to enforce server-side tool call to enable this efficiency.

Powered by Subconscious Cache, the new TIMRUN no longer forces server side tool calls and is compatible with OpenAI chat completion and Anthropic messages API formats. TIMRUN prunes at most one contiguous span per step, every agent step lands cleanly on cache, hitting either the prefix cache (when nothing is pruned that step) or the Subconscious Cache (when context is compacted). The agent never needs to saturate its context window, and developers never need to call /compact by hand. Long-horizon runs keep going without context limits, without memory loss.

On programming tasks, the auto compaction mechanism improves coding agents. Below are results of running SWE-Bench-Lite with the OpenHands harness. Post-training on our Redline system further improves the results.

Auto compaction improves off-the-shelf models on coding, and Redline training further improves the model

Efficiency Improvements

We constructed three sets of experiments to compare our inference system powered by Subconscious Cache over baseline inference systems. We configured two scenarios: simple agent tasks (10+ turns, 8k tokens) and complex agent tasks (50+ turns, 128k tokens) running on SGLang and vLLM vs TIMRUN. We assess average task latency, time to first answer token (TTFAT), system throughput, and per-request throughput as key metrics.

We found our system consistently outperforms SGLang with lower latency, reduced time to first answer token, higher system throughput, and per-request decoding speed.

Running the Qwen3.6-27B model in consumer GPU environments, our system can handle bigger batches more efficiently. We report the following findings:

1. TIMRUN reduces 50% latency per agent task for each user using the same compute. It means that inference providers can use the same compute footprint to serve 2x users (6 to 12, 8 to 16) without increased latency per agent task.

Compacting context in inference time decreases per-task latency for long-running agents

2. On long reasoning tasks, TIMRUN achieves significantly higher decoding speed per request:

Agents using subconscious auto compaction achieves higher per-request decoding speed and better memory efficiency

3. TIMRUN handles bigger batch sizes more efficiently than SGLang, achieving up to 3x system throughput for long-horizon agent tasks using the same compute

Auto compaction with Subconscious Cache significantly boosts system throughput

4. On product-level cloud deployments with enough GPU memory (4x H100), we found TIMRUN also significantly outperform vLLM in both per-task latency and system throughputs.

What's next

As a preview, we show improved small language model performance and efficiency on dedicated H100 inference endpoints, but this is just the beginning. We're excited to apply this technology to larger open source models like Kimi K2.6 and Nvidia's latest Nemotron models and work with AI labs and neoclouds to accelerate their inference workloads.

Below is a first peek at Kimi K2.6 performance using Subconscious Cache on a B200 instance. Stay tuned for more!