Subconscious + GLM-5.2 Makes "/compact" Obsolete

Picture this: You’re three hours deep into a complex refactor. Your coding agent has traced 100 steps across your code base, cloning the service, mapping dependencies, drafting the migration script. Then, right as it’s about to apply the final patch, it freezes. Context limit exceeded.

Long-context LLMs working as coding agents

Your coding agents can handle this. Slam /compact.

Then the agent wakes up, but it’s not the same. It forgot the original PRD retrieved during the work. It dropped the strict security constraints you set at step 50. It hallucinates a breaking API change. It's about to ruin your afternoon.

We’ve been told that massive context windows—1M, 2M, even 10M tokens—are the silver bullet. But here’s the dirty secret of long-horizon coding agents: bigger is not better. It’s actually slower, pricier, and more fragile. Processing a bloated history burns through GPU memory, spikes latency, and forces you to manually prune critical reasoning just to keep the agent running. The longer the task, the more you pay, and the more you lose. Also very importantly, we don’t want to pay for 5M input tokens in each LLM API call.

But what if your agent never needed a manual /compact?

What if it could run for hundreds of steps, retain every architectural constraint and user instruction that matters, yet automatically drop the exploratory dead-ends, all without ever hitting the context ceiling?

We found the fix.

We paired GLM-5.2’s elite reasoning with TIMRUN, our inference system optimized for long-horizon agents. TIMRUN is specifically tuned for agentic workloads and features the Subconscious Cache, an intelligent caching layer that preserves memory through compaction.

This system gives frontier open LLMs (GLM, Kimi, Qwen, Nemotron) the ability to self-compact on the fly. No manual triggers. No abrupt memory wipes. As the agent thinks, auto-compaction continuously trims the fat while preserving the critical context. The model never forgets the PRD, never loses the thread, and through all of this never misses a cache hit.

Think about that: Millions of tokens remembered, yet only 150k tokens actively processed each turn. That's the difference between a sluggish bot and a relentlessly fast coding partner.

No /compact. No memory loss. No context-limit panic.

Our system enables pure, unbroken velocity from the first prompt to the final merge.

The Context Paradox

The open secret of long-running coding agents: the larger the context window, the less accurate it becomes.

On paper, the math is simple. A 100-step coding agent trace with system prompts, tool definitions, file diffs, and terminal outputs easily balloons past 150k tokens. So to support more ambitious use cases, the AI labs responded with models with massive context windows: 1M, 2M, even 10M tokens. "Just fit the whole history!"

But in practice, this creates a paradox that breaks every agent trying to run beyond a few dozen steps. Let's walk through it.

1. Long-running agents desperately need more than 1M tokens.

A single deep refactor spanning dependency resolution, multi-file test generation, and iterative error recovery accumulates massive reasoning traces. The agent needs to remember the original PRD, the architectural constraints, and every failed hypothesis to avoid repeating mistakes. For truly autonomous, long-horizon tasks, 1M tokens isn't a luxury, it's the bare minimum.

2. Pushing past 1M tokens is an inference and economic nightmare.

Attention scales quadratically. Processing a million+ tokens per step crushes GPU memory, spikes end-to-end task latency through the roof, and destroys throughput. At commercial API rates, a single 50-step agent run can cost you several dollars—and that's before you scale to hundreds of concurrent developers. The math simply doesn't pencil out for daily engineering workflows.

3. If you swallow the astronomical cost, context rot renders it unreliable.

This is the cruelest part. Even when you pay top dollar for a massive context window, you don't get the memory you bargained for. Beyond ~200k tokens, the "lost-in-the-middle" problem kicks in. Early instructions decay. Critical constraints fade into noise. The PRD you set at step 10 is technically in the window—but the model can't retrieve it when making decisions at step 50. You're effectively paying a premium for a model with amnesia.

So where does that leave you?

Keep the full history? You bleed money on inference costs and GPU time, and the model still suffers from context rot.
Manually compact? You slash costs and speed, but you amputate the agent's memory, guaranteeing failure on complex tasks.

It's a no-win trap. Every developer using Claude Code or Codex has hit this wall. You either burn cash on bloated contexts, or you sacrifice intelligence on the altar of /compact.

Unless, of course, you break the paradox entirely and use TIMRUN.

TIMRUN enables model-driven context engineering

What if the context window wasn't a cage you constantly fight, but a dynamically managed workspace the model tends to itself?

That's the paradigm shift behind TIMRUN. Instead of forcing developers to play amateur memory-managers, we gave GLM-5.2 the agency to actively engineer its own context at runtime, without human intervention.

Here's how it works.

Model-driven Context Management

Traditional agents rely on heuristics like "keep the last N messages" or "summarize everything every 20 steps." These are reactive band-aids. They always lose critical signal in the noise.

TIMRUN flips the script. GLM-5.2 continuously evaluates its own reasoning history. When it detects a a resolved dependency or a discarded hypothesis, it prunes that chunk autonomously, right in the middle of execution. Think of it as fluid, ongoing housekeeping without a user prompt or a hard coded limit.

Subconscious: context without actual tokens

Here's where the magic happens, and where TIMRUN separates itself from every other "summarization" trick.

When GLM-5.2 prunes the middle of a reasoning trace, most inference engines pay a steep cache penalty. They lose the KV cache alignment and force a complete re-encoding of the remaining tokens, spiking latency by 4x and burning GPU cycles. Without those cached KV states, the model also completely forgets about the pruned tokens.

Subconscious Cache eliminates that penalty entirely.

Our system doesn't completely forget the pruned section. It retains the cached representations of the remaining KV states in the cache. When the engineered context is fed back in, TIMRUN seamlessly stitches the cached prefix and latest progress together with zero re-encoding, zero latency spike, zero cache miss.

So GLM-5.2 can prune aggressively, keeping only the active "working memory" lean and fast, while Subconscious Cache preserves the subconscious knowledge of everything that came before. The model gets the memory of a multiple million-token giant with the speed and cost of a 150k-token micro.

Model-driven ≠ heuristic-driven

Let's be clear: this isn't a sliding window. It isn't a fixed summarization policy. It isn't an external orchestrator deciding what to keep.

The model drives the compaction itself. GLM-5.2 decides prunable tokens, which tool outputs are obsolete, and which reasoning threads are safe to archive. The compaction is semantic, not mechanical. And because Subconscious Cache keeps the cached progress alive, GLM-5.2 can recall any archived information instantly—without paying the token cost to keep it in the active window.

What this means for your coding agent

You never hit /compact again. The model takes care of it, progressively, in the background.
You never pay for redundant tokens. The active context stays slim hovering around 150k tokens even after 100+ steps.
You never lose the thread. The PRD, the architectural constraints, the add-on instructions live on "subconsciously" and resurface exactly when needed.
You never stare at a spinning cursor. No re-encoding means sub-second TTFT, step after step.

This is context engineering at the model level. We believe it will be a fundamental rethinking of how agents handle memory and how language models should be trained.

With TIMRUN + GLM-5.2, the developer can focus on the logic of the task. The model focuses on the housekeeping of its own context.

Enable Subconscious for Your Coding Agents

The GLM-5.2 model hosted by https://www.subconscious.dev/ has enabled auto-compaction and subconscious cache to improve agent inference efficiency and long-term memory. You also get $50 credits to test the model performance

Get an API key on the website, and integrate with your coding agents.

Integrate with Claude Code:

export ANTHROPIC_BASE_URL=https://api.subconscious.dev &&
export ANTHROPIC_AUTH_TOKEN=YOUR_SUBCONSCIOUS_API_KEY &&
export ANTHROPIC_MODEL=subconscious/glm-5.2 &&
export DISABLE_AUTO_COMPACT=true &&
claude

Integrate with Codex:

export SUBCONSCIOUS_API_KEY=YOUR_SUBCONSCIOUS_API_KEY &&
codex -c model_providers.subconscious.name=Subconscious -c model_providers.subconscious.base_url=https://api.subconscious.dev/v1 -c model_providers.subconscious.env_key=SUBCONSCIOUS_API_KEY -c model_provider=subconscious -c model=subconscious/glm-5.2

Integrate with OpenCode

export SUBCONSCIOUS_API_KEY=YOUR_SUBCONSCIOUS_API_KEY &&
export OPENCODE_CONFIG_CONTENT='{"$schema":"https://opencode.ai/config.json","provider":{"subconscious":{"npm":"@ai-sdk/openai-compatible","name":"Subconscious","options":{"baseURL":"https://api.subconscious.dev/v1","apiKey":"{env:SUBCONSCIOUS_API_KEY}"},"models":{"subconscious/glm-5.2":{"name":"Subconscious","tools":true}}}},"model":"subconscious/glm-5.2"}' &&
opencode

Select the model by typing `model` and search for `subconscious`

For a wider array of templates, we have more examples for different use cases:

https://www.subconscious.dev/templates

We're excited to see what you build!