Context Intelligence is the Key for AI Agents: March 2026 Benchmarks
April 2, 2026

Hongyin Luo
Co-Founder & CTO

Jack O'Brien
Co-Founder & CEO

Owen Stepan
Software Engineer
AI agents are built in a disconnected way. Language models are trained to consume any input context, and software frameworks construct context for agents based on heuristics.
To make better agents, most resources are being spent on training models with bigger context windows and building heavier agent frameworks. When models can't operate without a 512,000-line scaffolding system like Claude Code holding them together, we're not as close to AGI as we think.
There is a reason agent developers had to rely on very heavy frameworks. As an agent acts, its context window fills with tool outputs, failed explorations, and stale reasoning. The signal-to-noise ratio keeps dropping, and the model loses track of the original goal: performance degrades due to context rot. Engineers have had to spend considerable resources building complex context management systems to feed models the necessary, concise context. The performance of agents depends on the quality of their context.
We are building the missing piece: letting models control their own context and act as their own “neural harness.” Since we published a research preview last year, we’ve scaled evaluation to three established agent tasks that are reshaping the world: browser control, computer use, and coding. We prove that our approach fundamentally solves the core problem behind many challenges in agent building: context rot, memory loss, and compute waste.
We trained models with context intelligence
- Our models control their own KV cache during inference
- Concise context is maintained for long horizon tasks
Context intelligence unlocks agent productivity
- Led to frontier-level performance on WebArena, OSWorld, and SWE-Bench-Lite.
- Increased inference speed (tokens per second per request) by 50%.
- Saved us 75% compute for model training.

These benchmarks are a snapshot of what’s possible, and we’ve seen evidence that our agent engines are extremely capable in scenarios where they need to explore, take action, prune, and repeat. This is the pattern in browser use, computer use, and coding, and it extends to virtually all other complex agentic use cases.
Model-driven Context Intelligence
To recap, our thread inference model (TIM) structures agent reasoning as a tree of subtasks rather than a linear chain of thoughts. When a subtask completes, the runtime (TIMRUN) collapses the full reasoning trace, including all the intermediate steps and tool calls, into a compressed state. The corresponding KV cache entries are pruned from GPU memory.
We introduce two TIM engines: a compound engine that operates frontier models for state-of-the-art performance, and unified engines, uber-efficient yet capable, in which TIM models are trained and hosted end-to-end on our workstations. In the benchmarks below, we show how our engines achieve state-of-the-art performance AND efficiency without extensive prompting or software harnesses, and that removing noise actually helps models focus, especially on long-horizon tasks, while significantly improving compute utilization.
Evaluation
We evaluated TIM on three tasks: computer use (OSWorld), browser use (WebArena), and coding (SWE-Bench-Lite).
Browser Control (WebArena)
WebArena tests agents on tasks across realistic websites: a Reddit-style forum, GitLab, an e-commerce store, and a map service. The agent needs to complete multi-step goals by navigating pages, filling forms, searching, and so on, and is graded on functional correctness. These tasks are hard due to the dynamic content and multi-tab navigation challenges of real-world websites.

The TIM compound engine reaches 70% on our test set of 100 instances, outperforming Claude Code and OpenAI Operator. The gap is particularly notable on navigation-heavy and form-filling tasks, exactly the multi-step sequences where context accumulates fastest. While other agents may lose track of which fields they’ve already filled or which pages they’ve visited, TIM’s subtask pruning keeps only the relevant state without expensive compaction, letting the model act on clean context at every step.
Computer Use (OSWorld)
OSWorld is an open-ended benchmark of real-world computer tasks executed in live Ubuntu desktop environments, including file management, web browsing, office productivity, and multi-app workflows. The agent observes screenshots and acts via keyboard and mouse. Tasks include complex multi-step (100+) workflows such as researching a topic across multiple apps and compiling the findings into a document. We test on 100 such tasks.

TIM Compound scores 68%, edging out Claude 4.6 Sonnet and GPT 5.4. Computer use is the most context-intensive of the three benchmarks: every action generates a full screenshot, and complex tasks can run for hundreds of steps. This is where efficient context management pays off.
Coding (SWE-Bench-Lite)
SWE-Bench-Lite tests agents on resolving real GitHub issues from popular Python repositories. It requires deep code comprehension and multi-step debugging workflows across large repository contexts. This benchmark tests whether TIM’s context management generalizes from GUI-based computer and browser use to coding and using developer tools on computers.

TIM-9B, our unified engine, achieves 53%, almost matching Claude 4 Sonnet and significantly outperforming similarly-sized state-of-the-art open-source models. A 9B model matching last-generation frontier performance on a coding benchmark is a direct result of co-designing the model with its runtime. TIMRUN lets the model shed explored-and-discarded code paths from context, focusing attention on the lines that matter. Together with computer- and browser-use, these results demonstrate that TIM engines deliver strong performance across a suite of computer tasks.
Inference and Memory Efficiency
Context compression reduces inference latency. On reasoning tasks that require more than 10,000 tokens to solve, the improvement is substantial. Serving an 80B-parameter model on 8 H100 GPUs, decoding with compression is almost 50% faster than a state-of-the-art open-source inference engine. With strong reasoning capability, our context-intelligent model solves complex reasoning tasks more efficiently while consuming less GPU memory.
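To see why pruning matters for memory, here is a back-of-envelope KV-cache estimate. All the configuration numbers (layers, KV heads, head dimension, and the 75% pruning fraction) are illustrative assumptions, not the actual architecture of the 80B model above.

```python
def kv_cache_bytes(tokens: int, layers: int = 64, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Rough KV-cache footprint for one sequence: a K and a V tensor per
    layer, each of shape [tokens, kv_heads, head_dim], in fp16/bf16.
    Config values are illustrative, not the real model's."""
    return 2 * layers * tokens * kv_heads * head_dim * dtype_bytes

full = kv_cache_bytes(50_000)    # a long agent trace, kept in full
pruned = kv_cache_bytes(12_500)  # same trace with 75% of entries evicted
print(f"{full / 2**30:.2f} GiB vs {pruned / 2**30:.2f} GiB")  # 12.21 vs 3.05
```

Because the footprint is linear in resident tokens, whatever fraction of the trace the model prunes translates directly into freed GPU memory and, in batched serving, higher decoding throughput.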

Memory Efficiency in Training
Context intelligence benefits more than inference: the same compression technology saved us 75% of our compute budget in LLM training. Traditionally, it is extremely expensive to train models on long-horizon agent traces. With standard training frameworks, each GPU node can barely handle a 16,384-token context budget, and we would have needed four nodes to train our model on 65,536-token agent traces. Our context compression technology solves the problem and 4x-ed our training efficiency. We are heavily dogfooding our context engine in training: we can now train on reasoning traces of over 65k tokens using a single GPU node, which significantly sped up our training and iteration, with massive cost savings.
This discovery will be extremely useful for post-training our system for specific agent use cases, and for customers with less compute.
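The training-memory arithmetic above checks out directly. A small sketch, where the 16,384-token per-node budget and the 65,536-token traces come from this post, the helper itself is hypothetical, and a 25% resident fraction is an illustrative mapping of the 4x efficiency gain:

```python
import math

def nodes_needed(trace_tokens: int, per_node_budget: int = 16_384,
                 resident_fraction: float = 1.0) -> int:
    """GPU nodes required if each node fits `per_node_budget` context
    tokens and only `resident_fraction` of the trace must stay resident
    (1.0 = no compression). Illustrative helper, not our training code."""
    resident = math.ceil(trace_tokens * resident_fraction)
    return math.ceil(resident / per_node_budget)

print(nodes_needed(65_536))                         # standard framework: 4 nodes
print(nodes_needed(65_536, resident_fraction=0.25)) # 75% pruned: 1 node
```

Keeping only a quarter of each trace resident is exactly the one-node-instead-of-four outcome described above for the same 65k-token traces.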

Looking Forward
Structured context management improves both performance and efficiency. This is not a tradeoff but a Pareto improvement: the same mechanism that saves tokens also makes agents better at their jobs.
What this means in practice: agents that can reliably complete long, multi-step workflows (filing insurance claims, onboarding into a new codebase, navigating government portals, managing inventory across systems) without degrading halfway through. Tasks that today require human supervision or manual checkpoints can become fully autonomous. And because TIM does this with fewer tokens, not more, the cost of running these agents drops as their reliability goes up. Efficient and accurate device use is what turns AI agents from demos into infrastructure: software that works unsupervised, at scale, on the tasks people actually need done.
We're scaling to harder benchmarks, longer horizons, and new domains. Stay tuned for more!