Agents

Make Agent Frameworks Trainable

August 27, 2025

Hongyin Luo

Co-Founder & CTO

The Pain

Agent frameworks cannot generalize.

A bigger number won't make it easier

In 2020, two years before ChatGPT, I pitched at MIT Sandbox the idea that “any NLP task can be formulated as question answering”, and developed a multi-agent system that generated synthetic data for training question-answering models [1]. The core of this system was generating challenging questions and automatically pairing them with high-quality answers.

I didn’t choose to make the system multi-agent; I had to, because of the limited capability of language models at the time. I used ELECTRA (a BERT-like model) for entity recognition and answer extraction, and BART (an alternative to GPT-2) for question generation. I post-trained these small models for different purposes and then chained them together with software. It took me months to build and debug, but I was still excited, because multi-agent collaboration and synthetic data were both brand-new topics at that time.

A pipeline of answer recognition, question generation, and confidence estimation.

In 2025, nearly three years after ChatGPT, language models have evolved significantly. GPT-style LLMs have taken over almost all NLP tasks, and tons of agent frameworks have been built for developers. This year, I developed the “Cursor for data engineers”, a system that transforms any unstructured dataset into indexed, agent-ready knowledge bases. I expected a multi-agent system to be much easier to build than in 2020, given the much stronger models.

LLM evolution doesn't make agent dev easier.

I was totally wrong. The agent-building experience has improved 0% while LLMs have evolved, and it might even be worse than in 2020. The complexity of context sharing, tool orchestration, and debugging took off during development and went far beyond my threshold.

More promises, more pains.

After spending a few very painful months trying different multi-agent frameworks and finding none of them helpful enough, I ended up completing a single-agent prototype inspired by Thread [2], shown in the following figure. This framework maintains only one message list for one agent, but manages the context through subtask decomposition and pruning. Controlled by few-shot examples and special tokens, this agent was able to use the data processing and indexing tools and handle test cases across a few different areas; a minimal sketch of the pattern follows the figure below.

Thread reasoning framework. Tasks are recursively decomposed.
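
To make the pattern concrete, here is a minimal Python sketch of the single-message-list idea. The marker scheme and class names are mine for illustration, not the actual prototype's code:

```python
# A minimal sketch of Thread-style context management: one message list,
# subtasks opened with markers and pruned down to a summary when done.
class ThreadAgent:
    def __init__(self):
        self.messages = []  # the single shared message list
        self.stack = []     # goals of open subtasks, innermost last

    def spawn(self, goal: str):
        """Open a subtask inside the same message list."""
        self.stack.append(goal)
        self.messages.append(("spawn", goal))

    def complete(self, summary: str):
        """Close the innermost subtask: drop its spawn marker and all
        intermediate messages, keeping only a distilled result."""
        goal = self.stack.pop()
        idx = self.messages.index(("spawn", goal))
        self.messages = self.messages[:idx]
        self.messages.append(("result", summary))

agent = ThreadAgent()
agent.spawn("index the dataset")
agent.messages.append(("tool", "chunk_and_embed(raw_docs)"))  # intermediate step
agent.complete("dataset indexed into a searchable knowledge base")
print(agent.messages)  # only the distilled result survives the pruning
```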

Could I make Thread another agent framework? No. Thread did not solve the core problem that bottlenecks agent building: context sharing and tool calls still need careful orchestration, and the task decomposition strategy relies on task-specific prompt engineering. Developers working with Thread might still spend many hours reinventing the wheel instead of focusing on their customers.

I realized that the most foundational limit of all agent frameworks is that they are software. Software cannot be post-trained for adaptation, and it always needs manual configuration to deal with edge cases. Using a predefined agent framework is like hiring an external manager to lead a technical team: sometimes it works, but in more cases it doesn't.


Our Cure

Agent frameworks cannot be trained, so we burn them into neural language models with reinforcement learning.

Neural agents > bag-of-agents.

Neural networks can be fine-tuned and easily adapted to target applications. We built a model that generates Thread-style reasoning trajectories, named the Thread Inference Model (TIM). Our motivation and vision are both very simple:

  • Neural networks outperform feature engineering, and neural agents outperform framework engineering in both flexibility and capability.
  • Word embeddings are more efficient than bag-of-words feature engineering, and neural agents are more efficient than bag-of-agents framework engineering.
  • A neural agent framework can easily adapt to edge cases, while software frameworks require expensive manual effort to cover all possible situations.

Neural agents are more powerful at handling unexpected inputs.

With these thoughts, we built a neural agent that automatically manages its own context. Inspired by Thread, our model generates recursive reasoning trees and prunes completed subtasks. The model computes task-aware sparse attention over only the helpful context, improving both reasoning accuracy and efficiency.

For example, how do you complete a Ph.D. program? The process is complicated, but once the thesis is approved, we know the next step is going to commencement, without computing an attention distribution over the entire journey. In LLM reasoning, many challenging problems can be decomposed into subtasks. When a set of subtasks is completed, the higher-level node can prune its subtasks and kick off the next step without information loss.
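
Here is a small sketch of that pruning logic; the names (TaskNode, visible_context) are illustrative and not TIM's actual API:

```python
# Recursive reasoning tree: completed subtrees collapse into a summary,
# so later steps attend only to live context. Illustrative, not TIM's code.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TaskNode:
    goal: str
    children: list["TaskNode"] = field(default_factory=list)
    result: Optional[str] = None  # filled in when the subtask finishes

    def prune_completed(self):
        """Once every child has a result, collapse them into this node's
        result so later reasoning never attends to their internals."""
        if self.children and all(c.result for c in self.children):
            self.result = "; ".join(c.result for c in self.children)
            self.children = []

def visible_context(node: TaskNode) -> list[str]:
    """What the model attends to: results of pruned subtrees plus goals
    of still-open tasks, instead of the entire reasoning history."""
    if node.result and not node.children:
        return [node.result]
    out = [node.goal]
    for child in node.children:
        out.extend(visible_context(child))
    return out

phd = TaskNode("complete a Ph.D.")
thesis = TaskNode("write thesis", children=[
    TaskNode("run experiments", result="experiments done"),
    TaskNode("write chapters", result="chapters approved"),
])
phd.children = [thesis, TaskNode("attend commencement")]
thesis.prune_completed()
print(visible_context(phd))
# ['complete a Ph.D.', 'experiments done; chapters approved', 'attend commencement']
```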

In practice, this context engineering is implemented natively on the GPU during model inference. We built an inference runtime that dynamically prunes the KV cache during decoding, as shown in the following figure. When the model is decoding a token in Task 2, Tasks 1.1.1 and 1.1.2 have already been pruned, including the tool call in 1.1.1, yet the model still maintains the context it needs to continue reasoning. A rough sketch of the pruning step follows the figure.

Subtask decomposition and pruning strategy of the Thread Inference Model (TIM)
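
As a rough illustration of what the runtime does, the sketch below drops cached key/value entries for pruned subtask spans. The (batch, heads, seq_len, head_dim) layout and the helper name are assumptions for this example, not our production kernel:

```python
# Simplified KV-cache pruning at decode time: once a subtask's token span
# is pruned, its cached keys/values are removed so later tokens never
# attend to them and per-token cost tracks the live context only.
import torch

def prune_kv_cache(keys, values, spans_to_prune):
    seq_len = keys.shape[2]
    keep = torch.ones(seq_len, dtype=torch.bool)
    for start, end in spans_to_prune:  # token ranges of pruned subtasks
        keep[start:end] = False
    return keys[:, :, keep, :], values[:, :, keep, :]

# Example: one layer's cache with 10 tokens; prune tokens 3..7
# (say, Task 1.1.1 and its tool call) before decoding Task 2.
k = torch.randn(1, 8, 10, 64)
v = torch.randn(1, 8, 10, 64)
k, v = prune_kv_cache(k, v, spans_to_prune=[(3, 7)])
print(k.shape)  # torch.Size([1, 8, 6, 64]) -- only live context remains
```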

To avoid pruning the wrong context and to improve generalization across different tools, we trained the model on over 16k tools and improved its task decomposition strategy with reinforcement learning. The model learns the context engineering method and adapts its reasoning policy accordingly. To our surprise, the model outperforms strong baselines that do no context pruning on complex reasoning tasks, including AIME 2024 and GPQA Diamond, demonstrating the effectiveness of intelligent context engineering.
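
As a purely hypothetical illustration (not our actual reward design), a shaped RL signal for decomposition and pruning decisions could combine task correctness, pruning safety, and token efficiency:

```python
# Hypothetical reward shaping for RL over decomposition/pruning decisions.
# All terms and weights here are illustrative assumptions.
def reward(task_solved: bool, pruned_needed_context: bool,
           tokens_used: int, token_budget: int = 8192) -> float:
    r = 1.0 if task_solved else 0.0                   # primary: correctness
    if pruned_needed_context:                         # destructive pruning
        r -= 0.5                                      # is penalized
    r -= 0.1 * min(tokens_used / token_budget, 1.0)   # mild efficiency term
    return r

print(reward(task_solved=True, pruned_needed_context=False, tokens_used=2048))
# 0.975 -- solved, pruned safely, used a quarter of the token budget
```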

The Future of Agent Development

  • No software orchestrating different LLMs and tools. The LLM automatically calls tools.
  • No worries about context engineering. TIM handles your context intelligently.
  • No dragging and linking for workflow engineering. One prompt nails it.
  • No redundant payments for cached tokens. One LLM inference handles multiple interactions.

Specialized Tools + Post-trained TIM = AGI

Want to give it a try?

Start building agents