The End of the Agent Dev Paradox

November 7, 2025

Hongyin Luo

Co-Founder & CTO

What frameworks and tools do we need to build an AI agent? This is a trillion-dollar question. For us developers, there are a million options floating around the market. In the last few months we’ve spoken to over a hundred enterprises and vertical AI companies about this question, and we found that the vast majority are unsatisfied with their options and over 60% choose to build their own internal frameworks from scratch.

I am one of the agent developers confused by all the hype and noise. I’ve just found my way forward, so I’m writing down some opinions and hope they can be helpful for my developer friends:

  • The Problem: A painful paradox in agent development
  • The Reason: “No free lunch” trap created by agent frameworks
  • Our Fix: The model is not a brain; it's the swarm. This is what our TIMRUN v2 enables.

The Painful Paradox

To build AI agents from toys into production systems we face a paradox.

We want flexibility and rigidity, we want a do-it-all blackbox but with complete observability, and we want tools for fast prototyping and performance tuning all at the same time in the same agent.

We all have experienced this decision process: if we need more control in an AI application, we should build a workflow piecing together language model calls and data/tool access. If we want more flexibility or “intelligence”, building a ReAct agent might be a better choice.

Unlike traditional software systems, language models are nondeterministic. Balancing deterministic and nondeterministic behaviors, agent developers need to pick the framework that works in favor of their preferred trade-offs, and selecting the one that aligns with those trade-offs is painful.

Before, with deterministic, traditional software tools, any upgrade meant the system got stronger, and the developer could still fully control and observe their application. There were trade-offs between different software frameworks, but those design decisions mainly impacted functionality and the user experience.

For the first time, we have a technique that seems very flexible and capable but at the cost of giving up observability and control. Any design decision made around this trade-off will determine if the application is useful or trustworthy at all - and that’s very painful.

Flexibility and Control

Blog post image

If you’ve worked with a product manager or designer before, you’ve experienced this picture. This is how our product managers treat us, and we wish our AI agents could be instructed in the same way!

We were promised that language models can generalize to unseen tasks and that they can follow instructions. However, when we get to building agents, we are told that developer control hurts the model's flexibility.

As a result, most agents and agent frameworks are optimized for one of these preferences: flexibility or control. Some help developers build rigid workflows for fine-grained control: let's enumerate all possible states, transitions, and edge cases and put together a system with 1% non-hardcoded ingredients. My own Ph.D. thesis in 2021 has a chapter on piecing BERT and GPT-2 together in this way. It feels braindead if we have to do the same thing 4 years later, especially after AI labs burned $100 billion to gift us these huge, powerful models. We can do better.

The other approach is to give the model as much flexibility as possible: just let the model run in a chain-of-thought / ReAct framework, and waiting for frontier labs to produce stronger models is all you need! If we try to control the model, we are told we are being dumb, because "less control / structure, more intelligence". We can certainly cherry-pick some aha moments where this approach works, but how many times can you get the work done perfectly across 1,000 inputs in production?

The most visionary companies are exploring both routes. OpenAI's AgentKit asks for pre-defined workflows to execute agent loops (control), while the OpenAI Agents SDK went after the ReAct framework (flexibility). Either they want to cover different applications, or they have no idea whether there is one right direction to commit to.

Blackbox and Observability

Blog post image

We love it when language models handle very complex tasks and we don’t have to worry about or even understand the details. For example, summarizing a big book, or vibe coding a website.

Frontier language models are evaluated on extremely challenging tasks, hitting 100% accuracy or winning the gold medal at the IMO to prove their capability in complex reasoning. For example, the average reasoning length for solving AIME challenges is 16,000 tokens. Most people don't care about the exact reasoning trajectories. We just trust that the models are good at math since they achieve very high accuracy.

But can we trust IMO-winning models in our own agent tasks? Does our model read all the chapters in the book before generating the summary, or does it stop reading in the middle and hallucinate the ending? Does our model actually run the test code, or does it simply read the codebase and decide that it looks correct?

While the latest language models are trained as blackbox systems that solve complex tasks with scaled test-time reasoning, agent developers cannot use them as blackboxes in production when those questions become crucial.

Again, various products are designed for developers with different needs. Ideally, applications that only care about the output should settle on frameworks that embrace the blackbox nature of LLMs to make the best use of the flexibility of language reasoning. Other tasks, however, should lean on products that provide good observability. In theory, this should already have become the norm of agent development. However, very few teams are satisfied with any existing framework, so they still have to reinvent the wheel.

Fast Prototyping and Performance Tuning

Blog post image

It’s always fun and valuable to build prototypes and PoCs very quickly. My first agent project, Anchoring AI, supports vibe-generating agents with structured I/O. It did save me tons of time writing the initial prompts and workflows for prototype agents.

However, the one-sentence-generated agents are never satisfying, for several reasons. That "one sentence" is never dense enough to encode all my needs - otherwise, the actual system prompt of my agent could be compressed into one sentence and I wouldn't need an agent generator. As a result, I always need to dive into the generated system prompt and workflow and keep revising the details until the agent can handle both expected and unexpected user inputs.

When we start doing this, the value of fast prototyping quickly vanishes - my development flow converges to the regular struggles again, and the entire engineering process ends up costing me roughly the same amount of time.

We see so much value in fast prototyping and fast production. However, the assumption here is still that when a stronger language model is embedded into the "one-sentence agent generation" framework, the produced agent will be way better. In fact, the improvement made by frontier models is slowing down, and agent developers never wait. In live production environments, many engineers are building their own agent frameworks to solve the problems they face in agent development, without any fast prototyping attempts. The one-line agent doesn't scale up, and the custom-framework agent doesn't scale down.

The “No Free Lunch” Trap

As a developer building products for developers, I believe developers are never wrong about their needs. However, agent framework products treat us differently. Since no existing framework fulfills our ask, all of them tell us that the pain we are facing is inevitable. "There is no free lunch," meaning we have to choose: flexibility OR control, blackbox OR observability, fast prototyping OR performance tuning.

For most developers, after giving up some desirable properties to adopt a framework, they still can’t put together a capable agent. Accepting the “no free lunch” theory leads them into the “no lunch” maze:

  • give up flexibility for control → agent becomes too rigid for edge cases
  • give up control for flexibility → agent loop becomes inconsistent
  • give up observability for blackbox → agents become untrustworthy
  • give up blackbox for observability → hurts the reasoning ability
  • give up performance tuning for fast production → end up with a toy
  • give up fast production → wasting too much time reinventing the wheel

I struggled in this mud myself and have hit every wall. With traditional agent frameworks (think LangGraph, n8n), there are only two things I can really do:

  • Hype up my agent, try new frontier models, and cherry-pick examples
  • Keep patching prompts, workflows, and fallbacks to adapt to overwhelming edge cases

After all struggles and explorations, we finally located the core of these pains: CONTEXT.

Transformer-based LLMs are always bounded by context limits. Even for models supporting 1 million context tokens, it does not make sense to decode every token against a million cached context tokens. This means developers need a smarter strategy to manage the agent context.

Unfortunately, existing context engineering strategies are so intuitive that they actually lead developers into traps. There are two types of such mistakes, and they cover most of today's agents: workflows / multi-agents that piece together LLM calls with software, and ReAct / single-agents that rely on software (often involving other models) for context engineering.

Problems with ReAct / Single-agents

Chain-of-thought reasoning and the ReAct framework might be the most straightforward way to unlock the intelligence of language models in reasoning and using tools. All reasoning steps are appended to the message list, and the next steps are generated based on the ever-growing context.

However, auto-regressively increasing the context length leads to inefficiency and hallucination once the message list contains a relatively large number of reasoning / tool call steps. To work around this, the message list usually needs compressing: when the context length reaches a certain threshold, it is fed into a context summarization model and the application receives a compressed context. Agents built in this framework are flexible enough but lack control, observability, and tools to push reasoning performance.
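
To make the pattern concrete, here is a minimal sketch of such a loop. This is not any specific framework's API: `call_llm`, `run_tool`, and `count_tokens` are hypothetical helpers, and the token threshold is an assumption for illustration.

COMPRESSION_THRESHOLD = 60_000  # assumed token budget for this sketch, not a real model limit

def react_loop(messages, call_llm, run_tool, count_tokens, max_steps=50):
    """A toy ReAct loop with threshold-based context summarization."""
    for _ in range(max_steps):
        # Every step reasons over the full, ever-growing message list.
        step = call_llm(messages)  # assumed to return {"thought", "tool", "args", "final"}
        messages.append({"role": "assistant", "content": step["thought"]})

        if step.get("final") is not None:  # the model decided it is done
            return step["final"]

        # Tool observations are appended too, so the context only grows.
        observation = run_tool(step["tool"], step["args"])
        messages.append({"role": "tool", "content": observation})

        # Once the context crosses the threshold, hand the transcript to a
        # summarizer and continue from the compressed version. This is the
        # step where observability and fine-grained control are usually lost.
        if count_tokens(messages) > COMPRESSION_THRESHOLD:
            summary = call_llm([{
                "role": "user",
                "content": "Summarize this agent transcript:\n" + str(messages),
            }])
            messages = [messages[0],  # keep the system prompt
                        {"role": "assistant", "content": summary["thought"]}]
    raise RuntimeError("agent loop did not terminate")

Everything after the compression step reasons over a lossy summary, which is why debugging these agents so often turns into prompt archaeology.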

Problems with Workflows / Multi-agents

Modularizing the system is another intuitive way to work around the context issue. Since the context of one language model is limited and the agent needs to process a longer context, distributing context tokens across different language models sounds like a smart solution, given that all language models are "intelligent". The interactions between models are pieced together via pre-defined roles and hardcoded state transition logic, so each module only has to focus on its own concise context.
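
For contrast, here is a minimal sketch of what such a hardcoded workflow typically looks like. `call_llm`, `search`, and `fetch_page` are hypothetical stand-ins for whatever model client and tools the workflow glues together, and the three-stage layout is just one common arrangement.

def multi_agent_workflow(question, call_llm, search, fetch_page):
    """A toy, hardcoded research workflow: fixed roles, fixed transitions."""
    # Stage 1: a "planner" model turns the question into search queries.
    queries = call_llm(
        "You are a planner. Write 3 web search queries, one per line, for: " + question
    ).splitlines()

    # Stage 2: a "reader" model sees one page at a time, so no single
    # context ever explodes -- but the top-3 cutoff is hardcoded forever.
    notes = []
    for query in queries:
        for url in search(query)[:3]:
            page = fetch_page(url)
            notes.append(call_llm(
                "You are a reader. Extract facts relevant to '" + question + "' from:\n" + page
            ))

    # Stage 3: a "writer" model sees only the distilled notes.
    return call_llm(
        "You are a writer. Answer '" + question + "' using only these notes:\n" + "\n".join(notes)
    )

Every state, transition, and edge case lives in this Python code rather than in the model, which keeps each context small but also keeps the behavior rigid.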

Such multi-agent frameworks prevent exploding context and have also sparked a discussion around multi-AI collaboration. However, software-defined tool and context orchestration is too rigid. Under this setting, language models lose the flexibility to adapt to different inputs and instructions and to work around edge cases. When the agent needs to interact with a large number of tools, explicitly building such a workflow becomes overwhelming for developers.

Since it’s impossible for software-defined context/tool orchestration to provide control/observability and flexibility/generalization at the same time, developers always run into the trap and end up dealing with the "no free lunch" problem by putting together LLM API calls from scratch.

Launching TIMRUN v2: Model-driven Orchestration

Developers always need seemingly contradictory capabilities in one agent, so they are never satisfied with a product designed for only one side of the trade-offs. For instance, in a browser-based deep research agent, I want to give the model full flexibility to use the search engine and whatever URLs it finds relevant, but I also need to force it to scroll to the end of each webpage while reading to confirm whether any helpful information can be found.

With traditional agent dev tools, developers need a ReAct-based framework to chain up the search and web reading tools, plus a rigid workflow for the web reading and summarization module to ensure competent web reading. This brings two challenges. First, they have to either merge two different agent-building schemes into one product or build their own wheel to support each module. Second, building the web reading tool as a pre-defined workflow means giving up the simplicity and flexibility to handle both single-hop and multi-hop information extraction tasks. It might seem reasonable to just ask for this behavior in the prompt and build the web reading module in a ReAct framework as well, but we found the model cannot consistently follow the instruction because of the context anxiety problem.

We present TIMRUN v2 to solve this pain for developers. Agents built with it are no longer bothered by the paradox or its trade-offs. Developers are now equipped with:

  • Flexibility AND control
  • Blackbox AND observability
  • Fast prototyping AND efficient performance tuning

TIMRUN v2 serves a thread inference model (TIM) that is trained to (1) orchestrate its own context and tool calls and (2) generate recursive reasoning / agent loops guided by constrained decoding. Our system eliminates the need for an agent framework layer that constrains both model intelligence and development efficiency. Instead, agent building becomes one language model call:

from subconscious import Client

deep_research_client = Client(
    base_url='https://api.subconscious.dev/v1',
    api_key='Get API KEY from https://subconscious.dev'
)

deep_research_client.build_toolkit(...)  # provide search and web reading tool schemas

messages = [
    {
        "role": "system",
        "content": "You are a research agent that uses tools to gather information and provide a comprehensive answer to complex research questions. Use the search tools to find the urls for relevant articles, then use the reader tool to read and extract key information. Finally, synthesize the information to provide a well-rounded answer."
    },
    {
        "role": "user",
        "content": f"Research Question: {research_question}"
    }
]

response = deep_research_client.agent.run(messages)

This LLM API call completes a deep research agent loop with multi-hop tool use, ensuring fast prototyping for any agent. The reasoning process is logged and visualized on our web portal.

Blog post image

There is no pre-defined, software-driven context and tool orchestration. The TIM model generates its own reasoning and control flow, which ensures that model inference is always based on the optimal working memory. More details of our implementation can be found in our technical report.

Some other services also support server-side tool calling. However, developers are granted minimal observability and control, since those systems are implemented with the ReAct framework plus pre-built context compression. Pushing agent performance also becomes challenging, since the only available tuning knob is prompt engineering.

Building a Human-in-the-loop Agent with One LLM API Call

With TIMRUN v2, developers get full control and observability. Besides treating the entire agent as a blackbox, we can explicitly and efficiently control the reasoning structure:

  • define the task hierarchy, aka context compression logic
  • define available tools for each subtask
  • define where we need full control, where to let the model go ReAct
  • give the model a “Human Feedback Tool” to collect user input during the agent loop
from typing import Tuple, List

search_task = deep_research_client.create_task(
    task_name='search',
    tools=('SearchTool',)
)  # Returns 20 candidate webpages

feedback_task = deep_research_client.create_task(
    task_name='human_feedback',
    tools=('HumanFeedbackTool',)
)  # Rerank, pick the top 5, and let the user decide which ones are actually relevant

get_relevant_results_task = deep_research_client.create_task(
    task_name='get_relevant_results',
    subtasks=Tuple[search_task, feedback_task]
)  # After each search, **ALWAYS** let the user decide the best search result

read_task = deep_research_client.create_task(
    task_name='web_reading',
    tools=['ReaderTool']
)  # Define a task that only uses the web reading tool

summary_task = deep_research_client.create_task(
    task_name='summarization',
    thought='Analyze the web reading result and find out if there is any information that needs further search to expand.',
)  # Define a reasoning step that encourages "deep" research

rs_task = deep_research_client.create_task(
    task_name='research_step',
    subtasks=Tuple[read_task, summary_task]
)  # Each research step strictly contains two subtasks.
   # Web scraping results get pruned after each step is completed

search_attempt = deep_research_client.create_task(
    task_name='search_attempt',
    subtasks=Tuple[get_relevant_results_task, List[rs_task]]
)  # Search and human feedback followed by any number of research steps

thread = deep_research_client.create_thread(
    reasoning_model=List[search_attempt],
    answer_model=str
)  # The deep research task calls the search engine as many times as it needs
   # Creates the "default/default" (agent/thread_name) thread in the client

response = deep_research_client.agent.run(messages)  # uses the default thread


This code creates a deep research agent with both flexibility and human-in-the-loop, fine-grained control built in. On flexibility, we allow it to issue as many search tool calls as needed and to read a flexible number of relevant web pages after each search. On control, we force the model to always ask for user feedback after using the search tool. In addition, we easily append a reasoning step that encourages the agent to find information in the webpages worth deeper exploration after each web reading. Context engineering is also handled automatically: subtasks are pruned from the attention cache during the reasoning process, so the model always maintains the most helpful working memory. TIMRUN v2 is the only agent dev tool supporting both flexibility and control within a very efficient dev experience.
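
To build intuition for what pruning does to the context, here is a toy, message-level illustration. The dictionary shapes and the `prune_completed_subtask` helper are invented for this sketch; TIMRUN itself performs pruning inside the attention cache at inference time rather than on message lists.

def prune_completed_subtask(working_memory, name, conclusion):
    """Replace a finished subtask's full trace with its short conclusion."""
    pruned = [m for m in working_memory if m.get("subtask") != name]
    pruned.append({"role": "assistant",
                   "content": "[" + name + " done] " + conclusion})
    return pruned

# Example: a web_reading subtask produced a long scraped page plus a summary.
memory = [
    {"role": "user", "content": "Research question ..."},
    {"role": "tool", "subtask": "web_reading", "content": "<20k tokens of scraped HTML>"},
    {"role": "assistant", "subtask": "web_reading", "content": "Key facts: ..."},
]
memory = prune_completed_subtask(memory, "web_reading", "Key facts: ...")
# Later reasoning attends only to the distilled conclusion, not the raw page.

The effect is that the working memory stays roughly the size of the task hierarchy you declared, no matter how many pages the agent reads along the way.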

Each agent is also observable from our web portal (https://subconscious.dev/platform/logs). There are no hidden details, yet the reasoning traces remain easy to trace thanks to our structured inference model and runtime.

Blog post image

The actual agent reasoning flow generated by TIMRUN v2 strictly aligns with the structure we defined in the code:

  • The control we asked for
    • Always ask for user feedback (highlighted) after using the search engine
    • Always call the reader tool in subtasks so their results will be pruned from future reasoning
    • Always think about whether any information mentioned in the web pages can be further explored
  • The model is given the flexibility to use the search tool as many times as needed. However, even after 20 reasoning steps, the model still follows the instruction to call the user feedback tool after each search.

The implementation of this deep research agent is publicly available in our quick start repo. Install our Python SDK, get your API key, and kick off fully controlled, observable, and flexible agentic inference!

Want to give it a try?

Build your first agent