Inference Systems Designed for Agents
Built for agents, not chatbots. Our inference runtime supports more concurrent workloads, faster throughput, and longer-context reasoning. Bring your own models or use ours.
Intelligent, on-GPU context compression tuned for agents
Enhanced open models
Use open models, enhanced for agentic workloads
The same open model goes further on our inference system, so you get the best capability for less.
Choose an open model, or bring your own. Enhance it with our inference system.
A chatbot writes one answer. An agent keeps working, often for hours, across millions of tokens and hundreds of steps.
Agents are the most valuable way to leverage AI, but they're the hardest workload to run reliably, quickly, and affordably. Our inference system makes efficient, capable, and fast agentic work possible.
Performance benchmarks
Measurable improvements with our TIMRUN runtime
2.3×
Concurrent workloads on the same hardware
TIMRUN enables more agents to run on the same GPUs. More concurrency means better economics.
3.5×
Faster token throughput with long context
TIMRUN moves far more tokens through the same GPU, especially with 200k+ tokens of context.
10×
Context window extension
TIMRUN enables agents to understand more data. It automatically compresses the KV cache at runtime, so the model stays reliable on long tasks instead of degrading mid-run.
TIMRUN vs. vLLM, measured on identical hardware and models
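To see why a smaller KV cache translates into more concurrent agents, it helps to run the memory arithmetic. The sketch below uses the standard KV cache size formula for a transformer with grouped-query attention. The model shape, the 80 GiB budget, and the 4x compression ratio are hypothetical values chosen for illustration; they are not TIMRUN measurements and do not describe how TIMRUN compresses.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache held by one sequence.

    Each token stores one key and one value vector (the leading factor of 2)
    per layer and per KV head, at bytes_per_value bytes per element.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len


def max_concurrent_sequences(budget_bytes, per_sequence_bytes):
    """How many sequences of the given size fit in the memory budget."""
    return budget_bytes // per_sequence_bytes


# Hypothetical values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
CONTEXT_TOKENS = 200_000
BUDGET_BYTES = 80 * 1024**3   # GPU memory assumed to be left over for KV cache
COMPRESSION_RATIO = 4         # assumed ratio, not a measured TIMRUN figure

full = kv_cache_bytes(CONTEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM)
compressed = full // COMPRESSION_RATIO

print(f"KV cache per sequence, uncompressed: {full / 1024**3:.1f} GiB")
print(f"Sequences that fit, uncompressed:    {max_concurrent_sequences(BUDGET_BYTES, full)}")
print(f"Sequences that fit, compressed:      {max_concurrent_sequences(BUDGET_BYTES, compressed)}")
```

Under these assumed numbers a single long-context sequence nearly fills the budget, so any reduction in per-sequence cache size shows up directly as extra concurrency.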
Deployed wherever you run agents
Power your agents with fewer GPUs and get better performance.
For
Coding Agents
Power coding agents with our API.
Your engineering team needs coding agents. Now you can get frontier performance from models hosted on-premises. Swap the base URL and the same agents run faster on our TIM models and TIMRUN runtime, at a fraction of frontier pricing.
For
Agentic Products
Power agentic products with our API.
You want to ship agents to customers. Our efficient runtime keeps AI costs nearly linear instead of quadratic across multi-step processes, so you can move beyond an internal demo, launch to real users, and turn a profit.
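The linear-versus-quadratic difference can be illustrated with a simplified token count. The sketch below compares an agent that re-reads its whole transcript on every step with one whose context is held under a fixed cap. The function names, step size, and cap are hypothetical, prefix caching is ignored, and this is not a model of how TIMRUN itself works or is priced.

```python
def tokens_full_history(steps, tokens_per_step):
    """Total tokens read when every step re-reads the entire transcript so far.

    Step k reads k * tokens_per_step tokens, so the total grows with steps squared.
    """
    return sum(step * tokens_per_step for step in range(1, steps + 1))


def tokens_bounded_context(steps, tokens_per_step, context_cap):
    """Total tokens read when each step sees at most context_cap tokens.

    Once the cap is reached, every further step costs the same, so the total
    grows linearly with the number of steps.
    """
    return sum(min(step * tokens_per_step, context_cap) for step in range(1, steps + 1))


# Hypothetical workload: 2,000 new tokens per step, context capped at 20,000 tokens.
TOKENS_PER_STEP = 2_000
CONTEXT_CAP = 20_000

for steps in (100, 200):
    full = tokens_full_history(steps, TOKENS_PER_STEP)
    bounded = tokens_bounded_context(steps, TOKENS_PER_STEP, CONTEXT_CAP)
    print(f"{steps} steps: full history {full:,} tokens, bounded context {bounded:,} tokens")
```

Doubling the number of steps roughly quadruples the full-history total but only about doubles the bounded one, which is the gap that matters once an agent runs for hundreds of steps.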
For
Edge Devices
Run capable agents on edge devices, for the first time.
You already have a GPU. Take our runtime alone or paired with our post-trained models, and run capable agents completely on-device. These workloads are not possible in memory- and compute-constrained environments without us.
For
Inference Clusters
A drop-in replacement for vLLM and SGLang.
You run inference on GPUs you control. Swap Subconscious in where your serving engine sits today and the same fleet can run more concurrent agents, finish long jobs that used to degrade, and push tokens out faster. That means more revenue per GPU for clouds and more capacity for enterprises.
Subconscious visualized
Longer runs and more concurrency on the same GPUs.
The runtime manages memory and context during long agent runs.
The gains land exactly where agentic workloads hurt the most.
Handle millions of tokens with context management at runtime.
With the highly efficient Subconscious Cache, save 10x on cost at scale.
Run 2.3x as many workloads on the same compute footprint.
Sustain 3.5x faster token throughput through deep reasoning chains.
vLLM
Limited to the model's context window, even with compaction. Long tasks either carry unnecessary context or compact away information they need.
Subconscious
Processes millions of tokens per run. The system compresses its context at runtime, so accuracy stays high at any scale.
Try the API
Integrate in three lines of code.
To get started quickly, we serve our own TIM models and TIMRUN runtime behind OpenAI- and Anthropic-compatible APIs.
from openai import OpenAI
client = OpenAI(
    base_url="https://api.subconscious.dev/v1",  # Step 1: point to our API
    api_key="YOUR_API_KEY",  # Step 2: add your API key
)
response = client.chat.completions.create(
    model="subconscious/tim-qwen3.6-27b",  # Step 3: use one of our hosted models
    messages=[
        {
            "role": "user",
            "content": "Write a landing page geared towards developers in Boston."
        }
    ],
)
print(response.choices[0].message.content)

from anthropic import Anthropic
client = Anthropic(
    auth_token="YOUR_API_KEY",
    base_url="https://api.subconscious.dev",
)
message = client.messages.create(
    model="subconscious/tim-qwen3.6-27b",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Write a landing page geared towards developers in Boston."
        },
    ],
)
print(message.content[0].text)

import OpenAI from 'openai';
const client = new OpenAI({
  baseURL: 'https://api.subconscious.dev/v1',
  apiKey: 'YOUR_API_KEY',
});
const response = await client.chat.completions.create({
  model: 'subconscious/tim-qwen3.6-27b',
  messages: [
    {
      role: 'user',
      content: 'Write a landing page geared towards developers in Boston.'
    },
  ],
});
console.log(response.choices[0].message.content);

import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic({
  authToken: 'YOUR_API_KEY',
  baseURL: 'https://api.subconscious.dev',
});
const message = await client.messages.create({
  model: 'subconscious/tim-qwen3.6-27b',
  max_tokens: 1024,
  messages: [
    {
      role: 'user',
      content: 'Write a landing page geared towards developers in Boston.'
    },
  ],
});
console.log(message.content[0].text);

curl https://api.subconscious.dev/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "subconscious/tim-qwen3.6-27b",
    "messages": [
      {
        "role": "user",
        "content": "Write a landing page geared towards developers in Boston."
      }
    ]
  }'

Works anywhere you use the OpenAI chat completions or Anthropic messages APIs, including LangChain, Mastra, Agno, Vercel AI SDK, n8n, OpenCode, Claude Code, Codex, Pi, OpenHands, and many more.
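If you would rather not install an SDK, the same request can be built with Python's standard library alone. The sketch below mirrors the curl example: the endpoint, model name, and headers come from the snippets above, while the helper names `build_chat_request` and `send` are hypothetical, and the response parsing assumes the reply follows the OpenAI chat completions shape.

```python
import json
import urllib.request

BASE_URL = "https://api.subconscious.dev/v1"
MODEL = "subconscious/tim-qwen3.6-27b"


def build_chat_request(prompt, api_key, base_url=BASE_URL, model=MODEL):
    """Build (but do not send) a chat completions request.

    Hypothetical helper that mirrors the curl example: same URL, headers, and body.
    """
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the first choice's text.

    Assumes the response body follows the OpenAI chat completions format.
    """
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]["content"]


request = build_chat_request(
    "Write a landing page geared towards developers in Boston.",
    "YOUR_API_KEY",
)
print(request.get_method(), request.full_url)
# With a real API key in place, call send(request) to get the model's reply.
```

Keeping the request builder separate from the network call makes it easy to inspect or log exactly what will be sent before spending tokens.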
Questions
Frequently asked questions
- TIMRUN is our specialized inference runtime designed for agent workloads. It caches and compresses tokens aggressively during processing, directly on the GPU. As a result, our system extends the context window of the models it serves by 10x, runs 2.3x as many concurrent workloads on the same hardware, and sustains 3.5x faster token throughput where general-purpose runtimes slow down. TIMRUN can serve LLMs, SLMs, and multimodal models.
- Most likely. TIMRUN is compatible with LLMs and multimodal models. Our team has experimented extensively with the Qwen, GLM, Nemotron, and Kimi models, and many more open and closed models are compatible.
- Subconscious GLM-5.2 is the open-source GLM-5.2 model served on our TIMRUN runtime, built for agentic coding. It is available via an OpenAI- and Anthropic-compatible API.
- TIM-Qwen3.6-27B is our post-trained small language model running on our TIMRUN inference system. We took the already powerful Qwen3.6 27B model and significantly improved its capabilities with TIMRUN and our post-training process. We offer this system via an API compatible with the OpenAI chat completions and Anthropic messages formats.
- Yes. The API supports the OpenAI chat completions and Anthropic messages formats. If you have code that already uses the OpenAI or Anthropic SDK, you can point it at our endpoint and try our system with three lines of code.
- Yes. Our API is compatible with any tool that uses the OpenAI chat completions or Anthropic messages format. Our documentation has pointers to get you started.
- Yes. Any framework that uses the OpenAI chat completions or Anthropic messages API works with Subconscious. Swap in our base URL and API key and you are up and running.
- Yes. TIMRUN compresses context aggressively without losing reasoning quality, which changes the math on edge AI. With TIMRUN, the same device can run a larger model, complete longer-context tasks, or do work that simply was not possible before. We are currently running on workstations like the Nvidia DGX Spark, laptops, and even mobile devices like iPhones and Samsung Galaxy phones. Sign up for our platform and head to the local devices tab to learn more.
- Yes. We open source our post-trained models on Hugging Face. We do not, however, open source our proprietary inference runtime TIMRUN.
- Yes. We offer dedicated GPU infrastructure with no rate limits, optional post-training on your tools and data, and custom SLAs. Sign up for our platform and head to the dedicated endpoint tab to get started.
- No. Subconscious is a runtime optimization and post-training company. We develop our TIMRUN runtime and the TIM family of post-trained models. We take open models and post-train them on-policy with our TIMRUN runtime to improve their reasoning ability. We also help specific customers post-train models on their own data and tooling.
Get started
Make your GPUs go further.
Same models, same hardware, more agentic workloads per GPU. Run Subconscious where your agentic workloads already live.