Better models are not enough. The next AI race is about doing the same work with fewer tokens, fewer loops, and less waste.
Why token efficiency is becoming the real bottleneck
For a long time, the AI industry had one clean answer: make the model stronger.
That answer still matters. But it no longer solves the whole problem.
A stronger model can solve harder tasks. But if every task needs a long prompt, ten tool calls, five retries, and a giant context window, the real cost still climbs fast. OpenAI explains that API usage can include input tokens, output tokens, cached tokens, and reasoning tokens. That means users are not only paying for the visible answer. They may also pay for the process that gets the model there.
In short, the industry does not just need smarter models. It needs models and agents that waste fewer tokens.
What did Luo Fuli get right about token cost?
Luo Fuli’s argument about token cost was sharp because she did not stop at “tokens are expensive.” That is obvious now.
Her deeper point was about mismatch. Global compute supply is not growing fast enough to match the token demand created by agents. In a recent AI builder discussion reported by 36Kr, several speakers made the same point: agent systems may raise human productivity by 10 times, but the compute demand behind them can grow by 100 times. Luo Fuli’s view was that limited compute needs to be stretched further through more efficient model architecture and more efficient long context inference.
That sounds technical, but the business meaning is simple.
- The model side must become more efficient. The same token budget should finish more work.
- The agent side must become less wasteful. The same task should need fewer calls and shorter context.
- The infrastructure side must become more transparent. Teams need to know where tokens are burned before the monthly bill arrives.
I agree with this more after seeing real agent workflows. Some open source agent frameworks look very professional. But in practice, they sometimes behave like a meeting that should have been an email. A task that should take two or three model calls gets stretched into more than ten calls. Every call brings a long prompt. Every step sounds logical. The final bill does not.

Agents are token amplifiers
A chatbot spends tokens once. An agent spends tokens in loops.
It plans. It calls tools. It reads results. It corrects itself. It plans again. Then it repeats.
That loop is powerful, but it is also dangerous. Once the loop becomes loose, the token cost can grow much faster than the task value.
A recent paper on agentic coding tasks studied trajectories from eight frontier models. The authors found that agentic tasks consumed about 1000 times more tokens than code reasoning and code chat. They also found that runs on the same task could differ by up to 30 times in total tokens. Even worse, higher token usage did not always bring higher accuracy.
| Workflow type | Token behavior | Real problem |
|---|---|---|
| Single chat | One request and one answer | Cost is easier to predict |
| Coding assistant | Several turns with context | Cost rises with files and retries |
| Agent workflow | Planning, tool calls, results, replanning | Cost can multiply fast |
| Poor agent loop | Repeated context and weak retries | More tokens do not always mean better output |
The key difference between a good agent and a wasteful agent is not how many steps it takes. It is how many steps actually move the task forward.
Bad agents feel like bad meetings
Token waste reminds me of low quality company meetings.
The meeting is scheduled for one hour. The real decision takes ten minutes. The rest is greeting, waiting for late people, repeating background, restating the same opinion in different words, drifting into side topics, and saving five minutes for someone to “summarize the spirit.”
We have all been in that meeting. We all hate that meeting.
Some agents work the same way.
They repeat context. They over explain plans. They call tools before narrowing the problem. They read long files when a short excerpt would work. They retry without learning much. They ask the model to think again when the task only needs a clean action.
More tokens can mean deeper thinking. But many times, more tokens just mean more wandering.
Why Elephant Alpha caught attention
This is why Elephant Alpha is interesting.
OpenRouter describes Elephant Alpha as a 100B parameter text model focused on intelligence efficiency. It supports a 256K context window, up to 32K output tokens, function calling, structured output, and prompt caching. OpenRouter says it is suited for code completion, debugging, rapid document processing, and lightweight agent interactions.
36Kr also reported that Elephant Alpha reached more than 185 billion tokens of invocation volume in less than 48 hours after appearing on OpenRouter. The report said its average speed on OpenRouter reached 67 tokens per second, with first token latency around 0.89 seconds.
We should not worship an anonymous model too quickly. We do not know the lab behind it. We do not know whether its early performance will survive heavy real world use. We do not know what its final pricing will look like.
But the direction matters.
Elephant Alpha is not only interesting because it may be fast. It is interesting because it treats intelligence per token as a product feature.
Cheaper tokens are not enough
Many teams think the answer is cheaper tokens.
That helps, but it is not the full answer. Buying cheaper tokens for a wasteful agent is like buying cheaper gasoline for a car with a leaking tank. You save a little on each unit, but the system still leaks.
Teams should ask harder questions.
- Why does this agent carry the full conversation history every time?
- Why does a simple extraction task need five paragraphs of instructions?
- Why does the workflow call a model again when a rule based check would work?
- Why does the agent retry the same bad plan instead of changing strategy?
- Why do we measure final output quality but ignore the tokens wasted before that output?
Token efficiency is not just cost control. It is product design.
A short, stable, accurate agent feels better than a dramatic agent that burns tokens to look smart. Users do not care how many internal speeches the system gave itself. Users care whether the job got done.
A practical way to think about token waste
Not every task deserves a huge context window. Not every task deserves a frontier model. Not every failure deserves three retries.
| Task type | Better strategy | Why it saves tokens |
|---|---|---|
| Simple rewriting | Use a smaller model | The task has low risk |
| Classification | Use a short prompt and strict output | The answer format is simple |
| Data extraction | Use schema and validation | The model does not need to explain |
| Coding task | Use a strong coding model only where needed | High value steps get better compute |
| Agent planning | Limit tools and context | The loop stays focused |
| Batch processing | Use cheaper routes and batch mode | Speed matters less than cost |
The smartest API teams will not be the teams that always use the cheapest model. They will be the teams that route each task to the right model with the right token budget.
How PP API fits into this problem
Once token efficiency becomes important, teams need two things: better routing and better visibility.
PP API fits this need because it gives teams one API for multiple model providers. The platform is designed as a unified large language model API gateway, with access to providers such as OpenAI, Anthropic, Google, DeepSeek, and Alibaba through one compatible format. It also supports smart routing, multi provider failover, pay as you go billing, no subscription fee, transparent model price comparison, OpenAI SDK compatibility, and some models priced as low as 70 percent of official pricing.
The direct cost saving angle is simple: same budget, more business volume. If a workflow can route easier tasks to lower cost models and reserve stronger models for high value steps, the same budget can run more work.
PP API also supports quick migration. Its Quick Start guide says developers can point the base URL to PP API, use a PP API key, and keep an OpenAI compatible Chat Completions format. The same guide shows that model switching only requires changing the model parameter, such as moving from GPT to DeepSeek, Qwen, or Gemini models.
For token management, visibility matters even more than slogans. PP API’s Dashboard shows model usage distribution, usage trends, request distribution, and API Key level filtering. It also supports hourly, daily, and weekly aggregation, and the dashboard usually updates within 1 minute.
Refuse bill shock. If token waste is the hidden cost of agents, teams need a minute level token dashboard to optimize every prompt, every loop, and every model choice.
FAQs
Why is token efficiency different from model intelligence?
Model intelligence measures what a model can solve. Token efficiency measures how many tokens it needs to solve the same task. A model can be smart and still expensive if it wastes tokens.
Why do agents consume so many more tokens than chatbots?
Agents work in loops. They plan, call tools, read outputs, revise plans, and repeat. A recent agentic coding study found that agentic tasks consumed about 1000 times more tokens than code reasoning and code chat.
Does using more tokens always improve accuracy?
No. The same study found that higher token usage did not always translate into higher accuracy. In many cases, accuracy peaked at an intermediate cost level and then saturated.
Why is Elephant Alpha worth watching?
Elephant Alpha is worth watching because it focuses on intelligence efficiency, not only model size. OpenRouter describes it as a 100B parameter model built to deliver strong performance while minimizing token usage.
How should teams reduce token waste in practice?
Teams should shorten prompts, avoid repeated context, cap retries, route tasks by value, monitor token use by key, and compare model costs before sending everything to the most expensive model.