Summary. LLM spend is the line on the cloud bill that grows fastest and hides best, and the right tools turn it from a surprise into a number you control. Strategic optimisation can cut LLM costs by 60% to 80% while holding or improving quality, and most of that comes from four moves the tools below automate: see the spend, route it through a gateway, cache repeated answers, and compress prompts. The categories are mature in 2026. Observability platforms such as Langfuse, which is open-source, and Helicone track cost per request automatically, while LangSmith, from $39 per user a month, ties into LangChain. AI gateways such as LiteLLM and Portkey put budget caps and routing in front of every call, with Portkey users reporting 30% to 50% savings from semantic caching alone. Caching tools such as GPTCache reach 30% to 60% hit rates on repetitive traffic, and prompt-compression tools such as LLMLingua shrink prompts up to 20 times. As Gartner's Mary Mesaglio puts it, "cost is as big an AI risk as security." This guide covers seven tools every engineering team should know, what each one cuts, and how to combine them. For the FinOps method behind the tools, see our guide on cutting AI cloud bills.
Tools named here are examples of mature options in each category, not endorsements; the right pick depends on your stack, your spend, and whether you need self-hosting for compliance. Verify current pricing and features before you commit, because this market moves fast.
Why you need cost tooling, not just discipline
Good intentions do not lower a token bill. LLM cost is generated by code making API calls thousands of times a second, so it has to be measured and controlled in code, not in a spreadsheet after the fact. That is what these tools do: they sit in the path of every model call to record what it cost, enforce a budget, reuse an answer, or shrink a prompt before it is sent.
The categories stack into four layers, and most teams adopt them in this order: observability to see the spend, a gateway to control and route it, caching to avoid paying twice for the same answer, and compression to send fewer tokens. A team that adds all four can typically reach the 60% to 80% savings cited in the summary above. The seven tools below populate those layers.
1. Langfuse: open-source observability and cost tracking
You cannot cut what you cannot see, and Langfuse is the open-source default for seeing it. It traces every LLM call and calculates cost automatically from the model and token counts, with predefined pricing for OpenAI, Anthropic, and Google models and granular tracking of input, output, cached, audio, and image tokens. Because it is MIT-licensed, you can self-host it, which matters for teams that cannot send prompt data to a third party. Langfuse is the tool that turns "our AI bill went up" into "this feature, on this model, is the cost."
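The cost calculation described above, price per token type multiplied by token counts, can be sketched in a few lines. This is a minimal illustration and not Langfuse's implementation: the model names and per-million-token prices are hypothetical, and it assumes cached tokens are a subset of the input tokens billed at a lower rate.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; real rates vary by provider.
PRICES_PER_MILLION = {
    "premium-model": {"input": 10.00, "output": 30.00, "cached_input": 2.50},
    "small-model": {"input": 0.50, "output": 1.50, "cached_input": 0.125},
}


@dataclass
class Usage:
    model: str
    input_tokens: int
    output_tokens: int
    cached_input_tokens: int = 0  # assumed to be part of input_tokens


def request_cost(usage: Usage) -> float:
    """Return the USD cost of one call from its model and token counts."""
    price = PRICES_PER_MILLION[usage.model]
    uncached = usage.input_tokens - usage.cached_input_tokens
    total = (
        uncached * price["input"]
        + usage.cached_input_tokens * price["cached_input"]
        + usage.output_tokens * price["output"]
    )
    return total / 1_000_000


call = Usage("premium-model", input_tokens=6_000, output_tokens=500,
             cached_input_tokens=2_000)
print(f"{request_cost(call):.4f} USD")  # prints "0.0600 USD"
```

Attaching a feature name to each `Usage` record is what lets an observability tool attribute the bill to a specific feature and model.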
2. Helicone: the fastest way to start tracking cost
Where Langfuse is feature-rich, Helicone is fast to adopt. It works as a proxy, so you change a base URL and immediately get automatic cost tracking with no SDK rewrite, and proxies like it typically add zero markup on the underlying LLM cost. For a team that wants logging and cost visibility in minutes rather than a day, Helicone is the lowest-friction starting point, and it suits smaller, simpler workloads well before a more involved platform is justified.
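The proxy pattern amounts to a configuration change rather than a code change. The sketch below shows the idea with placeholder values only: both URLs and the `X-Proxy-Auth` header name are hypothetical, so take the real endpoint and header from your proxy's documentation.

```python
from typing import Optional

# Hypothetical endpoints; substitute the values from your provider and proxy.
DIRECT_BASE_URL = "https://api.provider.example/v1"
PROXY_BASE_URL = "https://proxy.example.com/v1"


def client_settings(provider_key: str, proxy_key: Optional[str] = None) -> dict:
    """Build HTTP client settings; a proxy key reroutes calls via the proxy."""
    settings = {
        "base_url": DIRECT_BASE_URL,
        "headers": {"Authorization": f"Bearer {provider_key}"},
    }
    if proxy_key is not None:
        # Only the base URL and one extra header change; the request
        # payloads and the rest of the application stay the same.
        settings["base_url"] = PROXY_BASE_URL
        settings["headers"]["X-Proxy-Auth"] = f"Bearer {proxy_key}"
    return settings


direct = client_settings("provider-key")
proxied = client_settings("provider-key", proxy_key="proxy-key")
```

Because the difference is confined to settings, the proxy can be switched on per environment and removed again without touching call sites.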
3. LiteLLM: the gateway with budget controls
LiteLLM is the open-source AI gateway most self-hosted teams reach for. It exposes more than 100 providers behind a single OpenAI-compatible API, so your code targets one interface while you switch models underneath, and it ships the cost controls that matter: virtual keys, budget caps, and automatic fallbacks. It runs free on any server with low overhead, around 8 milliseconds of added latency at a thousand requests a second. LiteLLM is where you enforce a hard spend limit per team or feature, so a runaway loop cannot run up an unbounded bill.
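To show what a hard spend limit means in practice, here is a minimal in-memory sketch of a per-key budget cap. It is not LiteLLM's API: the `BudgetGuard` class and the key names are hypothetical, and a real gateway would persist spend and reset it on a schedule.

```python
class BudgetExceeded(Exception):
    """Raised when a call would push a key past its spend limit."""


class BudgetGuard:
    def __init__(self, limits_usd: dict):
        self.limits = dict(limits_usd)
        self.spent = {key: 0.0 for key in limits_usd}

    def charge(self, key: str, cost_usd: float) -> float:
        """Record a call's cost, or refuse it if the cap would be exceeded.

        Returns the budget remaining for the key after the charge.
        """
        if self.spent[key] + cost_usd > self.limits[key]:
            raise BudgetExceeded(
                f"{key} would exceed its {self.limits[key]:.2f} USD cap"
            )
        self.spent[key] += cost_usd
        return self.limits[key] - self.spent[key]


guard = BudgetGuard({"support-chat": 1.00})
guard.charge("support-chat", 0.60)      # accepted, 0.40 USD remains
try:
    guard.charge("support-chat", 0.60)  # would total 1.20 USD, so refused
    blocked = False
except BudgetExceeded:
    blocked = True
```

The check runs before the spend is recorded, so a runaway loop is stopped at the cap instead of being noticed on the invoice.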
4. Portkey: a gateway with caching and guardrails
Portkey occupies the production-guardrails end of the gateway category. Alongside routing and budget caps, it adds semantic caching, PII redaction, and jailbreak detection, and companies using it commonly report 30% to 50% reductions in LLM cost from caching alone. In March 2026 Portkey open-sourced its gateway under Apache 2.0, so the core routing and guardrails can be self-hosted without the managed platform. For teams that want cost control and safety controls in one layer, it is a strong fit.
5. OpenRouter: managed access and model arbitrage
OpenRouter is the zero-ops option. It gives instant access to more than 300 models through one API with no infrastructure to run, charging the provider's per-token rate plus a small platform fee. Its cost value is arbitrage and flexibility: you can route a request to whichever model offers the best price for the quality you need, and switch as prices change, without integrating each provider yourself. The trade-off is less control, since it does not offer the hard budget caps that LiteLLM and Portkey do, so pair it with observability to watch spend.
6. GPTCache: semantic caching to stop paying twice
The cheapest call is the one you never make, and GPTCache is the mature open-source way to avoid it. It is a semantic cache, matching queries by meaning rather than exact text, so "what is your return policy" and "how do I return a product" hit the same cached answer. It integrates with LangChain and supports multiple embedding models, vector stores, and cache backends, and on traffic with high query overlap it reaches cache hit rates of 30% to 60%. Every hit is a model call avoided entirely, which is why caching is often the single largest saving in a customer-facing application.
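The lookup logic of a semantic cache can be sketched without any dependencies. This is an illustration of the technique, not GPTCache's API. The word-count `embed` function is a toy stand-in: matching true paraphrases, such as the two return-policy questions above, needs a real sentence-embedding model. The 0.75 threshold is likewise an assumption.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. Real caches use an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.75):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs
        self.hits = 0
        self.lookups = 0

    def get(self, query: str):
        """Return the answer of the most similar stored query, or None."""
        self.lookups += 1
        vec = embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            self.hits += 1
            return best[1]
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))


cache = SemanticCache()
cache.put("what is your return policy", "cached return-policy answer")
hit = cache.get("what is the return policy")    # similarity 0.8, a hit
miss = cache.get("how do i reset my password")  # no overlap, a miss
hit_rate = cache.hits / cache.lookups
```

On a miss the application calls the model and stores the result with `put`, so the hit rate climbs as repeated questions accumulate.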
7. LLMLingua: prompt compression for long contexts
When you must make the call, send fewer tokens. LLMLingua is a prompt-compression tool that shrinks prompts by 5 to 20 times while preserving their meaning, which is most valuable for retrieval-augmented generation, where long retrieved contexts dominate the token count. Because input tokens are billed on every call, cutting context length is a direct, recurring saving, and LLMLingua does it programmatically rather than by hand. For a RAG system on a tight budget, it is one of the highest-leverage tools available.
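The arithmetic behind the recurring saving is worth making explicit. The sketch below does not call LLMLingua. It only estimates the input-token bill before and after compression, using hypothetical traffic and price figures and a 5x ratio from the low end of the range above.

```python
def monthly_input_cost(tokens_per_call: float, calls_per_month: int,
                       usd_per_million_input: float,
                       compression_ratio: float = 1.0) -> float:
    """Estimate monthly spend on input tokens at a given compression ratio."""
    compressed_tokens = tokens_per_call / compression_ratio
    return compressed_tokens * calls_per_month * usd_per_million_input / 1_000_000


# Hypothetical workload: 8,000 context tokens per call, 500,000 calls a
# month, and 10 USD per million input tokens.
before = monthly_input_cost(8_000, 500_000, 10.0)
after = monthly_input_cost(8_000, 500_000, 10.0, compression_ratio=5.0)
saving_fraction = 1 - after / before

print(before, after, saving_fraction)  # prints "40000.0 8000.0 0.8"
```

Only the input side shrinks. Output tokens are billed as before, so the saving on the whole bill depends on how input-heavy the workload is.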
| Tool | Category | What it cuts | Licensing or pricing |
|---|---|---|---|
| Langfuse | Observability | Blind spend; finds the costly feature | Open-source (MIT) |
| Helicone | Observability proxy | Setup time to start tracking | Free tier |
| LiteLLM | AI gateway | Runaway spend, via budget caps | Open-source |
| Portkey | Gateway plus caching | 30-50% via semantic caching | Open-source core |
| OpenRouter | Managed gateway | Per-call cost via model arbitrage | Pay per token |
| GPTCache | Semantic caching | 30-60% of repeated calls | Open-source |
| LLMLingua | Prompt compression | Input tokens, 5-20x on long prompts | Open-source |
How to combine them into a stack
The tools are layers, not alternatives, and they compound. A sensible stack starts with observability, because you cannot prioritise what you cannot measure, then adds a gateway for control, then caching and compression for the largest savings. Model routing, which the gateway layer already makes possible, appears as a fifth step in the table below for teams that want to automate it. Each layer multiplies the next: caching removes calls, compression shrinks the calls that remain, and the gateway caps whatever is left, while observability tells you whether any of it is working.
| Layer | Tool examples | Typical effect |
|---|---|---|
| 1. Observe | Langfuse, Helicone, LangSmith | Find the costly features and models |
| 2. Gateway | LiteLLM, Portkey, OpenRouter | Budget caps, routing, one API |
| 3. Cache | GPTCache, Redis vector cache | 30-60% of repeated calls removed |
| 4. Compress | LLMLingua | 5-20x smaller prompts on long context |
| 5. Route | OpenRouter, RouteLLM | Cheaper model for the same quality |
Adopt the layers in order and measure after each, rather than installing everything at once. Observability first tells you where the money goes; often a single feature on an expensive model is most of the bill, and fixing that one thing pays for the whole effort.
What to track once the tools are in
The tools produce data; the discipline is acting on it. Track cost per million tokens as a weekly number, cache hit rate as the health metric for your caching layer, and cost per request broken down by feature and by model, because that breakdown is where the expensive surprises hide. Set budget alerts at 80% rather than 100%, so a spike is caught with time to react, and review the numbers with the team on a regular cadence. The same outcome-and-quality discipline that governs any enterprise generative AI strategy applies here: a saving that quietly lowers answer quality is not a saving.
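As a rough sketch of the tracking described above, the following groups logged calls by feature and model and flags any feature that has reached 80% of its budget. The record fields, feature names, and budget figures are hypothetical; an observability platform would supply the real data.

```python
from collections import defaultdict

ALERT_FRACTION = 0.80  # alert at 80% of budget, leaving time to react


def cost_by_feature_and_model(records: list) -> dict:
    """Sum cost per (feature, model) pair."""
    totals = defaultdict(float)
    for record in records:
        totals[(record["feature"], record["model"])] += record["cost_usd"]
    return dict(totals)


def budget_alerts(totals: dict, budgets_usd: dict) -> list:
    """Return the features whose spend has reached the alert threshold."""
    spend_by_feature = defaultdict(float)
    for (feature, _model), cost in totals.items():
        spend_by_feature[feature] += cost
    return sorted(
        feature
        for feature, spent in spend_by_feature.items()
        if spent >= ALERT_FRACTION * budgets_usd[feature]
    )


records = [
    {"feature": "support-chat", "model": "premium-model", "cost_usd": 70.0},
    {"feature": "support-chat", "model": "small-model", "cost_usd": 15.0},
    {"feature": "search", "model": "small-model", "cost_usd": 20.0},
]
budgets = {"support-chat": 100.0, "search": 100.0}

totals = cost_by_feature_and_model(records)
alerts = budget_alerts(totals, budgets)
print(alerts)  # prints "['support-chat']"
```

The per-pair totals show which model inside a feature drives the spend, which is usually the first question once an alert fires.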
Three more tools worth knowing
The seven above cover the core, but three more deserve a place on the radar. RouteLLM is a model router: it learns which queries a cheap model can handle and which need an expensive one, then sends each request to the lowest-cost model that will do the job, capturing the routing saving automatically rather than by hand-written rules. For teams with a wide quality range across their traffic, it can cut cost without a visible quality drop.
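The routing decision itself is simple once a difficulty estimate exists. The sketch below is not RouteLLM's API, and it replaces the learned router with a hand-written heuristic purely for illustration: the scoring rule, the 0.5 threshold, and the model names are all assumptions.

```python
def difficulty_score(query: str) -> float:
    """Hypothetical stand-in for a learned router.

    Longer, multi-part questions score higher, capped at 1.0.
    """
    words = len(query.split())
    clauses = query.count("?") + query.count(" and ") + query.count(",")
    return min(1.0, words / 40 + 0.2 * clauses)


def choose_model(query: str, threshold: float = 0.5) -> str:
    """Send a query to the cheap model unless it looks hard."""
    if difficulty_score(query) >= threshold:
        return "premium-model"
    return "small-model"


easy = choose_model("Where is my order?")
hard = choose_model(
    "Compare the refund terms of plan A and plan B, explain which applies "
    "to a cancelled annual contract, and draft a reply?"
)
print(easy, hard)  # prints "small-model premium-model"
```

A learned router replaces `difficulty_score` with a model trained on which queries the cheap model answers acceptably, while the thresholding step stays much the same.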
LangSmith is the observability choice for teams built on LangChain or LangGraph, where its tight integration shows exactly where an agent chain spends its tokens and its time. It starts around $39 per user a month, so it is a paid pick, but for a deep LangChain investment the integration earns it.
Cloudflare AI Gateway is the managed counterpart to a self-hosted gateway, using Cloudflare's edge network for caching and analytics with no infrastructure of your own to run. It trades the fine control of LiteLLM or Portkey for zero operations. Beyond those three, Redis with vector search is the other common semantic-cache backend alongside GPTCache, and many teams already run Redis, which makes it a low-friction place to add caching. The point of naming these is that each core category has more than one credible option, so the right tool is the one that fits your existing stack and your constraints, not a single winner.
A worked example: cutting a RAG bill
To make the stack concrete, take a common case: a retrieval-augmented chatbot answering customer questions, whose bill has crept up as usage grew. The fix follows the four layers in order.
First, observe. Adding Langfuse reveals that one endpoint, the support chat, is most of the spend, and that it runs on a top-tier model with long retrieved contexts. That single finding tells you where to act, and without it you would optimise blindly.
Second, gateway and route. Putting LiteLLM in front lets you set a hard monthly budget cap on the support feature so it cannot run away, and lets you route the simpler questions, the order-status and policy queries, to a cheaper model while keeping the expensive one only for genuinely hard answers. Even a rough split removes a large share of the premium-model calls.
Third, cache. A support chatbot gets the same questions constantly, so semantic caching with GPTCache earns a high hit rate, often in the 30% to 60% range. Every cached answer is a model call that does not happen at all.
Fourth, compress. The long retrieved contexts that RAG sends are mostly billable input tokens, so running them through LLMLingua to compress the context several times over cuts the cost of every call that does reach a model, with little effect on answer quality when tuned.
Stacked, these layers commonly land in the 60% to 80% total saving cited in the summary, on a bill that had been rising. None of the four is exotic, and all of the tools named are available today, most of them open-source. The work is adopting them deliberately and measuring after each layer, rather than reaching for one tool and hoping.
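To see how the layers compound, the worked example can be reduced to a back-of-envelope model. Every input below is a hypothetical figure chosen for illustration, and the model assumes the effects are independent and multiply, which real traffic only approximates. It also ignores the cost of running the cache and the compressor.

```python
def remaining_cost_fraction(cache_hit_rate: float, routed_share: float,
                            cheap_price_ratio: float, input_share: float,
                            compression_ratio: float) -> float:
    """Fraction of the original bill left after caching, routing, compression."""
    # Caching: only cache misses reach a model at all.
    after_cache = 1.0 - cache_hit_rate
    # Routing: a share of the misses runs on a model costing a fraction
    # of the premium price; the rest stays on the premium model.
    after_routing = after_cache * (
        routed_share * cheap_price_ratio + (1.0 - routed_share)
    )
    # Compression: only the input part of each remaining call shrinks.
    per_call = input_share / compression_ratio + (1.0 - input_share)
    return after_routing * per_call


remaining = remaining_cost_fraction(
    cache_hit_rate=0.30,     # low end of the 30% to 60% range
    routed_share=0.40,       # share of misses sent to the cheaper model
    cheap_price_ratio=0.20,  # cheaper model at a fifth of the premium price
    input_share=0.70,        # input tokens as a share of per-call cost
    compression_ratio=2.0,   # conservative 2x context compression
)
print(f"saving: {1 - remaining:.1%}")  # prints "saving: 69.1%"
```

Even with conservative inputs the combined effect lands inside the 60% to 80% band, because each layer acts on what the previous one left behind.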
What it means for India
For India's fast-growing community of AI startups and engineering teams, cost tooling is not optional, because an LLM bill is dollar-denominated while revenue is often in rupees, so token spend carries currency risk on top of its raw cost. The good news is that the strongest tools in this list are open-source and self-hostable: Langfuse, LiteLLM, GPTCache, and LLMLingua all run free on your own infrastructure, which suits cost-sensitive teams and keeps prompt data in your control for any personal data covered by India's Digital Personal Data Protection rules.
The practical path for an Indian team is to start with open-source observability to find where the money goes, add a self-hosted gateway with budget caps so no feature can run away, then layer caching and compression where the traffic justifies it. That sequence delivers most of the 60% to 80% saving with little or no tooling spend, which is exactly the economics a young company needs. The tools are mature and free; the work is adopting them before the bill arrives, not after.
How eCorpIT can help
eCorpIT is a CMMI Level 5 technology organisation in Gurugram whose senior engineering teams build and run cost-controlled AI systems. We instrument LLM spend with observability, put a gateway with budget caps in front of every call, add semantic caching and prompt compression where the traffic justifies it, and set up the weekly cost metrics that keep a bill predictable, using open-source tools where data sensitivity or budget requires it. You can read more about eCorpIT and its director Manu Shukla. To get your LLM costs under control, contact our team.
_Last updated: 21 June 2026._