Summary. Cloud bills are rising and AI is the reason. Companies waste about 27% of cloud spend, more than $100 billion globally in 2026, with idle compute and over-provisioned instances the biggest culprits, and CloudZero measures the average estate running near 35% waste even as FinOps matures. AI has changed the shape of the problem: the FinOps Foundation's State of FinOps 2026 survey of more than 1,200 organisations found that 98% now manage AI spend, up from 31% two years ago, and ranked FinOps for AI the top forward-looking priority. The stakes are high, because in Gartner's words "cost is as big an AI risk as security," and the firm warns that misjudging how generative AI costs scale can produce a 500% to 1,000% error. For Indian teams the pressure is sharper, with public cloud spending set to reach $17.5 billion in 2026, up about 28% in a year. This guide sets out six FinOps moves Indian engineering teams use to cut their AWS, Azure, and Google Cloud bills in 2026, the saving each delivers, and the AI-specific layer most teams are still missing. For broader cloud-bill tactics, see our guide on how Indian companies cut cloud costs.
FinOps used to mean trimming idle servers. In 2026 it means that plus a new and faster-growing line: the cost of running AI. The six moves below start with the proven cloud levers, then add the AI-specific ones, because a team that optimises virtual machines while ignoring its token bill is fixing the wrong half of the invoice.
Why the bill is rising, and why AI changed it
Two numbers frame the problem. Public cloud spending is heading past $1 trillion globally in 2026, and waste runs between 32% and 40% in organisations without a FinOps practice, falling to 15% to 20% in mature ones. The gap between those two ranges is the prize.
AI reshaped the picture because its costs behave differently from a virtual machine. GPU time is expensive and scarce, model inference is billed per token, and usage can spike with a single popular feature. J.R. Storment, executive director of the FinOps Foundation, frames the shift directly: "As companies pursue transformation via AI, with the resulting increases in AI costs, FinOps practices will be critical to enable c-level decisions about multi-year strategic technology investments across infrastructure types." The practical reading for an engineering lead is that the old cost playbook still matters, and there is now a second playbook for AI sitting on top of it.
1. Commit to the right discount
The single largest lever is committing to spend in exchange for a discount. AWS Reserved Instances reach up to 72% off, Google Cloud committed use discounts up to about 70%, and Azure reservations up to roughly 65%, with the more flexible Savings Plans giving 20% to 50% depending on term and provider. The trap is over-committing to the wrong shape, so the discipline is to commit only to your stable baseline and leave the variable layer on demand or spot.
The opportunity for most teams is coverage. The median enterprise covers only 55% to 65% of compute with commitments, while the best-optimised teams reach 70% to 80%. Closing that gap is often the fastest large saving available, and it requires no code change, only an accurate read of your steady-state usage.
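One way to read steady-state usage, sketched in Python under stated assumptions: treat a low percentile of hourly usage as the baseline worth committing to, then measure how much of total usage that commitment would cover. The function names, the 10th-percentile choice, and the sample vCPU figures are illustrative assumptions, not values from any provider.

```python
def steady_baseline(hourly_usage, percentile=0.10):
    """Usage level that most hours meet or exceed (a low percentile).

    The 10th percentile is an assumed starting point, not a rule.
    """
    ordered = sorted(hourly_usage)
    index = int(percentile * (len(ordered) - 1))
    return ordered[index]


def commitment_coverage(hourly_usage, committed):
    """Share of total usage that falls under the committed level."""
    covered = sum(min(used, committed) for used in hourly_usage)
    return covered / sum(hourly_usage)


# Hypothetical hourly vCPU usage with two spikes.
usage = [40, 42, 45, 41, 90, 120, 43, 44, 40, 60]
baseline = steady_baseline(usage)
print(baseline, round(commitment_coverage(usage, baseline), 3))  # 40 0.708
```

Committing at the baseline keeps the commitment fully used in every hour of this sample, while the spikes stay on demand or spot.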
2. Kill idle resources and rightsize
The cheapest saving is switching off what you are not using. Idle compute is the biggest single source of waste at around 35%, and over-provisioned instances add another 25%, which means more than half of wasted spend comes from machines that are either doing nothing or are larger than the job needs. Rightsizing matches the instance to the real workload, and scheduling shuts non-production environments off nights and weekends.
These are the quick wins, and in the Indian market structured rightsizing and scheduling of non-production environments commonly return 15% to 20% within the first month. They need no commitment and no architecture change, which is why a FinOps programme should start here while the commitment analysis runs in parallel.
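The arithmetic behind scheduling is simple enough to check by hand. The sketch below, with hypothetical function names and an assumed 5% CPU threshold, computes the share of always-on compute hours a schedule removes and flags a machine that looks idle.

```python
HOURS_PER_WEEK = 7 * 24


def schedule_saving(hours_per_day, days_per_week):
    """Fraction of always-on compute hours removed by a run schedule."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


def looks_idle(cpu_percent_samples, threshold=5.0):
    """Flag a machine whose CPU never rose above the threshold.

    The 5% threshold is an assumption; pick one that fits the workload.
    """
    return max(cpu_percent_samples) < threshold


print(round(schedule_saving(12, 5), 3))   # 12 hours a day, weekdays only: 0.643
print(looks_idle([1.2, 0.8, 2.5, 1.1]))   # True
```

The saving applies only to the compute hours of the scheduled environments, not to the whole bill, and attached storage typically keeps billing while machines are stopped.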
3. Use spot and preemptible capacity for the right work
Spot instances on AWS, preemptible and spot virtual machines on Google Cloud, and spot on Azure sell spare capacity at a steep discount, in exchange for the provider reclaiming it on short notice. That trade-off is wrong for a latency-sensitive customer-facing API and right for interruptible work: batch jobs, data pipelines, continuous integration, and, crucially, much of AI training and offline inference. Pairing on-demand or reserved capacity for steady services with spot for interruptible workloads captures a large discount without risking the user experience. For AI specifically, batch inference pipelines run well on spot, while a synchronous API needs on-demand or reserved capacity.
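The placement rule can be written down as a small decision function. This is a minimal sketch: the `Workload` fields, the three labels, and the example job names are assumptions made for illustration, and a real policy would weigh more factors such as checkpointing cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    interruptible: bool       # can be stopped and retried or resumed
    latency_sensitive: bool   # a user is waiting on the response
    steady: bool              # runs around the clock at a stable level


def placement(workload):
    """Pick a purchase option from the workload's tolerance for interruption."""
    if workload.interruptible and not workload.latency_sensitive:
        return "spot"
    if workload.steady:
        return "reserved"
    return "on-demand"


jobs = [
    Workload("nightly-batch-inference", True, False, False),
    Workload("checkout-api", False, True, True),
    Workload("launch-day-burst", False, True, False),
]
for job in jobs:
    print(job.name, "->", placement(job))
```

Interruption tolerance is checked first so that a latency-sensitive service never lands on spot, whatever its usage pattern.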
4. Cut the AI bill at the model and token layer
This is the layer most teams miss, and it is where AI spend is won or lost. Inference cost falls across four steps applied in order: change the model, optimise the runtime, match the infrastructure, then monitor continuously. On the model side, FP8 quantisation on modern GPUs delivers 1.3 to 2 times the throughput of FP16 with under 2% quality loss on instruction-tuned models, which is a direct cut in GPU hours per request.
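The link between throughput and GPU hours can be made explicit. The sketch below assumes request volume is fixed and GPU hours scale inversely with throughput; real deployments also depend on batching and utilisation. The function name is hypothetical.

```python
def gpu_hours_saved(throughput_gain):
    """Fraction of GPU hours saved when throughput rises by the given factor."""
    if throughput_gain <= 0:
        raise ValueError("throughput_gain must be positive")
    return 1 - 1 / throughput_gain


for gain in (1.3, 2.0):
    print(gain, f"{gpu_hours_saved(gain):.0%}")  # 1.3 -> 23%, 2.0 -> 50%
```

Under that assumption, the 1.3 to 2 times throughput range corresponds to roughly 23% to 50% fewer GPU hours for the same traffic.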
On the token side, the wins are larger still. Prompt compression, semantic caching, batch processing, and routing simple queries to cheaper models together cut large-language-model spend by 50% to 80%. Two facts drive the design: output tokens cost roughly four times input tokens, so shorter answers save more than shorter prompts, and tighter limits on retrieval-augmented generation can cut input tokens by more than half with no loss in precision. Where latency allows, a batch inference endpoint runs at about half the real-time token price. The control that holds it together is a weekly cost-per-million-tokens metric with alerts set at 80% of budget rather than 100%, so a runaway feature is caught with time to react.
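A cost-per-million-tokens metric and an early budget alert take only a few lines. This is a sketch, not a billing integration: the prices are hypothetical placeholders set at the roughly four-to-one output-to-input ratio described above, and the class and function names are assumptions.

```python
from dataclasses import dataclass

INPUT_PRICE_PER_M = 1.00    # hypothetical price per million input tokens
OUTPUT_PRICE_PER_M = 4.00   # hypothetical, about 4x the input price


@dataclass
class WeeklyUsage:
    input_tokens: int
    output_tokens: int


def weekly_spend(usage):
    """Total spend for the week at the assumed prices."""
    return (usage.input_tokens / 1e6 * INPUT_PRICE_PER_M
            + usage.output_tokens / 1e6 * OUTPUT_PRICE_PER_M)


def cost_per_million_tokens(usage):
    """Blended cost across input and output tokens."""
    total_tokens = usage.input_tokens + usage.output_tokens
    return weekly_spend(usage) / total_tokens * 1e6


def budget_alert(spend_to_date, budget, threshold=0.80):
    """True once spend reaches the alert threshold (80% by default)."""
    return spend_to_date >= threshold * budget


week = WeeklyUsage(input_tokens=800_000_000, output_tokens=200_000_000)
print(weekly_spend(week), round(cost_per_million_tokens(week), 2))  # 1600.0 1.6
print(budget_alert(1700, budget=2000))  # True
```

At these assumed prices, trimming 50 million output tokens saves 200 while trimming 50 million input tokens saves 50, which is why shorter answers matter more than shorter prompts.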
5. Tame storage, data transfer, and egress
The bill is not only compute. Storage quietly accumulates as old snapshots, logs, and unused volumes pile up, and the fix is lifecycle policies that move cold data to cheaper tiers and delete what is past its retention. Data transfer is the sharper trap, because moving data out of a cloud, or between regions and availability zones, carries egress charges that surprise teams at scale. For an Indian company serving local users, keeping compute and data in the same region cuts both latency and transfer cost, and for AI workloads moving large training datasets, egress can quietly become a line item worth engineering around.
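A lifecycle policy boils down to an age check per object. Each cloud expresses this as declarative configuration rather than code, so the Python below is only a sketch of the logic; the 30-day and 365-day limits and the action labels are placeholder assumptions.

```python
from datetime import date


def lifecycle_action(created, today, cold_after_days=30, retention_days=365):
    """Decide what to do with a stored object based on its age.

    Both day limits are placeholder assumptions; real values come from
    the team's retention policy.
    """
    age_days = (today - created).days
    if age_days > retention_days:
        return "delete"
    if age_days > cold_after_days:
        return "move-to-cold-tier"
    return "keep-hot"


today = date(2026, 6, 21)
for created in (date(2026, 6, 10), date(2026, 1, 1), date(2025, 1, 1)):
    print(created, lifecycle_action(created, today))
```

Checking retention before the cold-tier rule ensures expired data is deleted rather than moved and billed again.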
6. Build the FinOps practice itself
The first five moves are tactics. The sixth is the system that keeps them working: a FinOps practice. That means tagging every resource so cost can be attributed to a team or product, showback or chargeback so engineers see what they spend, automated anomaly detection so a cost spike raises an alert the same day, and a regular cadence where engineering and finance look at the numbers together. The evidence is stark: organisations without this discipline waste 32% to 40%, those with it waste 15% to 20%, and around 70% of large enterprises now run a dedicated FinOps team. The practice is what turns a one-time clean-up into a durable habit, and it is the difference between the two waste numbers.
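Two of these practices, showback by tag and same-day anomaly detection, can be sketched in a few lines. The record shape, the `team` tag key, and the z-score threshold of 3 are assumptions for illustration; production tools use richer billing exports and seasonality-aware models.

```python
from collections import defaultdict
from statistics import mean, stdev


def showback(line_items):
    """Sum cost per team tag; untagged spend gets its own bucket."""
    totals = defaultdict(float)
    for item in line_items:
        team = item.get("tags", {}).get("team", "UNTAGGED")
        totals[team] += item["cost"]
    return dict(totals)


def is_anomaly(daily_history, today_cost, z_threshold=3.0):
    """Flag today's cost if it sits far above the recent daily average."""
    average = mean(daily_history)
    spread = stdev(daily_history)
    if spread == 0:
        return today_cost > average
    return (today_cost - average) / spread > z_threshold


items = [
    {"cost": 10.0, "tags": {"team": "search"}},
    {"cost": 5.0, "tags": {}},
    {"cost": 2.5, "tags": {"team": "search"}},
]
print(showback(items))
print(is_anomaly([100, 102, 98, 101, 99], 180))
```

Keeping an explicit `UNTAGGED` bucket makes unowned spend visible in the same report, which gives tagging gaps an owner.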
| FinOps move | What it cuts | Typical saving |
|---|---|---|
| 1. Commit to discounts | Steady-state compute price | Up to 72% on committed use |
| 2. Kill idle and rightsize | Idle and oversized machines | 15-20% in the first month |
| 3. Spot and preemptible | Interruptible and batch work | Steep discount on spare capacity |
| 4. Optimise model and tokens | GPU hours and LLM token spend | 50-80% on AI inference |
| 5. Storage and egress | Cold data and data transfer | Lifecycle and region savings |
| 6. FinOps practice | Untracked, unowned spend | 32-40% waste down to 15-20% |
Commitment discounts across the three clouds
The commitment models differ by provider, so a multi-cloud Indian team has to read each one. The headline rates are similar, but the flexibility and the lock-in are not.
| Cloud and model | What it suits | Indicative discount |
|---|---|---|
| AWS Reserved Instances | Predictable, fixed instance family | Up to 72% |
| AWS Savings Plans | Flexible compute commitment | 20-45% |
| Azure Reservations | Predictable virtual machines | Up to about 65% |
| Azure Savings Plans | Flexible compute commitment | 20-50% |
| Google Cloud committed use | Stable long-term usage | Up to 70% |
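The over-commitment trap has a simple break-even. The sketch below assumes a commitment is billed for every hour of the term whether or not it is used, and normalises the on-demand rate to 1; the function names are hypothetical.

```python
def break_even_utilisation(discount):
    """Minimum share of committed hours that must be used to beat on-demand."""
    return 1 - discount


def net_saving(discount, utilisation):
    """Saving versus on-demand, as a fraction of the commitment's on-demand value.

    Assumes the commitment is billed for every hour whether used or not.
    Negative means the commitment cost more than paying on demand.
    """
    committed_cost = 1 - discount   # paid for every hour of the term
    on_demand_cost = utilisation    # paid only for hours actually used
    return on_demand_cost - committed_cost


for discount in (0.72, 0.45, 0.20):
    print(f"{discount:.0%} discount breaks even at "
          f"{break_even_utilisation(discount):.0%} utilisation")
```

A deep discount tolerates a lot of unused commitment, while a 20% discount loses money as soon as utilisation drops below 80%, so the shallower the discount, the more accurate the baseline needs to be.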
The AI cost levers worth knowing
These are the moves that separate a controlled AI bill from a runaway one, and most predate any vendor tool.
| AI cost lever | What it does | Reported impact |
|---|---|---|
| FP8 quantisation | Runs the model in lower precision | 1.3-2x throughput, under 2% quality loss |
| Semantic caching | Reuses answers to repeated queries | Part of a 50-80% LLM spend cut |
| Batch inference endpoint | Trades latency for a lower rate | About 50% of real-time token cost |
| Tighter RAG token caps | Sends less context per request | Cuts input tokens by over half |
| Spot GPUs for training | Uses spare GPU capacity | Steep discount on interruptible runs |
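Caching and model routing combine naturally in front of the model call. The sketch below is deliberately simplified: it uses an exact-match cache on normalised text as a stand-in for semantic caching (which would compare embeddings), routes by word count as a crude proxy for query difficulty, and treats the two models as plain callables. All names and the word limit are assumptions.

```python
def normalise(query):
    """Lower-case and collapse whitespace so trivial variants share a key."""
    return " ".join(query.lower().split())


class CachedRouter:
    def __init__(self, cheap_model, strong_model, max_simple_words=12):
        self.cheap_model = cheap_model      # hypothetical callable: str -> str
        self.strong_model = strong_model    # hypothetical callable: str -> str
        self.max_simple_words = max_simple_words
        self.cache = {}
        self.calls = {"cache": 0, "cheap": 0, "strong": 0}

    def answer(self, query):
        key = normalise(query)
        if key in self.cache:               # repeated query: no model call
            self.calls["cache"] += 1
            return self.cache[key]
        if len(key.split()) <= self.max_simple_words:
            tier, model = "cheap", self.cheap_model
        else:
            tier, model = "strong", self.strong_model
        self.calls[tier] += 1
        result = model(query)
        self.cache[key] = result
        return result


router = CachedRouter(lambda q: "cheap: " + q, lambda q: "strong: " + q,
                      max_simple_words=5)
router.answer("What is FinOps?")
router.answer("what   is finops?")   # served from the cache
print(router.calls)
```

Tracking calls per tier gives a direct read on how much traffic the cache and the cheaper model absorb, which is the figure the cost-per-million-tokens metric should move with.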
What it means for India
The Indian context raises the stakes. End-user public cloud spending in India is set to reach $17.5 billion in 2026, up about 28% from $13.7 billion in 2025, driven heavily by AI infrastructure demand, and roughly 85% of Indian enterprises already use two or more public-cloud providers. That multi-cloud reality means the commitment and discount work in move one has to be repeated for each provider in use, with no single bill to optimise.
The common failure is local but not unique: many Indian small and mid-size firms complete a cloud migration and then skip the financial-optimisation phase, leaving 20% to 40% of spend unnecessary. With the rupee adding currency risk to a dollar-denominated cloud bill, that waste is more expensive than the raw percentage suggests. The practical path is to start with the quick wins in moves two and five for an immediate 15% to 20%, run the commitment analysis in move one against your steady-state usage, and stand up the AI cost controls in move four before, not after, an AI feature scales. The same cost discipline underpins any serious enterprise generative AI strategy, because an AI product with no unit-cost model is a budget risk waiting to surface.
A 90-day FinOps starting sequence
A FinOps programme works best as a sequence, not a big bang. A practical first quarter looks like this.
Weeks 1 to 4: see the spend. Turn on cost and usage reporting, tag every major resource by team and product, and build one dashboard that shows where the money goes. Most teams find their first surprise here, an idle cluster, a forgotten environment, or a logging bill no one owned. Switch off the obvious idle resources and schedule non-production environments to stop nights and weekends, which alone tends to return 15% to 20%.
Weeks 5 to 8: commit and rightsize. With a month of clean usage data, separate steady-state workloads from variable ones, then buy commitments against the steady baseline only, lifting coverage toward the 70% to 80% the best-optimised teams reach. Rightsize the over-provisioned instances the dashboard exposed, and move interruptible and batch work, including AI training, onto spot capacity.
Weeks 9 to 12: control the AI bill. Instrument token usage, compute a cost-per-million-tokens metric, and set budget alerts at 80%. Add semantic caching and prompt compression, route simple queries to cheaper models, and tighten retrieval context. Quantise where quality allows. By the end of the quarter the team has a measured before-and-after and, more importantly, the tagging, alerts, and review cadence that keep the savings from leaking back.
The discipline that makes this stick is the review itself: a standing session where engineering and finance read the same numbers, so a cost spike surfaces as a shared problem to solve rather than a line discovered in a month-end invoice. The tools change every year. The habit is what holds.
FAQ
How eCorpIT can help
eCorpIT is a CMMI Level 5 technology organisation in Gurugram whose senior engineering teams run FinOps for cloud and AI workloads on AWS, Azure, and Google Cloud. We find the quick wins first, set commitment coverage against your real baseline, build the tagging and anomaly detection that keep spend visible, and apply the model, token, and GPU optimisations that control an AI bill before it scales. You can read more about eCorpIT and its director Manu Shukla. To scope a cloud and AI cost review, contact our team.
References
_Last updated: 21 June 2026._