Summary. Gartner expects over 40% of agentic AI projects to be canceled by the end of 2027, even as it forecasts task-specific agents in 40% of enterprise apps by the end of 2026, up from under 5% in 2025. The gap between those two numbers is an engineering gap. A March 2026 study of 6,259 production agents found a 56.6% aggregate success rate, and enterprise tests show an agent that succeeds 60% of the time on a single run falls to about 25% across eight runs. The global agentic AI market sits near $9 billion to $10.9 billion in 2026, up from roughly $7.29 billion in 2025, and in India, SAP projects agentic AI returns reaching $14.4 million, a fivefold rise. Yet only 21% of organisations have a mature governance model for autonomous agents, and 52% name data quality as their biggest blocker. After shipping agents into production, three lessons separate the projects that survive from the 40% that get canceled. This is what they are.
The honest starting point is that an agent demo and a shipped agent are different products. A demo proves an agent can complete a task once, under clean conditions. Production asks whether it works the thousandth time, on a flaky API, with an ambiguous request, against an adversarial input, inside a budget. Gartner's Anushree Verma, Senior Director Analyst, put the current state plainly: "Most agentic AI projects right now are early stage experiments or proof of concepts that are mostly driven by hype and are often misapplied." The three lessons below are the engineering reality behind that sentence, drawn from what actually breaks when agents leave the lab. They are written for founders, engineering leaders, and delivery teams who have to ship, not just prototype.
| The reality in numbers | Figure | What it means |
|---|---|---|
| Agentic projects canceled by 2027 | Over 40% (Gartner) | Most starts do not reach durable production |
| Production agent success rate | 56.6% across 6,259 agents | More than four in ten real runs fall short |
| Reliability decay | 60% on one run, 25% over eight | Errors compound across multi-step tasks |
| Lab-to-production gap | 37% | Benchmark scores overstate real performance |
| Mature agent governance | Only 21% of organisations | Oversight lags deployment badly |
Why agents break differently from other software
Before the lessons, it helps to name why agents fail in ways ordinary software does not. Conventional software is deterministic: the same input gives the same output, so a test that passes today passes tomorrow. An agent is probabilistic and multi-step. It chooses an action, calls a tool, reads the result, and chooses again, and a small error at step two reshapes everything after it. That is why single-run success and repeated-run success diverge so sharply, from 60% to about 25% over eight runs. It is also why a passing demo proves so little: one clean run samples the best case, while production samples the whole distribution, including the bad tail. Add the outside world (flaky APIs, rate limits, malformed data, ambiguous human phrasing, and adversarial inputs) and the 37% gap between benchmark and production performance stops being surprising. The practical consequence is that you cannot test an agent the way you test a function. You have to test it the way you test a distributed system under load and failure: many runs, fault injection, and metrics on the path taken, not only the final answer. All three lessons follow from that one fact. An agent is a system, and it has to be engineered like one.
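To make "many runs, fault injection" concrete, here is a minimal sketch of a repeated-run harness. Everything in it is hypothetical: `toy_agent` stands in for a real agent, `make_flaky_tool` is a stub that fails at a rate we choose, and the 20% failure rate is an assumption picked for illustration, not a figure from the studies cited above.

```python
import random
from collections import Counter
from typing import Callable


class ToolError(Exception):
    """Raised by the stub tool when a fault is injected."""


def make_flaky_tool(failure_rate: float, rng: random.Random) -> Callable[[str], str]:
    """Return a stub tool that fails with the given probability."""
    def tool(query: str) -> str:
        if rng.random() < failure_rate:
            raise ToolError(f"injected fault on {query!r}")
        return f"result for {query}"
    return tool


def toy_agent(task: str, tool: Callable[[str], str], steps: int = 4) -> list[str]:
    """Hypothetical stand-in for an agent: one tool call per step."""
    return [tool(f"{task}#{step}") for step in range(steps)]


def run_many(task: str, runs: int, failure_rate: float, seed: int = 0) -> Counter:
    """Run the agent repeatedly under injected faults and tally outcomes."""
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    tool = make_flaky_tool(failure_rate, rng)
    outcomes: Counter = Counter()
    for _ in range(runs):
        try:
            toy_agent(task, tool)
            outcomes["success"] += 1
        except ToolError:
            outcomes["tool_failure"] += 1
    return outcomes


if __name__ == "__main__":
    print(run_many("reconcile-invoices", runs=200, failure_rate=0.2))
```

A real harness would also record the path each run took, so that trajectory metrics sit next to the success count. The point of the sketch is only that the unit of testing is the distribution of outcomes, not a single run.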
Lesson 1: reliability is a systems problem, not a model problem
The first thing production teaches is that a more capable model does not fix an unreliable agent. The reliability collapse is structural. An agent that completes a task 60% of the time on a single attempt can drop to around 25% across eight runs, because errors compound at every step of a multi-step workflow. Across 6,259 real agents handling customer service, document processing, and workflow automation, the aggregate success rate was 56.6%. And there is a roughly 37% gap between benchmark scores and real-world performance, because benchmarks use clean inputs and predictable tools while production brings ambiguous requests, flaky APIs, rate limits, and odd data formats.
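The compounding effect is easy to see with a little arithmetic. The sketch below assumes every step succeeds with the same probability and that steps fail independently; both are simplifying assumptions of ours, and the 95% per-step figure is illustrative. It is not how the 60% and 25% figures above were measured.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming steps fail independently."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability")
    return per_step ** steps


if __name__ == "__main__":
    for steps in (1, 5, 10, 20):
        print(f"{steps:>2} steps at 95% each -> {chain_success(0.95, steps):.1%}")
```

Under those assumptions, a ten-step workflow at 95% per step completes only about 60% of the time, which is why a model that looks strong step by step can still produce an unreliable agent.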
Treat reliability as something you engineer around the model, not a property you hope the model has. That means evaluation at three levels: end-to-end, did the task succeed; trajectory-level, was the path efficient and sound; and component-level, which tool or sub-agent broke. It means bounded retries instead of infinite loops, since a common production failure is an agent retrying a failed API call until it burns the budget, sometimes costing 50 times more per task than planned. Watch the cost line as closely as the success line; the tooling in our guide to measuring LLM costs exists precisely because agents overspend quietly. Be sceptical of benchmark scores too: UC Berkeley researchers showed in April 2026 that every major agent benchmark can be gamed to near-perfect scores without solving a single task, a trap called corrupt success, where the agent reaches the right end state by an unsafe or illogical path.
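A bounded retry with a hard spend cap might look like the sketch below. The `Budget` class, the `call_with_retries` helper, and the per-call cost are all hypothetical names and numbers introduced for illustration; a real system would take costs from its own metering.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class BudgetExceeded(Exception):
    """Raised when another attempt would push spend past the task budget."""


class Budget:
    def __init__(self, limit_cents: int) -> None:
        self.limit_cents = limit_cents
        self.spent_cents = 0

    def charge(self, amount_cents: int) -> None:
        if self.spent_cents + amount_cents > self.limit_cents:
            raise BudgetExceeded(
                f"{self.spent_cents + amount_cents} cents would exceed {self.limit_cents}"
            )
        self.spent_cents += amount_cents


def call_with_retries(
    fn: Callable[[], T],
    budget: Budget,
    cost_cents: int,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
) -> T:
    """Call fn at most max_attempts times, charging the budget before each try."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        budget.charge(cost_cents)  # BudgetExceeded propagates and ends the loop
        try:
            return fn()
        except Exception as exc:  # in real code, catch only retryable errors
            last_error = exc
            if attempt < max_attempts - 1:
                time.sleep(base_delay_s * 2 ** attempt)  # exponential backoff
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Two limits apply at once: the attempt count stops an endless loop on one call, and the budget stops many bounded loops from adding up across a task. Whichever trips first ends the work loudly instead of quietly draining spend.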
The deeper move is to push as much determinism around the model as the task allows. Where a step can be a plain function call, a database lookup, or a rule, make it one, and reserve the model for the genuinely open-ended parts. Constrain outputs to schemas the rest of the system can parse, validate every tool call before it runs, and make each step idempotent so a retry cannot do damage twice. The goal is an agent whose non-deterministic surface is as small as the work allows, because every place you replace a guess with a guarantee is a place that stops failing at scale.
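As a rough illustration of "validate every tool call" and "make each step idempotent", the sketch below parses a model's proposed tool call against a fixed schema and runs it through a tool that replays, rather than repeats, a call it has already seen. The refund scenario, the `RefundCall` fields, and the `RefundTool` class are hypothetical examples of ours.

```python
import json
from dataclasses import dataclass

FIELDS = {"order_id": str, "amount_cents": int, "idempotency_key": str}


@dataclass(frozen=True)
class RefundCall:
    order_id: str
    amount_cents: int
    idempotency_key: str


def parse_refund_call(raw: str) -> RefundCall:
    """Validate model output against the schema before anything runs."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != set(FIELDS):
        raise ValueError(f"expected exactly the fields {sorted(FIELDS)}")
    for name, expected_type in FIELDS.items():
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"{name} must be {expected_type.__name__}")
    if data["amount_cents"] <= 0:
        raise ValueError("amount_cents must be positive")
    return RefundCall(**data)


class RefundTool:
    """Hypothetical tool that runs each idempotency key at most once."""

    def __init__(self) -> None:
        self._receipts: dict[str, str] = {}
        self.executions = 0

    def execute(self, call: RefundCall) -> str:
        if call.idempotency_key in self._receipts:
            return self._receipts[call.idempotency_key]  # retry: replay, do not redo
        self.executions += 1
        receipt = f"refunded {call.amount_cents} cents on {call.order_id}"
        self._receipts[call.idempotency_key] = receipt
        return receipt
```

In production the receipt store would need to be durable and shared across workers; an in-memory dictionary is used here only to keep the example self-contained.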
Lesson 2: the bottleneck is context, tools, and data, not the model
The second lesson is where teams waste the most time before they learn it. When an agent fails in production, the cause is usually not that the model was not smart enough. It is that the model was handed the wrong context, an unreliable tool, or dirty data. In one survey, 52% of organisations named data quality as the single biggest blocker to deploying agents. Agents fail by hallucinating a tool call that does not exist, by looping on an API that returns an unexpected format, or by acting on stale context. These are integration and data problems wearing an AI costume.
Take a common pattern. An agent asked to reconcile invoices looks capable against three clean PDFs in a demo, then fails in production because real invoices arrive as scanned images in several formats with missing fields, and the reading tool returns an empty string instead of an error. The model did nothing wrong; the pipeline around it did. The fix is rarely a better model. It is better extraction, tool contracts that fail loudly rather than silently, and a validation step that routes low-confidence reads to a human. Solve those and the same model that looked unreliable starts to hold.
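The invoice example suggests two small pieces of scaffolding: a wrapper that turns an empty read into an error, and a router that sends low-confidence reads to a person. The sketch below assumes the extraction tool reports a confidence score between 0 and 1, and the 0.85 threshold is a placeholder; both are assumptions of ours, as are the names `read_invoice` and `route`.

```python
from dataclasses import dataclass


class ExtractionError(Exception):
    """Raised when the extractor returns nothing usable."""


@dataclass(frozen=True)
class InvoiceRead:
    text: str
    confidence: float  # assumed to be reported by the extraction tool, 0.0 to 1.0


def read_invoice(raw_text: str, confidence: float) -> InvoiceRead:
    """Wrap an extractor result so that an empty read fails loudly."""
    if not raw_text.strip():
        raise ExtractionError("extractor returned no text")
    if not 0.0 <= confidence <= 1.0:
        raise ExtractionError(f"confidence {confidence} is out of range")
    return InvoiceRead(text=raw_text, confidence=confidence)


def route(read: InvoiceRead, threshold: float = 0.85) -> str:
    """Send confident reads to the agent and everything else to a human."""
    return "agent" if read.confidence >= threshold else "human_review"
```

With this in place the agent never sees an empty string pretending to be an invoice, and the cases it is least likely to handle well are the ones a person looks at.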
The engineering implication is to spend your effort on the scaffolding, not the prompt. Invest in clean, well-described tools with predictable failure behaviour, in retrieval that gives the agent the right context at the right step, and in data quality before autonomy. The Model Context Protocol and similar standards help by giving agents a consistent way to reach tools, but a standard interface to a bad tool is still a bad tool. There is also a buying lesson here: Gartner estimates only about 130 of the thousands of vendors claiming agentic products are real, with the rest engaging in agent washing, rebranding chatbots and robotic process automation as agents. The plain version we tell clients is that the model is the easy 20% of an agent; the other 80% is the context, tools, and data plumbing around it, and that is where projects are won or lost.
Lesson 3: security and human oversight are the product, not a feature
The third lesson is the one that turns a clever pilot into something an enterprise can actually run. Agents act with real credentials and real tools, which makes them a new attack surface. The dominant threat is prompt injection: an attacker hides instructions in a document, an email, or an API response, and the agent executes them using the access it already has. The attacker does not need to break in; they only need to convince the agent to misuse a tool it can already reach. Security scans found that 36.82% of nearly 4,000 scanned agent skills had at least one security flaw, including arbitrary file read, remote code execution, and tool poisoning, some in widely used official servers.
Most organisations are not ready for this. A 2026 survey found 41% to 44% had not implemented basic human-in-the-loop oversight, and 55% to 63% lacked purpose binding, kill switches, or network isolation. The fix is defence in depth, because model-level resistance to injection is a layer, not a solution. Pair it with architectural controls: tool allowlisting, identity binding so an agent acts only as a scoped principal, runtime monitoring, an MCP gateway that governs which tools each identity can call and logs every invocation, and human checkpoints on high-privilege actions. Run adversarial testing before launch; four-week adversarial reviews are becoming standard for high-privilege deployments.
A useful first step most teams skip is the inventory: a list of every production agent with its permission scopes, data access, and tool authorisations. Most organisations cannot produce that list, and the act of building it tends to surface the highest-risk exposures: the agent with broad write access nobody remembered granting, or the tool reachable by an identity that should never have had it. Together, the inventory and the gateway turn a sprawl of agent-tool connections into something you can actually audit. Human oversight is not a sign of immaturity here. It is the control that caps the blast radius when an agent goes wrong, which, over enough runs, it eventually will.
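The core of a tool gateway fits in a few lines. The sketch below is a simplified illustration, not an implementation of any particular MCP gateway product: `ToolGateway`, the identity and tool names, and the `approve` callback (standing in for a human checkpoint) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolGateway:
    allowlist: dict[str, set[str]]       # identity -> tools it may call
    high_privilege: set[str]             # tools that need human sign-off
    approve: Callable[[str, str], bool]  # human checkpoint: (identity, tool) -> approved?
    audit_log: list[tuple[str, str, str]] = field(default_factory=list)

    def call(self, identity: str, tool: str, fn: Callable[[], Any]) -> Any:
        """Run fn only if the identity may use the tool; log every decision."""
        if tool not in self.allowlist.get(identity, set()):
            self.audit_log.append((identity, tool, "denied"))
            raise PermissionError(f"{identity} is not allowed to call {tool}")
        if tool in self.high_privilege and not self.approve(identity, tool):
            self.audit_log.append((identity, tool, "rejected_by_human"))
            raise PermissionError(f"{tool} was not approved for {identity}")
        self.audit_log.append((identity, tool, "allowed"))
        return fn()
```

The design choice worth noting is default deny: an identity that is missing from the allowlist can call nothing, and every decision, including refusals, lands in the audit log. An injected instruction can still ask for a dangerous tool, but the request fails at the gateway instead of depending on the model to refuse.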
| Security control | Organisations lacking it | Why it matters |
|---|---|---|
| Human-in-the-loop oversight | 41% to 44% | No checkpoint on high-risk actions |
| Purpose binding, kill switches, or network isolation | 55% to 63% | No way to contain a misbehaving agent |
| Agent skill safety | 36.82% of skills flawed | Injection, RCE, and tool poisoning risk |
| Mature governance model | Only 21% have one | Oversight lags deployment |
India-specific considerations
The Indian market is scaling quickly and on the same fault lines. SAP's 2026 report projects Indian enterprises' agentic AI returns reaching $14.4 million, a fivefold increase, with $25.9 million in committed spending and 67% of businesses piloting agentic use cases. Bain expects Indian enterprise technology spending to rise 6% to 8% in 2026, with a large share of change budgets going to AI and data-led work. The same three lessons apply, and one carries extra weight here: data and security governance under the Digital Personal Data Protection Act 2023 and the DPDP Rules 2025. An agent that reads or writes personal data is a data-processing activity, so consent, purpose limitation, and breach duties attach to it. An Indian enterprise shipping an agent has to know exactly which personal data the agent can touch, under what consent, and with what audit trail, which is the same inventory discipline that good agent security requires anyway. The cost lesson matters too: with projected returns still measured in millions of dollars, an agent that overspends 50 times its budget erases the business case fast. For the broader plan, our note on generative AI enterprise strategy for 2026 covers how to put this governance in place.
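One way to keep that inventory honest is to hold it as data and query it for gaps. The sketch below is illustrative only: the `AgentRecord` fields are our own shorthand for the questions in the paragraph above (which personal data, under what consent, with what audit trail), not a schema prescribed by the DPDP Act or Rules, and it is not legal advice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentRecord:
    name: str
    permission_scopes: frozenset[str]
    tools: frozenset[str]
    personal_data: frozenset[str]       # categories of personal data the agent can touch
    consent_reference: Optional[str]    # pointer to the recorded consent basis, if any
    audit_log_location: Optional[str]   # where this agent's actions are logged, if anywhere


def governance_gaps(inventory: list[AgentRecord]) -> list[str]:
    """Name agents that touch personal data without recorded consent or an audit trail."""
    return [
        agent.name
        for agent in inventory
        if agent.personal_data
        and (agent.consent_reference is None or agent.audit_log_location is None)
    ]
```

Run against a real inventory, a check like this gives security and compliance teams the same short list to work from: the agents that can reach personal data but cannot yet show why or prove what they did.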
What to do before you ship
Turn the three lessons into a short pre-launch checklist. Build evaluation at the end-to-end, trajectory, and component levels, and gate launch on multi-run reliability, not a single good demo. Harden the context, tools, and data before you add autonomy, and prefer fewer, well-described tools over many flaky ones. Put security and oversight in from the start: scoped identities, tool allowlisting, an audit trail, human checkpoints on high-privilege actions, and an adversarial review. Track cost per task as a first-class metric so a looping agent cannot quietly drain the budget. None of this is glamorous, and all of it is what separates an agent that ships from one that joins the 40% Gartner expects to be canceled. The teams that win in 2026 treat an agent as a production system to be engineered, not a model to be prompted. Start narrow, on one workflow you can measure, prove reliability and safety there, and earn the right to expand rather than launching broad and hoping.
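The first and last items on that checklist can be expressed as a single launch gate. In the sketch below, the thresholds (at least 50 runs, a 90% success rate, a 50-cent cap per task) are placeholders we chose for illustration; the right values depend on the workflow and its business case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    succeeded: bool
    cost_cents: int


def launch_gate(
    runs: list[RunRecord],
    min_runs: int = 50,
    min_success_rate: float = 0.90,
    max_cost_cents: int = 50,
) -> tuple[bool, str]:
    """Decide whether evaluation runs justify a launch, and say why not if they do not."""
    if len(runs) < min_runs:
        return False, f"only {len(runs)} runs; need at least {min_runs}"
    success_rate = sum(r.succeeded for r in runs) / len(runs)
    if success_rate < min_success_rate:
        return False, f"success rate {success_rate:.1%} is below {min_success_rate:.0%}"
    worst_cost = max(r.cost_cents for r in runs)  # one looping run is enough to fail
    if worst_cost > max_cost_cents:
        return False, f"worst run cost {worst_cost} cents; cap is {max_cost_cents}"
    return True, "ok"
```

The gate checks the worst run's cost rather than the average on purpose: an average hides the single looping run that the cost lesson is about.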
FAQ
How eCorpIT can help
eCorpIT is a CMMI Level 5, senior-led engineering organisation in Gurugram that builds and ships enterprise AI agents. We bring the production discipline these lessons demand: multi-level evaluation harnesses, hardened tool and data integration, and a security posture of scoped identities, allowlisting, audit trails, and human checkpoints, designed to meet DPDP requirements. If you are moving an AI agent from pilot to production and want it to survive contact with real users, talk to our team or read more about how we work.
_Last updated: 25 June 2026._