Summary. Shipping an AI agent to production is mostly an engineering problem, not a model problem. The data is sobering: MIT found that 95% of generative AI pilots deliver no measurable result, Gartner expects more than 40% of agentic projects to be canceled by 2027, and LangChain's State of AI Agents survey of 1,340 respondents (18 November to 2 December 2025) found that while 57% of teams have agents in production and 89% have observability, only 52.4% run evaluations and 32% name quality as the top barrier. Klarna's reversal is the cautionary tale every delivery team should know: its agent did the work of about 700 staff for roughly $40 million in benefit, and the company then had to bring humans back for the hard cases. At eCorpIT, founded in 2021 and assessed at CMMI Level 5, our agent delivery work points to the same five lessons the industry data now confirms: narrow the scope, engineer the harness, instrument from day one, layer the guardrails, and keep a human where errors are costly. Here is each, with the failure it prevents.
Andrew Ng, founder of DeepLearning.AI, called the shift early: "I think AI agentic workflows will drive massive AI progress this year, perhaps even more than the next generation of foundation models." He was right about the potential. What the past year taught delivery teams is that the gap between a demo and a dependable production agent is paved with unglamorous engineering. These are the five lessons that close it.
| Lesson | The failure it prevents | The control |
|---|---|---|
| Narrow the scope | Pilots that never reach P&L | One bounded workflow with a tracked metric |
| Engineer the harness | Brittle, unpredictable agents | Validated tool calls, layered architecture |
| Instrument from day one | Silent quality decay | Observability plus evals on real traces |
| Layer the guardrails | Runaway loops, denial-of-wallet | Five guardrail layers, budgets, least privilege |
| Keep a human in the loop | Costly errors in high-stakes tasks | Approval gates where error cost is high |
The production maturity gap
The market has moved from "can we build an agent" to "can we keep it reliable." LangChain's survey shows adoption is mainstream but maturity is not, and the gap between the two is exactly where these lessons live.
| Metric | Share | What it tells you |
|---|---|---|
| Teams with agents in production | 57% | Adoption is mainstream |
| Large enterprises (10K+ staff) in production | 67% | Big firms lead deployment |
| Teams with observability in place | 89% | Monitoring is near-universal |
| Teams running offline evaluations | 52.4% | Eval discipline lags badly |
| Teams naming quality the top barrier | 32% | Reliability is the real wall |
The shape of this data is the whole story. Almost everyone monitors, barely half evaluate, and a third are blocked by quality. The teams stuck in pilots are not short on models; they are short on the engineering habits below. Each lesson maps to one of these gaps.
Lesson 1: Narrow the scope until it ships
The single biggest predictor of an agent reaching production is a small, well-defined job. The 95% of pilots that fail to show measurable results, per MIT, almost always tried to automate too much at once. The agents that ship solve one bounded workflow with a metric you already track, so the before-and-after is undeniable.
We treat the first version of any agent as a scoping exercise, not a capability demo. Pick the workflow where the inputs are structured, the success metric exists, and the cost of a wrong answer is survivable. Prove it there, instrument the value, then expand to the adjacent workflow. The teams that try to boil the ocean join the more than 40% of agentic projects Gartner expects to be canceled by 2027. We expand on picking that first workflow in our notes on enterprise AI agent use cases that reached production.
Lesson 2: The harness is the product, not the model
Most agents fail in production not because the underlying model is weak, but because the harness around it is brittle, insecure or unpredictable. The lesson that took us the longest to internalise is to treat the agent as a software system first and an AI product second, applying version control, automated testing, deployment pipelines and SRE practices to every layer.
The concrete rule that prevents the most incidents: never let the model call tools directly. The model returns a structured tool call; the harness validates the schema, checks permissions, executes the action, and injects the result back. Production engineering guides recommend separating the reasoning, action-execution and validation layers so each can be tested and iterated independently. A model that proposes and a harness that disposes is far safer than a model with its hands directly on your systems. The most useful mental model is to treat agents as distributed-systems components, with the same expectations of retries, idempotency, timeouts and graceful degradation you would demand of any service that calls other services. This separation is the same discipline behind the platform engineering and modernization patterns we apply elsewhere.
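As an illustration of the propose-then-validate rule, here is a minimal sketch of a harness step in Python. It is not a description of any particular framework: `ToolSpec`, `ToolCallRejected`, `execute_tool_call`, the `lookup_order` tool and the `orders:read` scope are all hypothetical names chosen for the example, and the schema check is deliberately simple (exact argument names plus a type check).

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical tool registration: handler, argument types, required scope."""
    handler: Callable[..., Any]
    arg_types: dict[str, type]
    required_scope: str


class ToolCallRejected(Exception):
    """Raised when a model-proposed tool call fails validation."""


def execute_tool_call(call: dict, registry: dict[str, ToolSpec],
                      granted_scopes: set[str]) -> dict:
    """Validate a proposed tool call, execute it, and return the result message."""
    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in registry:
        raise ToolCallRejected(f"unknown tool: {name!r}")
    spec = registry[name]
    if not isinstance(args, dict):
        raise ToolCallRejected("arguments must be an object")
    # Schema check: exactly the expected argument names, each of the expected type.
    if set(args) != set(spec.arg_types):
        raise ToolCallRejected(
            f"expected arguments {sorted(spec.arg_types)}, got {sorted(args)}")
    for key, expected in spec.arg_types.items():
        if not isinstance(args[key], expected):
            raise ToolCallRejected(f"argument {key!r} must be {expected.__name__}")
    # Permission check: the agent must hold the scope this tool requires.
    if spec.required_scope not in granted_scopes:
        raise ToolCallRejected(f"missing scope: {spec.required_scope}")
    # Only now does the harness, not the model, run the action.
    result = spec.handler(**args)
    return {"role": "tool", "name": name, "content": result}


def lookup_order(order_id: str) -> str:
    """Hypothetical read-only tool used for the example."""
    return f"order {order_id}: shipped"


REGISTRY = {"lookup_order": ToolSpec(lookup_order, {"order_id": str}, "orders:read")}
```

Because the model only ever produces the `call` dictionary, each rejection path can be unit-tested without invoking a model at all, which is the practical payoff of keeping the validation layer separate.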
Lesson 3: Instrument from day one
There is a revealing gap in the LangChain data: 89% of teams have observability, but only 52.4% run evaluations. Observability tells you what happened; evals tell you whether it was right. Both are needed, and the second is where most teams are behind.
We log every tool call and monitor every token from the first commit, because the alternative is debugging a non-deterministic system blind. Just as important, we build the evaluation set from real failure traces, not from handpicked prompts: last week's production failures become this week's regression tests. That practice is what catches silent quality decay, the kind that erodes an agent until users quietly stop trusting it. Given that 32% of teams in the LangChain survey name quality as their top production barrier, eval discipline is not optional polish; it is the difference between an agent that holds up and one that drifts.
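The "last week's failures become this week's regression tests" loop can be sketched in a few lines. This is an assumption-laden illustration: the `Trace` and `EvalCase` shapes are hypothetical, the traces are assumed to have been reviewed by a human who supplied a corrected answer, and exact string match stands in for whatever grader a real evaluation pipeline would use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Trace:
    """Hypothetical logged production trace after human review."""
    trace_id: str
    user_input: str
    agent_output: str
    passed_review: bool
    corrected_output: str  # what the reviewer says the answer should have been


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    user_input: str
    expected: str


def build_regression_set(traces: list[Trace]) -> list[EvalCase]:
    """Turn every failed trace that has a reviewer correction into an eval case."""
    return [
        EvalCase(t.trace_id, t.user_input, t.corrected_output)
        for t in traces
        if not t.passed_review and t.corrected_output
    ]


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the pass rate of `agent` over `cases` (exact match, as an assumption)."""
    if not cases:
        return 1.0
    passed = sum(
        1 for c in cases if agent(c.user_input).strip() == c.expected.strip()
    )
    return passed / len(cases)
```

Running `run_evals` against the candidate agent on every change, and tracking the pass rate over time, is one simple way to make silent quality decay visible before users notice it.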
Lesson 4: Layer the guardrails, cap the spend
A single agent failure mode can be expensive in a way traditional software is not. One flaky API can turn an agent into a self-inflicted denial-of-wallet attack, looping and burning tokens until someone notices the bill. The discipline that prevents this is defence in depth: stack independent guardrail layers so that if one fails, another catches the problem.
| Guardrail layer | What it catches | Example control |
|---|---|---|
| Input guards | Malicious or malformed requests | Validate and sanitise inputs |
| Tool and action gating | Unauthorised or risky actions | Per-tool approval, least privilege |
| Output guards | Unsafe or off-policy responses | Filter and validate outputs |
| Human approval | High-stakes state changes | Confirmation step before commit |
| Evals as feedback | Regressions over time | Continuous evaluation on traces |
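To make the idea of independent, stacked layers concrete, the sketch below chains three of the five layers from the table (input guard, action gating, human approval) as separate functions. Output guards and evals are omitted for brevity. The request fields, the 4,000-character limit and the function names are hypothetical choices for the example, not a prescribed interface.

```python
from typing import Callable, Optional

# Each guard inspects a request and returns a rejection reason, or None to pass.
Guard = Callable[[dict], Optional[str]]


def input_guard(request: dict) -> Optional[str]:
    """Reject empty or oversized input (the length limit is an assumption)."""
    text = request.get("text", "")
    if not text.strip() or len(text) > 4000:
        return "input: empty or too long"
    return None


def action_gate(request: dict) -> Optional[str]:
    """Least privilege: the action must be on this agent's allow-list."""
    if request.get("action") not in request.get("allowed_actions", set()):
        return "action: not permitted for this agent"
    return None


def approval_gate(request: dict) -> Optional[str]:
    """High-stakes actions need a recorded human approver before commit."""
    if request.get("high_stakes") and not request.get("approved_by"):
        return "approval: human sign-off required"
    return None


def run_guards(request: dict, guards: list[Guard]) -> list[str]:
    """Run every guard independently; an empty list means the request may proceed."""
    reasons = []
    for guard in guards:
        reason = guard(request)
        if reason is not None:
            reasons.append(reason)
    return reasons
```

Each guard is evaluated regardless of what the others return, so a bug or gap in one layer does not disable the rest, which is the point of defence in depth.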
The failure modes multiply once agents coordinate. Common ones include selecting the wrong cluster, environment or account; conflicting decisions between agents in a multi-agent system; and unauthorised state changes with no approval gate. Each is an argument for tighter action gating and least privilege, not for a more capable model. A more capable model with the same loose permissions simply fails faster and more confidently.
Beyond the layers, two hard limits earn their keep. Enforce timeouts and a maximum number of steps, for example no more than 12 tool calls or 90 seconds per task, to stop runaway loops, and set a per-task spend budget. Treat agent identity as first-class too: use service principals, short-lived tokens and least-privilege scopes, because an agent inherits whatever permissions you grant it, and most incidents trace back to an agent that could do more than it should. Our work on enterprise AI agent governance layers goes deeper on this control plane.
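A per-task budget of this kind can be a very small object. The sketch below uses the 12-call and 90-second figures from the text as defaults; the $0.50 spend cap and the names `TaskBudget`, `BudgetExceeded` and `charge` are hypothetical, and the cost passed to `charge` is assumed to be an estimate the harness computes for each call.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a task goes over its step, time, or spend limit."""


@dataclass
class TaskBudget:
    max_steps: int = 12          # from the text: no more than 12 tool calls
    max_seconds: float = 90.0    # from the text: 90 seconds per task
    max_spend_usd: float = 0.50  # hypothetical per-task spend cap
    steps: int = 0
    spend_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, cost_usd: float) -> None:
        """Record one tool call and its cost; raise if any limit is now exceeded."""
        self.steps += 1
        self.spend_usd += cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit of {self.max_steps} exceeded")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise BudgetExceeded(f"time limit of {self.max_seconds}s exceeded")
        if self.spend_usd > self.max_spend_usd:
            raise BudgetExceeded(f"spend limit of ${self.max_spend_usd:.2f} exceeded")
```

The harness would call `budget.charge(estimated_cost)` before each tool call and treat `BudgetExceeded` as a signal to stop the task and escalate, so a flaky dependency ends in a bounded failure rather than an open-ended bill.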
Lesson 5: Keep a human where the cost of error is high
The most expensive lesson in the market belongs to Klarna. Its AI agent did the work of roughly 700 staff and was credited with about $40 million in annual benefit; after quality slipped, the company rehired humans for disputes, complex refunds and hardship cases. The durable design is hybrid: the agent handles routine, high-volume work, and a human owns the decisions where a wrong answer is costly.
This is not a temporary limitation to engineer away. The International AI Safety Report 2026, led by Yoshua Bengio and authored by more than 100 experts, concludes that agents complement rather than replace humans, and that the asymmetry between agent capability and human oversight is one of the most urgent challenges in AI safety. We design the human handoff before launch, not after a quality incident, and we put approval gates exactly where the cost of an error rises: money movement, irreversible changes, anything touching health or legal outcomes.
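One way to decide the handoff before launch is to tag every action the agent can take and route on those tags. The sketch below is a minimal illustration under that assumption: the categories come from the paragraph above, while the tag strings, `Route` and `route_action` are hypothetical names.

```python
from enum import Enum


class Route(Enum):
    AUTO = "auto"
    HUMAN_APPROVAL = "human_approval"


# Categories named in the text; the tag strings themselves are hypothetical.
HIGH_STAKES_TAGS = frozenset({"money_movement", "irreversible", "health", "legal"})


def route_action(tags: set[str]) -> Route:
    """Send an action to a human approver if it carries any high-stakes tag."""
    if tags & HIGH_STAKES_TAGS:
        return Route.HUMAN_APPROVAL
    return Route.AUTO
```

Keeping the high-stakes set as explicit data rather than model judgement means the approval boundary can be reviewed and audited like any other configuration.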
India-specific considerations
For Indian engineering teams, the economics behind these lessons differ slightly. The labour-cost arbitrage that made Klarna's headcount story dramatic is smaller where salaries are lower, so the strongest first-agent cases in India are usually internal: developer tooling, document processing and helpdesks, where value shows up as engineer hours and cycle time rather than rupees saved on staffing. The guardrail and least-privilege discipline matters more, not less, because agents that touch customer data fall under the DPDP regime, and our DPDP consent manager guidance applies to any agent that reads or routes personal data. Build the spend caps and audit logs in from the first sprint; retrofitting them after a budget surprise or a data-protection query is far more expensive than designing them in.
Putting the five together
The pattern across all five lessons is that production reliability comes from engineering discipline, not model choice. Narrow scope makes the problem tractable; a validated harness makes the agent safe; observability and evals make quality visible; layered guardrails and budgets make failures contained; and human oversight makes the high-stakes cases survivable. None of this is exotic. It is ordinary software engineering applied honestly to a non-deterministic component. The teams that ship are the ones that resist the temptation to skip these steps because the demo already looked impressive. The demo is not the product. The harness, the evals and the guardrails are.
In practice we run a short readiness check before any agent goes live: is the scope a single workflow with a tracked metric; does every tool call pass through schema and permission validation; is there eval coverage built from real failure traces; are the five guardrail layers, step limits and a spend budget in place; and is there a human approval gate on every high-stakes action. An agent that cannot answer yes to all five is not ready, however good its demo looked. That checklist is unglamorous, and it is precisely what separates the 57% of teams running agents in production from the third still blocked by quality.
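The five-question readiness check can be encoded as data so that it gates a release rather than living in a document. The sketch below is one hypothetical encoding; the field names paraphrase the five questions above and are not taken from any existing tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Readiness:
    """One boolean per readiness question (field names are hypothetical)."""
    single_workflow_with_metric: bool
    tool_calls_validated: bool
    evals_from_real_failures: bool
    guardrails_limits_budget: bool
    human_gate_on_high_stakes: bool


def failing_checks(r: Readiness) -> list[str]:
    """Return the names of every check that is not yet satisfied."""
    return [f.name for f in fields(r) if not getattr(r, f.name)]


def is_ready(r: Readiness) -> bool:
    """An agent is ready only if all five checks pass."""
    return not failing_checks(r)
```

Returning the names of the failing checks, rather than a bare boolean, gives the team a concrete list of what to fix before go-live.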
How eCorpIT can help
eCorpIT is a Gurugram-based technology organisation with senior-led engineering teams that ship production AI agents for enterprise clients. We scope the first workflow, build the validated harness, stand up observability and evaluation pipelines, and design the layered guardrails, spend controls and human-in-the-loop handoffs that keep agents reliable in production. Founded in 2021 and assessed at CMMI Level 5, we treat agent delivery as disciplined software engineering rather than a demo. To scope a production agent for your team, contact us.
_Last updated: 26 June 2026._