5 hard-won lessons from shipping production AI agents in 2026

Five engineering lessons from shipping production AI agents: narrow scope, harness over model, observability and evals, layered guardrails, and human oversight.

By Manu Shukla

Founder & Director · June 26, 2026

Production agents are made reliable by the harness, evals and guardrails around the model.

Summary. Shipping an AI agent to production is mostly an engineering problem, not a model problem. The data is sobering: MIT found that 95% of generative AI pilots produce no measurable result, Gartner expects more than 40% of agentic projects to be canceled by 2027, and LangChain's State of AI Agents survey of 1,340 respondents (18 November to 2 December 2025) found that while 57% of teams have agents in production and 89% have observability, only 52.4% run evaluations and 32% name quality as the top barrier. Klarna is the cautionary tale every delivery team should know: its agent did the work of about 700 staff for roughly $40 million in benefit, and the company still had to bring humans back for the hard cases. At eCorpIT, founded in 2021 and assessed at CMMI Level 5, our agent delivery work points to the same five lessons the industry data now confirms: narrow the scope, engineer the harness, instrument from day one, layer the guardrails, and keep a human where errors are costly. Here is each, with the failure it prevents.

Andrew Ng, founder of DeepLearning.AI, called the shift early: "I think AI agentic workflows will drive massive AI progress this year, perhaps even more than the next generation of foundation models." He was right about the potential. What the past year taught delivery teams is that the gap between a demo and a dependable production agent is bridged by unglamorous engineering. These are the five lessons that close it.

Lesson	The failure it prevents	The control
Narrow the scope	Pilots that never reach P&L	One bounded workflow with a tracked metric
Engineer the harness	Brittle, unpredictable agents	Validated tool calls, layered architecture
Instrument from day one	Silent quality decay	Observability plus evals on real traces
Layer the guardrails	Runaway loops, denial-of-wallet	Five guardrail layers, budgets, least privilege
Keep a human in the loop	Costly errors in high-stakes tasks	Approval gates where error cost is high

The production maturity gap

The market has moved from "can we build an agent" to "can we keep it reliable." LangChain's survey shows adoption is mainstream but maturity is not, and the gap between the two is exactly where these lessons live.

Metric	Share	What it tells you
Teams with agents in production	57%	Adoption is mainstream
Large enterprises (10K+ staff) in production	67%	Big firms lead deployment
Teams with observability in place	89%	Monitoring is near-universal
Teams running offline evaluations	52.4%	Eval discipline lags badly
Teams naming quality the top barrier	32%	Reliability is the real wall

The shape of this data is the whole story. Almost everyone monitors, barely half evaluate, and a third are blocked by quality. The teams stuck in pilots are not short on models; they are short on the engineering habits below. Each lesson maps to one of these gaps.

Lesson 1: Narrow the scope until it ships

The single biggest predictor of an agent reaching production is a small, well-defined job. In our experience, the pilots behind MIT's 95% figure most often tried to automate too much at once. The agents that ship solve one bounded workflow with a metric you already track, so the before-and-after is undeniable.

We treat the first version of any agent as a scoping exercise, not a capability demo. Pick the workflow where the inputs are structured, the success metric exists, and the cost of a wrong answer is survivable. Prove it there, instrument the value, then expand to the adjacent workflow. The teams that try to boil the ocean join the more than 40% of agentic projects Gartner expects to be canceled by 2027. We expand on picking that first workflow in our notes on enterprise AI agent use cases that reached production.

Lesson 2: The harness is the product, not the model

Most agents fail in production not because the underlying model is weak, but because the harness around it is brittle, insecure or unpredictable. The lesson that took us the longest to internalise is to treat the agent as a software system first and an AI product second, applying version control, automated testing, deployment pipelines and SRE practices to every layer.

The concrete rule that prevents the most incidents: never let the model call tools directly. The model returns a structured tool call; the harness validates the schema, checks permissions, executes the action, and injects the result back. Production engineering guides recommend separating the reasoning, action-execution and validation layers so each can be tested and iterated independently. A model that proposes and a harness that disposes is far safer than a model with its hands directly on your systems. The mental model that helps most is to treat agents as distributed-systems components, with the same expectations of retries, idempotency, timeouts and graceful degradation you would demand of any service that calls other services. This separation is the same discipline behind the platform engineering and modernization patterns we apply elsewhere.
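
As an illustration of the propose-and-dispose rule, here is a minimal Python sketch of a harness that validates a model's proposed tool call before running it. Everything named here is hypothetical and chosen for the example: `ToolSpec`, `execute_tool_call`, the `lookup_order` tool and the `orders:read` scope. The type check is a deliberately simple stand-in for whatever schema validation your stack already uses.

```python
from dataclasses import dataclass
from typing import Any, Callable


class ToolCallRejected(Exception):
    """Raised when a proposed tool call fails validation."""


@dataclass(frozen=True)
class ToolSpec:
    handler: Callable[..., Any]   # function that performs the action
    arg_types: dict[str, type]    # required argument name -> expected type
    scope: str                    # permission the caller must hold


def execute_tool_call(call: dict, registry: dict[str, ToolSpec],
                      granted_scopes: set[str]) -> dict:
    """Validate a model-proposed call, then execute it and wrap the result."""
    name = call.get("name")
    args = call.get("arguments", {})

    spec = registry.get(name)
    if spec is None:
        raise ToolCallRejected(f"unknown tool: {name!r}")

    # Permission check happens before any argument is looked at.
    if spec.scope not in granted_scopes:
        raise ToolCallRejected(f"missing scope {spec.scope!r} for {name!r}")

    # Schema check: exactly the declared arguments, each of the declared type.
    if not isinstance(args, dict) or set(args) != set(spec.arg_types):
        raise ToolCallRejected(f"arguments do not match schema for {name!r}")
    for key, expected in spec.arg_types.items():
        if not isinstance(args[key], expected):
            raise ToolCallRejected(f"argument {key!r} must be {expected.__name__}")

    # Only now does the harness touch a real system.
    return {"tool": name, "result": spec.handler(**args)}


def lookup_order(order_id: str) -> dict:
    """Hypothetical read-only tool used for the example."""
    return {"order_id": order_id, "status": "shipped"}


REGISTRY = {"lookup_order": ToolSpec(lookup_order, {"order_id": str}, "orders:read")}

proposed = {"name": "lookup_order", "arguments": {"order_id": "A1"}}
observation = execute_tool_call(proposed, REGISTRY, {"orders:read"})
```

The model only ever produces the `proposed` dictionary; the registry and the granted scopes live in the harness, so a tool or permission the harness does not know about cannot be reached by prompt alone.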

Lesson 3: Instrument from day one

There is a revealing gap in the LangChain data: 89% of teams have observability, but only 52.4% run evaluations. Observability tells you what happened; evals tell you whether it was right. Both are needed, and the second is where most teams are behind.

We log every tool call and monitor every token from the first commit, because the alternative is debugging a non-deterministic system blind. Just as important, we build the evaluation set from real failure traces, not from handpicked prompts: last week's production failures become this week's regression tests. That practice is what catches silent quality decay, the kind that erodes an agent until users quietly stop trusting it. Given that 32% of teams in the LangChain survey name quality as their top production barrier, eval discipline is not optional polish; it is the difference between an agent that holds up and one that drifts.
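
One way to turn failure traces into regression tests is sketched below. The `FailureTrace` shape, the exact-match scoring and the function names are assumptions made for the example; real suites often need fuzzier scoring than string equality.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FailureTrace:
    trace_id: str
    user_input: str
    expected_output: str  # corrected answer added by a reviewer during triage


def build_regression_suite(traces: list[FailureTrace]) -> list[FailureTrace]:
    """Keep the first trace seen for each distinct input, in arrival order."""
    by_input: dict[str, FailureTrace] = {}
    for trace in traces:
        by_input.setdefault(trace.user_input, trace)
    return list(by_input.values())


def run_evals(agent: Callable[[str], str], suite: list[FailureTrace]) -> dict:
    """Replay each case through the agent and report which ones still fail."""
    failed = [case.trace_id for case in suite
              if agent(case.user_input).strip() != case.expected_output]
    passed = len(suite) - len(failed)
    return {
        "total": len(suite),
        "passed": passed,
        "pass_rate": passed / len(suite) if suite else 1.0,
        "failed_trace_ids": failed,
    }


# Hypothetical traces exported from last week's production logs.
last_week = [
    FailureTrace("t1", "refund status?", "pending"),
    FailureTrace("t2", "refund status?", "pending"),   # duplicate input
    FailureTrace("t3", "cancel order", "cancelled"),
]
suite = build_regression_suite(last_week)
report = run_evals(lambda question: "pending", suite)  # stub agent
```

Running a report like this in CI on every prompt or harness change is what makes a regression visible before users notice it.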

Lesson 4: Layer the guardrails, cap the spend

A single agent failure mode can be expensive in a way traditional software is not. One flaky API can turn an agent into a self-inflicted denial-of-wallet attack, looping and burning tokens until someone notices the bill. The discipline that prevents this is defence in depth: stack independent guardrail layers so that if one fails, another catches the problem.

Guardrail layer	What it catches	Example control
Input guards	Malicious or malformed requests	Validate and sanitise inputs
Tool and action gating	Unauthorised or risky actions	Per-tool approval, least privilege
Output guards	Unsafe or off-policy responses	Filter and validate outputs
Human approval	High-stakes state changes	Confirmation step before commit
Evals as feedback	Regressions over time	Continuous evaluation on traces
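
The first three layers in the table can be sketched as independent checks that each get a chance to block a request. This is a simplified Python illustration, not a complete policy: the request dictionary, the 4,000-character limit and the private-key marker are assumptions for the example, and in a real agent the input guard runs before the model while the output guard runs after it.

```python
from typing import Callable, Optional

# A guard returns a reason to block, or None to let the request continue.
Guard = Callable[[dict], Optional[str]]


def input_guard(req: dict) -> Optional[str]:
    text = req.get("user_input", "")
    if not text.strip() or len(text) > 4000:
        return "input: empty or too long"
    return None


def action_gate(req: dict) -> Optional[str]:
    if req.get("tool") not in req.get("allowed_tools", ()):
        return "action: tool not permitted"
    return None


def output_guard(req: dict) -> Optional[str]:
    if "BEGIN PRIVATE KEY" in req.get("draft_output", ""):
        return "output: looks like a secret"
    return None


LAYERS: list[Guard] = [input_guard, action_gate, output_guard]


def first_block(req: dict, layers: list[Guard] = LAYERS) -> Optional[str]:
    """Run every layer in order and return the first reason to block, if any."""
    for guard in layers:
        reason = guard(req)
        if reason is not None:
            return reason
    return None


request = {
    "user_input": "Where is my order?",
    "tool": "lookup_order",
    "allowed_tools": ["lookup_order"],
    "draft_output": "Your order shipped yesterday.",
}
decision = first_block(request)
```

Because each guard knows nothing about the others, a gap in one layer (say, an input filter that misses a prompt injection) can still be caught by the action gate or the output guard.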

The failure modes multiply once agents coordinate. Common ones include selecting the wrong cluster, environment or account; conflicting decisions between agents in a multi-agent system; and unauthorised state changes with no approval gate. Each is an argument for tighter action gating and least privilege, not for a more capable model. A more capable model with the same loose permissions simply fails faster and more confidently.

Beyond the layers, two hard limits earn their keep. Enforce timeouts and a maximum number of steps, on the order of no more than 12 tool calls or 90 seconds per task, to stop runaway loops, and set a per-task spend budget. Treat agent identity as first-class too: service principals, short-lived tokens and least-privilege scopes, because an agent inherits whatever permissions you grant it, and most incidents trace back to an agent that could do more than it should. Our work on enterprise AI agent governance layers goes deeper on this control plane.
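
The step, time and spend limits can live in one small object that the harness charges on every tool call. The sketch below uses the 12-step and 90-second figures from above; the $0.50 spend cap and the names `TaskBudget` and `BudgetExceeded` are assumptions for the example.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a task crosses its step, time or spend limit."""


@dataclass
class TaskBudget:
    max_steps: int = 12
    max_seconds: float = 90.0
    max_spend_usd: float = 0.50   # assumed cap; set per workflow
    steps: int = 0
    spend_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, cost_usd: float) -> None:
        """Record one tool call and raise if any limit is now exceeded."""
        self.steps += 1
        self.spend_usd += cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit of {self.max_steps} exceeded")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise BudgetExceeded(f"time limit of {self.max_seconds}s exceeded")
        if self.spend_usd > self.max_spend_usd:
            raise BudgetExceeded(f"spend limit of ${self.max_spend_usd:.2f} exceeded")


def run_looping_agent(budget: TaskBudget) -> str:
    """Simulate an agent stuck retrying a flaky tool until the budget stops it."""
    try:
        while True:
            budget.charge(cost_usd=0.01)  # each retry costs a little
    except BudgetExceeded as stop:
        return f"halted: {stop}"


outcome = run_looping_agent(TaskBudget())
```

With the defaults, the simulated loop is stopped on its thirteenth call by the step limit, long before the spend cap is reached. In a real agent the three limits back each other up: a slow tool trips the timeout, an expensive one trips the spend cap.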

Lesson 5: Keep a human where the cost of error is high

The most expensive lesson in the market belongs to Klarna. Its AI agent did the work of roughly 700 staff and was credited with about $40 million in annual benefit, but after quality slipped the company rehired humans for disputes, complex refunds and hardship cases. The durable design is hybrid: the agent handles routine, high-volume work, and a human owns the decisions where a wrong answer is costly.

This is not a temporary limitation to engineer away. The International AI Safety Report 2026, led by Yoshua Bengio and authored by more than 100 experts, concludes that agents complement rather than replace humans, and that the asymmetry between agent capability and human oversight is one of the most urgent challenges in AI safety. We design the human handoff before launch, not after a quality incident, and we put approval gates exactly where the cost of an error rises: money movement, irreversible changes, anything touching health or legal outcomes.
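
An approval gate of this kind can be very small. In the Python sketch below, the category names and the `approver` callback are assumptions for the example; in practice the callback would open a ticket or a review queue rather than return immediately.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed categories where a wrong action is expensive or irreversible.
HIGH_STAKES = {"money_movement", "irreversible_change", "health", "legal"}


@dataclass(frozen=True)
class ProposedAction:
    description: str
    category: str
    execute: Callable[[], str]  # runs the action and returns a summary


def run_with_gate(action: ProposedAction,
                  approver: Callable[[ProposedAction], bool]) -> str:
    """Execute routine actions directly; ask a human first for high-stakes ones."""
    if action.category in HIGH_STAKES and not approver(action):
        return "rejected: human reviewer declined"
    return action.execute()


refund = ProposedAction(
    description="Refund order A1",
    category="money_movement",
    execute=lambda: "refund issued",
)
faq_reply = ProposedAction(
    description="Answer a shipping question",
    category="routine",
    execute=lambda: "reply sent",
)

declined = run_with_gate(refund, approver=lambda a: False)
approved = run_with_gate(refund, approver=lambda a: True)
routine = run_with_gate(faq_reply, approver=lambda a: False)
```

The routine reply never reaches the reviewer, which keeps the human's attention on the cases where it matters.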

India-specific considerations

For Indian engineering teams, the economics change how these lessons apply. The labour-cost arbitrage that made Klarna's headcount story dramatic is smaller where salaries are lower, so the strongest first-agent cases in India are usually internal: developer tooling, document processing and helpdesks, where value shows up as engineer hours and cycle time rather than rupees saved on staffing. Guardrail and least-privilege discipline matters more, not less, because agents that touch customer data fall under the Digital Personal Data Protection (DPDP) regime, and our DPDP consent manager guidance applies to any agent that reads or routes personal data. Build the spend caps and audit logs in from the first sprint; retrofitting them after a budget surprise or a data-protection query is far more expensive than designing them in.

Putting the five together

The pattern across all five lessons is that production reliability comes from engineering discipline, not model choice. Narrow scope makes the problem tractable; a validated harness makes the agent safe; observability and evals make quality visible; layered guardrails and budgets make failures contained; and human oversight makes the high-stakes cases survivable. None of this is exotic. It is ordinary software engineering applied honestly to a non-deterministic component. The teams that ship are the ones that resist the temptation to skip these steps because the demo already looked impressive. The demo is not the product. The harness, the evals and the guardrails are.

In practice we run a short readiness check before any agent goes live: is the scope a single workflow with a tracked metric; does every tool call pass through schema and permission validation; is there eval coverage built from real failure traces; are the five guardrail layers, step limits and a spend budget in place; and is there a human approval gate on every high-stakes action. An agent that cannot answer yes to all five is not ready, however good its demo looked. That checklist is unglamorous, and it is precisely what separates the 57% of teams running agents in production from the roughly one in three that name quality as their top barrier.
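
The five questions are easy to encode as a release gate. This is a minimal Python sketch; the check names are our own labels for the questions above, and how each answer is established is left to the team.

```python
READINESS_CHECKS = (
    "single_workflow_with_tracked_metric",
    "tool_calls_schema_and_permission_validated",
    "evals_built_from_real_failure_traces",
    "guardrail_layers_step_limits_and_spend_budget",
    "human_approval_on_high_stakes_actions",
)


def readiness_gaps(answers: dict[str, bool]) -> list[str]:
    """Return the checks that are missing or answered 'no', in checklist order."""
    return [check for check in READINESS_CHECKS if not answers.get(check, False)]


def is_ready(answers: dict[str, bool]) -> bool:
    """An agent is ready only when every check is answered 'yes'."""
    return not readiness_gaps(answers)


# Hypothetical agent that has everything except trace-based evals.
answers = {check: True for check in READINESS_CHECKS}
answers["evals_built_from_real_failure_traces"] = False
gaps = readiness_gaps(answers)
```

Treating an unanswered question as a "no" is deliberate: a check nobody has looked at should block the launch just as a failed one does.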

How eCorpIT can help

eCorpIT is a Gurugram-based technology organisation with senior-led engineering teams that ship production AI agents for enterprise clients. We scope the first workflow, build the validated harness, stand up observability and evaluation pipelines, and design the layered guardrails, spend controls and human-in-the-loop handoffs that keep agents reliable in production. Founded in 2021 and assessed at CMMI Level 5, we treat agent delivery as disciplined software engineering rather than a demo. To scope a production agent for your team, contact us.

References

State of AI Agents survey, LangChain

Gartner predicts over 40% of agentic AI projects will be canceled by 2027

MIT report: 95% of generative AI pilots are failing, Fortune

Klarna reverses course and hires more humans, Entrepreneur

AI agents in production: engineering guide, Kenility

AI agent best practices: production-ready harness engineering

The 2026 playbook for agentic AI ops: guardrails, costs and reliability, ICMD

AI agent risks and guardrails: enterprise security guide, Atlan

International AI Safety Report 2026

Andrew Ng on AI agentic workflows

Building production-ready AI agents in 2026, MLflow

Five guides to building and scaling production-ready AI agents, Google Cloud

_Last updated: 26 June 2026._

Frequently asked

Quick answers.

01 Why do most enterprise AI agents fail to reach production?

MIT found 95% of generative AI pilots show no measurable result, and Gartner expects over 40% of agentic projects to be canceled by 2027. The causes are rarely model quality. They are over-broad scope, missing evaluation, weak guardrails and no tracked metric. Agents that ship solve one bounded workflow with a clear before-and-after measure.

02 What does "the harness is the product" mean?

It means the engineering around the model determines reliability more than the model itself. The harness validates the model's tool calls, checks permissions, executes actions and feeds results back, instead of letting the model touch systems directly. Treating the agent as a software system, with version control, tests and SRE practices, prevents most production incidents.

03 How important are evaluations versus observability?

Both are needed, and evals are where teams lag. LangChain found 89% of teams have observability but only 52.4% run evaluations. Observability shows what the agent did; evals show whether it was correct. Building an evaluation set from real failure traces, not handpicked prompts, is what catches the silent quality decay that erodes user trust.

04 What guardrails do production agents need?

Stack independent layers: input guards, tool and action gating, output guards, human approval for high-stakes actions, and evals as a feedback loop. Add hard limits such as a maximum number of steps and a per-task spend budget to stop runaway loops, and enforce least privilege with short-lived tokens so an agent cannot do more than its task requires.

05 What is a denial-of-wallet failure?

It is when an agent loops or retries uncontrollably, often triggered by one flaky API or tool, and burns tokens or API calls until the cost balloons. Because agents act autonomously, a small bug can become a large bill quickly. Timeouts, step limits and per-task spend budgets are the controls that prevent it.

06 When should a human stay in the loop?

Wherever the cost of an error is high: money movement, irreversible actions, and anything touching health, legal or safety outcomes. Klarna's reversal showed that removing humans from hard cases lowered quality. The International AI Safety Report 2026 concludes agents complement rather than replace humans, so design the handoff and approval gates before launch, not after an incident.

07 How do these lessons apply to Indian teams?

The economics favour internal first agents, such as developer tooling, document processing and helpdesks, where value shows up as engineer hours rather than headcount savings. Guardrails and least privilege matter more because customer-data agents fall under DPDP. Build spend caps, audit logs and consent handling in from the first sprint rather than retrofitting them later.

08 Is model choice irrelevant then?

Not irrelevant, but secondary. A better model raises the ceiling, yet production reliability is decided by scope, harness, evals, guardrails and oversight. Most teams over-invest in choosing a model and under-invest in the engineering around it. Get the harness and the guardrails right and a mid-tier model in a disciplined system beats a frontier model in a brittle one.

About the author

Manu Shukla

Founder & Director

Founder of eCorpIT. Hands-on engineer leading senior-only delivery for AI apps, custom software, and cloud systems for global clients.
