TL;DR
We built two AI agents with identical capabilities but different levels of supervision. One acted like an unsupervised intern; the other like a properly onboarded employee. The results validated a crucial insight: AI agents need job descriptions, KPIs, and guardrails, just like human employees.
The Problem We Wanted to Solve
Everyone's building AI agents. Most are building them wrong.
The typical approach: give an LLM some tools, write a prompt that says "you're a helpful assistant," and hope for the best. This works for demos. It fails in production.
We'd seen this pattern repeatedly:
- Agents that hallucinate data when they can't find it
- Reports with confident-sounding but unverified claims
- No way to know when the agent is guessing vs. when it actually knows
- Zero escalation. The agent never says "I'm not sure, a human should check this"
So we ran an experiment.
The Experiment Design
We built two AI agents using our internal agent framework. Both had:
- Identical job descriptions: Lead Generation Analyst for a web development agency
- Same tools: Web search, AI-powered research, browser automation, file writing
- Same task: Find potential customers who need web app development
The only difference was supervision level.
Agent 1: "The Overconfident Intern"
- No guardrails
- Encouraged to move fast
- Could make assumptions when data was missing
- No source citation requirements
Agent 2: "The Supervised New Hire"
- Mandatory confidence scores (HIGH/MEDIUM/LOW) for every claim
- Required source URLs for all factual statements
- Escalation rules: flag for human review when confidence < 70%
- Pre-output validation: reports blocked if missing citations
Figure: Overconfident agent (left) vs. supervised agent (right)
What We Built
Both agents received the same "new hire" treatment:
Job Description
Role: AI Lead Generation Analyst
Responsibilities:
- Find businesses actively seeking web app development
- Research prospect background, needs, and timing signals
- Qualify leads based on fit with our service
- Produce actionable sales briefs
KPIs
- 70%+ accuracy on company/contact information
- 100% source citation rate
- Zero unverified budget assumptions
- <30 minutes research time per prospect
90-Day Plan
- Days 1-30: Learn tools, practice with citations
- Days 31-60: Increase speed, refine pattern recognition
- Days 61-90: Achieve target metrics, minimal oversight
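In practice, this "new hire packet" is just the agents' shared base prompt. Here's a rough sketch of how the job description and KPIs above could be rendered into one system prompt; the field names and rendering are illustrative, not our exact framework:

# Illustrative only: assembling the job description and KPIs above
# into a single base prompt that both agents inherit.
JOB_DESCRIPTION = {
    "role": "AI Lead Generation Analyst",
    "responsibilities": [
        "Find businesses actively seeking web app development",
        "Research prospect background, needs, and timing signals",
        "Qualify leads based on fit with our service",
        "Produce actionable sales briefs",
    ],
    "kpis": [
        "70%+ accuracy on company/contact information",
        "100% source citation rate",
        "Zero unverified budget assumptions",
        "<30 minutes research time per prospect",
    ],
}

def build_base_prompt(jd: dict) -> str:
    """Render the job description into the system prompt shared by both agents."""
    lines = [f"Role: {jd['role']}", "", "Responsibilities:"]
    lines += [f"- {item}" for item in jd["responsibilities"]]
    lines += ["", "KPIs:"]
    lines += [f"- {item}" for item in jd["kpis"]]
    return "\n".join(lines)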
The supervised agent also received guardrails implemented as hooks: code that runs before every tool use.
Guardrail Hook Example
async def confidence_check_hook(input_data, tool_use_id, context):
    """Block lead reports that are missing source citations."""
    # Pull the tool name and the content being written from the hook payload.
    tool_name = input_data.get("tool_name", "")
    content = input_data.get("tool_input", {}).get("content", "")
    if tool_name == "Write" and "Lead:" in content:
        if "Sources" not in content or "http" not in content:
            return {
                "hookSpecificOutput": {
                    "permissionDecision": "deny",
                    "permissionDecisionReason": "Missing sources. Add URLs."
                }
            }
    return {}
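How a hook like this gets registered depends on the agent framework. The dispatcher below is a hypothetical sketch of the general pattern (run every pre-tool hook, and refuse the tool call if any hook denies it), not a specific SDK API:

# Hypothetical wiring, not a specific SDK API: run guardrail hooks before a tool call.
PRE_TOOL_HOOKS = [confidence_check_hook]

async def check_pre_tool_hooks(tool_name, tool_input, tool_use_id, context):
    """Return (allowed, reason). A single 'deny' from any hook blocks the tool call."""
    input_data = {"tool_name": tool_name, "tool_input": tool_input}
    for hook in PRE_TOOL_HOOKS:
        result = await hook(input_data, tool_use_id, context)
        decision = result.get("hookSpecificOutput", {})
        if decision.get("permissionDecision") == "deny":
            # The reason is fed back to the agent so it can revise the report.
            return False, decision.get("permissionDecisionReason")
    return True, None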
The Results
We ran both agents on the same lead generation task. Here's what happened:
Overconfident Agent Output
The agent found two leads and produced detailed sales briefs. Looking at one:
Lead: Awayte Inc (Pet Tech Startup)
- Fit Score: 9/10
- Overall Confidence: 85%
- Sources: 6 URLs cited
- Included: Company background, founder name, technical issues, specific outreach email template
Impressively thorough. But here's the catch: the agent also included this line:
"Budget Likelihood: MassChallenge alumni typically have $50K-$500K in resources"
That's an assumption. MassChallenge is a non-equity accelerator, so participants don't necessarily have that budget range. The agent filled in a gap with a plausible-sounding guess.
This is exactly the failure mode we predicted.
Supervised Agent Behaviour
The supervised agent, when attempting to write a similar report, would be blocked by our guardrails if:
- Budget claims lacked source URLs
- Confidence scores weren't included
- Too many LOW confidence ratings appeared without an escalation notice
The hooks enforce the discipline that the overconfident agent lacks.
Key Learnings
1. Job Descriptions Actually Matter
When we gave the agent a clear role definition (not just "you're a helpful assistant" but specific responsibilities, target customers, and success metrics), the output quality improved dramatically.
The agent knew:
- Who it was looking for (startups needing MVPs, companies with outdated websites)
- What signals to watch (job postings for developers, forum discussions about app ideas)
- What format to deliver (specific Sales Brief template)
2. Guardrails Prevent the Worst Failures
The overconfident agent made budget assumptions because nothing stopped it. The supervised agent couldn't. The hooks would reject the report.
This isn't about making the AI "dumber." It's about making failures visible and preventable.
3. Confidence Scoring Creates Accountability
Requiring the agent to rate its own confidence (HIGH/MEDIUM/LOW) for every claim creates a forcing function: the agent must actually assess what it knows vs. what it's inferring.
Confidence Breakdown:
- Company Info: HIGH (verified from Crunchbase, LinkedIn, company website)
- Contact Info: MEDIUM (name verified, email unconfirmed)
- Need Signal: HIGH (website is provably down)
- Timing: HIGH (current technical emergency)
This gives the human reviewer exactly what they need: where to focus verification effort.
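Because the breakdown follows a predictable "Category: LEVEL (note)" shape rather than free prose, it can also be parsed for downstream processing (see "Confidence as structure" below). A minimal sketch, assuming that line format:

import re
from dataclasses import dataclass

@dataclass
class ConfidenceItem:
    category: str  # e.g. "Contact Info"
    level: str     # HIGH / MEDIUM / LOW
    note: str      # e.g. "name verified, email unconfirmed"

# Matches lines like "- Contact Info: MEDIUM (name verified, email unconfirmed)"
LINE = re.compile(r"^\s*-?\s*(.+?):\s*(HIGH|MEDIUM|LOW)\s*\((.*)\)\s*$")

def parse_confidence_breakdown(text: str) -> list[ConfidenceItem]:
    """Turn a report's confidence section into structured items a review tool can sort and filter."""
    return [ConfidenceItem(*m.groups()) for line in text.splitlines() if (m := LINE.match(line))]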
4. Escalation is a Feature, Not a Bug
The supervised agent's instruction to escalate when confidence < 70% seems like a limitation. It's actually a superpower.
A human reviewing 10 high-confidence leads is efficient. A human reviewing 10 leads where 3 are flagged "ESCALATION REQUIRED - contact info unverified" knows exactly where to spend time.
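In code, that triage can be a few lines. A sketch, assuming each lead report carries an overall confidence score between 0 and 1 (the field names here are illustrative):

ESCALATION_THRESHOLD = 0.70  # mirrors the "escalate when confidence < 70%" rule

def triage_leads(leads: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split leads into ready-for-outreach vs. 'ESCALATION REQUIRED' for human review."""
    ready, flagged = [], []
    for lead in leads:
        bucket = ready if lead.get("confidence", 0.0) >= ESCALATION_THRESHOLD else flagged
        bucket.append(lead)
    return ready, flagged

# The reviewer starts with the flagged pile, e.g.:
# ready, flagged = triage_leads([{"name": "Awayte Inc", "confidence": 0.85},
#                                {"name": "Hypothetical Co", "confidence": 0.55}])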
The Technical Implementation
Built with:
- Python Agent SDK - For building production-grade AI agents
- Hooks - Pre/post tool execution callbacks for guardrails
- Web Search + Research APIs - For gathering information
- Browser Automation - For website verification
Key architectural decisions:
1. Shared base prompt: Both agents inherit identical job description/KPIs
2. Guardrails as hooks: Validation logic runs before every Write operation
3. Absolute paths: Agents need explicit file paths to reliably save outputs
4. Confidence as structure: Not just text, but a predictable format for downstream processing
What This Means for Production AI Agents
If you're building AI agents for real business use, consider:
Treat Prompts Like Job Descriptions
- What is this agent's role?
- What are its responsibilities?
- What does success look like?
- What should it NOT do?
Implement Guardrails as Code
- Don't rely on prompt instructions alone
- Use hooks/callbacks to enforce rules
- Block outputs that don't meet standards
Make Confidence Explicit
- Require the agent to rate its certainty
- Use structured formats, not just prose
- Flag low-confidence items for review
Design for Escalation
- Decide what needs human judgment
- Make escalation expected, not a failure
- Track escalation rate as a quality metric (see the sketch below)
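A sketch of that metric, assuming each report records whether it was escalated (the "escalated" flag is an assumption, matching the triage example earlier):

def escalation_rate(reports: list[dict]) -> float:
    """Share of reports flagged for human review, tracked over time as a quality metric."""
    if not reports:
        return 0.0
    return sum(1 for r in reports if r.get("escalated")) / len(reports)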
Conclusion
The difference between a useful AI agent and an expensive toy isn't the model's intelligence. It's the scaffolding around it.
Our experiment confirmed what we suspected: AI agents need the same onboarding structure as human employees. Job descriptions. Clear expectations. Supervision. Guardrails. Escalation paths.
The overconfident agent was faster. The supervised agent was trustworthy.
In production, trustworthy wins every time.
Ready to Build Production-Grade AI Agents?
At Agentive™, we apply these principles to every AI employee we deploy: job descriptions, KPIs, guardrails, and human oversight, all built in from day one.