
I spent years building my AI skills, starting before it was even called “AI”. I built machine learning pipelines, trained random forests (yes, I am that old), then moved to model fine-tuning, prompt engineering, and evaluation frameworks. As a data scientist turned AI engineer, I thought these were the core competencies that would define success in this field.
Then I started building actual AI agents for real users, and I had to accept a humbling truth: my AI skills mattered far less than I thought. What mattered was whether the domain experts I was working with could actually shape what we were building. And most of the time, they couldn't, because we didn't have the right ways to collaborate.
The shift AI has brought to software development goes beyond coding assistants and faster deployments. The more fundamental change is that the people who understand the problem domain can no longer sit on the sidelines. They need to be at the center of building AI agents, not just during requirements gathering, but throughout the entire development cycle.
This matters because AI agents are fundamentally different from traditional software. A checkout form either processes a payment or it doesn't. An agent that helps doctors summarize patient histories? It might produce something that looks plausible but misses a critical detail buried in the notes. You need a doctor to catch that, not a data scientist.
I've watched this play out across multiple projects, and talked to enough engineers to know it's not just me. Right now, most teams handle collaboration through spreadsheets. A product manager creates an Excel file with test cases. Engineers run them manually, paste results back in, add comments. Domain experts review the output days later. Someone forgets to update a cell. Version conflicts emerge. The whole process becomes a bottleneck.
It works, barely. But it doesn't scale when you're testing hundreds of conversation paths, each with multiple turns, tool calls, and decision points. I've been in meetings where we spent an hour reconciling which version of the test sheet was current.
Consider a customer service agent handling subscription cancellations. The happy path is straightforward: customer asks to cancel, agent processes it, confirms. But real conversations branch constantly. The customer is frustrated and complaining. They mention a billing issue from three months ago. They ask about pausing instead of canceling. They start a new topic mid-conversation.
A legal research agent has similar complexity. It needs to understand nuanced queries, search the right databases, synthesize multiple sources, cite correctly, and recognize when it doesn't have enough information. I can verify the technical execution, but only a lawyer can evaluate whether the legal reasoning holds up.
These scenarios reveal failure modes that traditional testing misses entirely. Does the agent maintain context across turns? Does it escalate appropriately when uncertain? Does it recover gracefully from errors? You can't answer these questions with unit tests checking input and output. You need to simulate full conversations and have domain experts evaluate the behavior.
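To make the contrast with input/output unit tests concrete, here is a minimal sketch of what conversation-level testing can look like. Everything in it is an assumption made for illustration: the Turn and ConversationTrace structures and the agent.respond interface aren't any particular framework's API. The point is that a test run produces a full trace a domain expert can read, not just a pass/fail boolean.

```python
# A minimal sketch of multi-turn simulation (all names are hypothetical).
# Instead of asserting on a single input/output pair, we replay a scripted
# conversation and record everything a domain expert would need to review.
from dataclasses import dataclass, field

@dataclass
class Turn:
    user_message: str
    agent_reply: str
    tool_calls: list = field(default_factory=list)

@dataclass
class ConversationTrace:
    scenario: str
    turns: list = field(default_factory=list)

def simulate(agent, scenario_name, user_messages):
    """Play a scripted conversation against the agent and capture the full trace."""
    trace = ConversationTrace(scenario=scenario_name)
    history = []
    for message in user_messages:
        history.append({"role": "user", "content": message})
        # `agent.respond` is assumed to return the reply text plus any tool calls it made.
        reply, tool_calls = agent.respond(history)
        history.append({"role": "assistant", "content": reply})
        trace.turns.append(Turn(message, reply, tool_calls))
    return trace
```

The artifact here is the trace itself: a doctor or lawyer reads the turns and tool calls and judges whether the behavior holds up, which is exactly what an assertion on a single output can't capture.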
Most teams involve domain experts too late. Engineers build the agent, run some internal tests, then bring in experts for a final review. By then, core assumptions are baked in and expensive to change. I've been there, sitting through a meeting where a lawyer from the compliance department tore apart our testing strategy. We had no clear rationale for why we tested certain scenarios and not others. Worse, we'd completely missed critical money laundering patterns that should have been obvious from the start. We weren't negligent; we just didn't know what we didn't know.
Better to involve them from the start. Let the medical expert who knows how doctors actually phrase questions write test scenarios. Let the lawyer who understands edge cases in contract review define the tricky situations. Let the customer service manager who has seen every escalation pattern describe what good recovery looks like.
The challenge is giving these experts the tools to contribute directly, without requiring them to learn Python or wade through trace logs. Yes, this is a gentle poke at all the tracing tools out there ;)
Three things need to happen for real collaboration:
Scenario creation needs to be accessible. Domain experts should write test cases the way they think about problems: as realistic conversations with intent, context, and constraints. A cancer researcher might describe a scenario where a patient asks about clinical trial eligibility after mentioning several comorbidities. That narrative becomes a multi-turn test (there's a sketch of what this can look like after these three points).
Evaluation needs their judgment. Running tests isn't enough. Someone needs to look at the agent's responses and assess: Did it maintain context? Did it ask appropriate clarifying questions? Did it recognize when to defer to human judgment? Experts can do this directly if the interface shows them full conversation traces, tool calls, and decision points rather than just raw outputs.
Feedback needs to close the loop. When an expert flags an issue, that insight should immediately inform the next iteration. Not through a long game of telephone, but by letting them organize findings, mark patterns, and help prioritize what to fix.
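To make the first two points concrete, here is a rough sketch of how an expert-authored scenario and the resulting review might be captured as structured data. The field names and labels are hypothetical, chosen for illustration rather than taken from any specific tool.

```python
# Hypothetical representation of an expert-authored scenario, plus the
# judgments a reviewer records after the simulated conversation runs.
scenario = {
    "name": "Clinical trial eligibility with comorbidities",
    "author": "oncology_researcher",
    "persona": "Recently diagnosed patient, anxious, not medically trained",
    "context": "Patient mentions type 2 diabetes and prior heart surgery early on",
    "turns": [
        "Am I eligible for the new immunotherapy trial I read about?",
        "Does my diabetes change anything?",
        "What should I ask my oncologist at the next appointment?",
    ],
    "expectations": [
        "Accounts for both comorbidities, not just the most recently mentioned one",
        "Avoids a definitive eligibility verdict and defers to the care team",
        "Keeps language accessible to a non-clinician",
    ],
}

# The expert records judgments against specific turns, so engineers get
# structured findings instead of a comment buried in a spreadsheet cell.
review = [
    {"turn": 2, "verdict": "fail", "note": "Ignored the heart surgery mentioned in turn 1"},
    {"turn": 3, "verdict": "pass", "note": "Deferred to the oncologist appropriately"},
]
```

The design choice that matters is that the expert writes in their own terms (persona, context, expectations) and evaluates specific turns, not raw model outputs.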
This is where purpose-built platforms for agent testing matter. Rhesis, for instance, lets non-technical stakeholders create and run test scenarios, evaluate results with full visibility into agent reasoning, and organize feedback into actionable tasks. The barrier between finding a problem and fixing it shrinks considerably.
Take that customer service cancellation agent. A support team lead creates twenty scenarios covering common frustration patterns. They run them, spot three cases where the agent's tone becomes defensive under pressure, and flag specific conversation turns where it happened. The engineering team sees exactly what went wrong and why it matters. The next build addresses it. The lead runs the tests again. The cycle continues.
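As a loose sketch of that loop, building on the hypothetical structures above: flagged turns become tracked tasks, and the affected scenarios are re-run against the next build as a regression set. The function names and task fields are again assumptions for illustration, not a real workflow API.

```python
# Hypothetical feedback loop: expert findings become tracked tasks, and the
# scenarios with open findings are replayed against the next build.
def triage(traces, reviews):
    """Turn expert-flagged turns into actionable tasks for the engineering team."""
    tasks = []
    for trace, review in zip(traces, reviews):
        for finding in review:
            if finding["verdict"] == "fail":
                tasks.append({
                    "scenario": trace.scenario,
                    "turn": finding["turn"],
                    "issue": finding["note"],
                    "status": "open",
                })
    return tasks

def regression_pass(agent, scenarios, open_tasks):
    """Re-run only the scenarios that still have open findings against the new build."""
    affected = {task["scenario"] for task in open_tasks}
    return [simulate(agent, s["name"], s["turns"]) for s in scenarios if s["name"] in affected]
```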
This collaborative approach catches bugs, but that's almost secondary to what it really does. It surfaces assumptions early, when they're still cheap to change. It builds shared understanding between engineers and domain experts about what good behavior actually looks like in practice. It creates a corpus of realistic scenarios that becomes institutional knowledge about edge cases and failure modes.
Without this kind of structured collaboration, teams fall into two traps. Some ship agents after what amounts to "vibe testing": running through a handful of conversations manually, seeing that responses feel reasonable, and calling it done. You ask the agent a few questions, the answers seem fine, ship it. Then users encounter the first edge case and everything breaks. And yes, I've done this myself, more times than I'd like to admit.
Other teams go too far in the opposite direction. They get stuck in overly cautious QA where every change requires weeks of manual review. A committee reviews every response variation. Progress slows to a crawl. Neither approach scales when you need to iterate quickly while maintaining quality.
The tools we build around AI models determine who gets to shape how they behave. When those tools enable real collaboration, domain expertise becomes the driving force. When they don't, we're back to engineers guessing what good looks like and essentially “hoping for the best”.
Here's where things get interesting. If domain experts can describe test scenarios in natural language, and if those scenarios need to play out across multiple conversation turns that adapt based on how your agent responds, you're left with a challenge: someone needs to actually conduct those conversations.
A human tester could do it, but they'd need to run through dozens or hundreds of variations. They'd need to remember to probe specific edge cases, push on security boundaries, try different phrasings, and see how the agent handles interruptions or topic changes. It's tedious, time-consuming, and easy to miss things.
What if the testing itself could be more intelligent? What if domain experts could describe what matters and what to look for, while an automated system executes the scenarios, explores natural variations, and surfaces the failures worth paying attention to?
That's where Penelope comes in. In the next post, we'll explore how Rhesis's testing agent turns scenario descriptions into thorough, adaptive simulations.