
LLM applications behave differently from traditional software. Outputs vary across runs, even with identical inputs. The prompt shapes behavior as much as the code does. Multi-turn conversations and agentic workflows multiply the possible execution paths.
Traditional testing assumed determinism: same input, same output. That assumption no longer holds. Testing LLM applications requires adapting to probabilistic outputs, semantic evaluation, and systems that can change without code changes (through model updates or retrieval drift).
This article walks through six phases that form a testing cycle for LLM and agentic applications: configuring projects, defining requirements, selecting metrics, generating tests, executing evaluations, and collaborating on results. This is how we currently approach it at Rhesis. Each phase feeds the next, and the cycle repeats as your system matures.
Think of it as an infinity loop. The left side (phases 1-3) establishes foundations: what are you testing and what does success look like? The right side (phases 4-6) focuses on execution: generating tests, running them, making sense of results together. Both sides feed each other continuously.

Early in development, you iterate through small loops. Define a requirement, generate a few tests, run them, review. Pre-release, loops widen to cover more behaviors. In production, loops become continuous regression cycles that catch errors before users do.

Projects are the top-level organizational unit. Each project groups related endpoints, tests, test sets, and results together for a specific LLM application or testing initiative.
A project can contain multiple endpoints representing different environments (development, staging, production) or different API configurations. This lets you validate that your application works correctly across environments before deploying changes. Connect your application via REST, WebSocket, or SDK connector so Rhesis can send inputs and capture outputs. The connection will also track traces when configured, since understanding the reasoning path helps diagnose failures later.
Document the application's purpose and boundaries. A customer-facing insurance advisor has different requirements than an internal RAG tool. These boundaries shape everything downstream.
If you're testing multiple applications, consistent project structure lets you compare workflows and spot patterns across your portfolio.

Requirements for LLM applications go beyond functional specs. You need to articulate expected behaviors: how should the application respond in various scenarios? What topics are off-limits? What tone should it maintain?
In Rhesis, behaviors are formalized expectations that describe how your system should perform. Each behavior represents a specific aspect you want to evaluate: response quality, safety, accuracy, or adherence to guidelines. This creates a two-layer structure where behaviors define what good looks like, and metrics (Phase 3) verify you're meeting those expectations.
Translate user needs into expected behaviors, then prioritize them. Some are hard failures if violated (safety guardrails, factual accuracy). Others matter for quality but tolerate occasional misses (helpfulness, tone consistency).
Bring domain knowledge into the process. Legal teams know what compliance violations look like. Product managers know which edge cases customers actually hit. Use file uploads or MCP connectors to bring in existing documentation. This shared language enables collaboration across developers, product managers, and domain experts. It also helps during test case generation later.
Consider adversarial scenarios explicitly. What happens when users try jailbreaks or inject malicious prompts? Include these as first-class behaviors to test, not afterthoughts.

A single metric won't capture whether your application meets expectations. You need a balanced set of metrics covering multiple behaviors.
In Rhesis, metrics are quantifiable measurements that evaluate behaviors and determine if requirements are met. Behaviors define what you expect; metrics measure how well you meet those expectations. Each behavior from Phase 2 connects to one or more metrics.
The platform includes pre-built metrics from multiple open-source projects and providers: DeepEval, Ragas, and Rhesis-developed metrics. These cover different evaluation needs and scopes. Single-turn metrics evaluate individual question-answer pairs (Answer Accuracy, Faithfulness, Context Relevance). Multi-turn metrics evaluate conversation dynamics (Knowledge Retention, Role Adherence, Conversation Completeness). Safety-focused metrics detect specific violations (Jailbreak Detection, PII Leakage Detection, Toxicity Detection).
For example, a "Reliability" behavior might connect to Answer Accuracy, Context Relevance, and Contextual Coherence. A "Robustness" behavior might connect to Jailbreak Detection and PII Leakage Detection.
Most metrics use an LLM as a judge to evaluate responses. With DeepEval and Ragas metrics, the evaluation criteria are fixed. Custom metrics let you define everything yourself: the evaluation criteria, scoring method, and applicable scope.
Specify how you'll judge pass/fail. Some requirements should pass only if no violations occur across repeated runs (safety constraints). Others can use thresholds: pass if 80% of runs meet the criterion.
Automated metrics and human reviewers may disagree on quality. Human reviews overriding automated scores is an important step for creating actionable insights.

Tests validate specific expectations, evaluated using the metrics assigned to behaviors.
Single-turn tests check how the AI responds to a single prompt with no follow-up. These are foundational for testing specific, important aspects: factual accuracy, safety guardrails, format compliance. Each test includes a prompt, category, topic, linked behavior, and optionally an expected output.
Multi-turn tests check how the AI behaves over multiple messages in a conversation. These validate conversation flow, context retention, and topic switches. Multi-turn tests are goal-based: you define what success looks like, optional instructions for how to conduct the test, and restrictions on what the target must not do. Penelope, an autonomous testing agent, adapts its strategy based on responses and conducts the conversation.
Create tests manually in the table-based view or generate them with AI assistance. During AI-based generation, provide as much context as possible and refer back to knowledge already uploaded or connected via MCP. The more context, the more targeted your generated tests. Include adversarial tests deliberately. Jailbreak attempts, prompt injections, requests for prohibited content. Your application will encounter these in production.
Organize tests into test sets for execution. A test set groups related tests that run together against your endpoints.
Build evaluation datasets that reflect reality. Apply five criteria: defined scope, demonstrative of production usage, diverse across edge cases, decontaminated from training data, and dynamic as the system evolves.
Be realistic about sample sizes. Twenty examples isn't evaluation. To cover your full list of expected behaviors, you need hundreds of test cases.

A test run is created when you execute a test set against an endpoint. It captures all test results, execution metadata, and evaluation metrics for analysis.
Automate test execution and integrate with CI/CD. Tests should run on every significant change: prompt updates, model swaps, retrieval index rebuilds. Choose execution options based on constraints. Parallel execution speeds up large suites but may stress rate limits. Sequential execution is slower but more predictable.
Handle non-determinism explicitly. Run tests multiple times and aggregate results. For behaviors that must always hold, a single failure should fail the test. For behaviors with acceptable variance, use threshold-based criteria.
Compare test runs to identify regressions and improvements. Select a baseline test run and view test-by-test comparison. Use filters to focus on what changed: improved tests that now pass, regressed tests that now fail, or unchanged results. Don't declare victory or panic based on small movements within normal variance.
Capture full execution traces. When a test fails, you need more than the final output. What did retrieval return? What tool calls did the agent make? Traces turn failures into debugging information.
Review test runs to validate or override automated evaluations. Human evaluators can mark individual tests as Pass or Fail with comments explaining the decision. This bridges the gap between automated metrics and real-world quality judgment.

Automated evaluation provides signals, but humans make decisions. Collaboration closes the loop by bringing domain experts, engineers, and stakeholders together to interpret results.
Reviewing test results in context is key. Why does this test case exist? What knowledge or document is it based on? How often did it fail in the past? A 72% pass rate means different things depending on which tests failed. Collaborative review surfaces patterns: "These failures all involve the same edge case" or "This metric seems miscalibrated." Pay attention to cases with conflicting human reviews or many comments.
Add human reviews to confirm automated evaluations. Automated metrics catch obvious failures but miss subtle quality issues. Schedule regular sessions where team members examine output samples, especially borderline cases. Human judgment grounds metrics in real-world quality expectations.
Manage tasks directly in context. Tasks help teams coordinate work related to testing activities, particularly for engineers who need to solve these complex puzzles later. When a test reveals a problem, create a task linked to that specific test, test set, or test result. Each task tracks status, priority, and assignee. Tasks assigned out of context lose essential details and get buried in communication tools or spreadsheets. Comments attached to specific tests preserve key knowledge where it belongs.
Feed learning back into earlier phases. Failure patterns might indicate missing requirements (revisit behaviors). Edge cases that keep appearing should become permanent test cases. Collaboration isn't the end of the cycle. It's where the cycle begins again.
Teams building LLM applications need infrastructure for connecting their applications, formulating expectations, generating tests, running evaluations, and reviewing results with context. Rhesis AI offers a collaborative, open-source platform designed around this testing cycle.
The platform connects to your applications, lets domain experts define behaviors without writing code, includes a library of proven metrics, generates single-turn and multi-turn tests including adversarial scenarios, and provides tools for reviewing results as a team.
Explore the repository on GitHub or check the documentation to see how it fits your workflow.