Testing set-up for LLM applications: RPC with Rhesis connector

Dr. Harry Cruz
January 6, 2026
13 mins

Setting up LLM testing: An inconvenient truth

We had a problem with Rhesis. The platform could register REST endpoints for testing LLM applications, but the process was painful. Developers had to manually configure endpoints through our UI, mapping request parameters and response fields. Getting the mappings right was tedious and error-prone. One wrong JSONPath expression and the endpoint wouldn't work.

We needed something simpler. Developers shouldn't have to think about request mappings and response transformations. They should just point at their functions and say "test this."

We looked at how observability tools handle this. Tools like OpenTelemetry make instrumentation easy: you decorate your functions and telemetry flows to the platform automatically. But there's a fundamental difference: observability tools work in one direction. They send data about traces to an endpoint. That's fine for observability, where you're collecting data about production traffic that's already happening.

For Rhesis, one direction isn't enough. We don't just observe production traffic; we need to trigger test runs on demand. When a developer configures a test, we need to invoke their LLM function with specific inputs, collect the output, and evaluate it. We can't wait for the right production request to come in. We need bidirectional control.

Illustration of Bidirectional Control

Rhesis Connector: A convenient solution

So we built the connector. A simple Python decorator that automatically registers functions with the platform and establishes a persistent WebSocket channel. Developers decorate their functions, and Rhesis can trigger them whenever a test is needed. No manual endpoint configuration. No mapping errors.

A fortunate side effect emerged: the same bidirectional channel that enables testing could also support observability scenarios. But we're getting ahead of ourselves. This is the story of how we built it.

The connector concept

Before the connector, registering an endpoint looked like this:

  1. Open the Rhesis UI
  2. Create a new endpoint configuration
  3. Enter the REST URL
  4. Write Jinja2 templates to map Rhesis's standard format to your function's parameters
  5. Write JSONPath expressions to extract values from your response
  6. Debug when the mappings inevitably break
  7. Repeat for every function you want to test

It was tedious and, worse, error-prone. One typo in a JSONPath expression and the endpoint would silently fail.

Most importantly, it required a REST endpoint to exist in the first place. If the application didn't already expose one, the setup placed a burden on the team before Rhesis could deliver any value.

We designed the connector to eliminate all of that. A persistent WebSocket link between an SDK client and the Rhesis backend with two communication patterns:

SDK → Backend: Functions register with their schemas, declaring what's available for testing.

Backend → SDK: Execution requests trigger those functions remotely, collecting results for evaluation.
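
To make the two directions concrete, here's a rough sketch of what the messages on that channel might look like. The field names and structure are illustrative only, not the actual Rhesis wire format:

# Hypothetical message shapes, for illustration only (not the actual wire format)

# SDK → Backend: register a decorated function and its parameter schema
registration = {
    "type": "register",
    "project_id": "my-project-uuid",
    "environment": "production",
    "function": {
        "name": "chat",
        "parameters": {"input": "str", "session_id": "Optional[str]"},
    },
}

# Backend → SDK: trigger that function with specific test inputs
execution_request = {
    "type": "execute",
    "test_run_id": "run-123",
    "function": "chat",
    "arguments": {"input": "What is term life insurance?", "session_id": None},
}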

Developer experience

We invested considerable time in the interface. It had to be simple. A single decorator:

from rhesis.sdk import collaborate

@collaborate()
def chat(input: str, session_id: str = None):
    result = llm.generate(input)  # llm: your application's own model client
    return {"output": result, "session_id": session_id}

That's it. No UI configuration. No manual mappings. The decorator handles the WebSocket lifecycle, function registration, and execution coordination. Functions execute in their natural environment with real dependencies, not isolated test harnesses. We didn't want to force developers to change how their code worked.

Automatic mapping generation

Here's where it gets interesting. Rhesis expects a standardized format with fields like input, session_id, and context. But developers name their parameters whatever makes sense for their application: user_message, conv_id, question, thread_id.

Previously, developers had to write the mappings manually. Now the connector figures it out automatically using a 4-tier approach:

  1. Pattern matching: Tries to match parameter names semantically (e.g., user_message → input, conv_id → session_id)
  2. LLM fallback: If pattern matching is not successful, we use the configured LLM to intelligently generate mappings
  3. Manual override: Developers can still provide explicit mappings in the decorator if needed
  4. UI edits: Any manual tweaks made in the UI are preserved

Most of the time, pattern matching just works. For functions with unusual naming, the LLM figures it out. And when you need precise control, you can still provide explicit mappings.

The mappings that used to require careful manual configuration with Jinja2 templates and JSONPath expressions now happen automatically when the decorator runs. We introspect the function signature, analyze the parameter names, and generate the correct mappings on the fly.
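
As a rough sketch of the idea (not the connector's actual implementation, and the synonym table is an assumption), the pattern-matching tier can be as simple as introspecting the signature and checking each parameter name against known aliases:

import inspect
from typing import Callable, Dict

# Illustrative synonym table; the real connector's heuristics are richer than this
SYNONYMS = {
    "input": {"input", "message", "user_message", "question", "prompt"},
    "session_id": {"session_id", "conv_id", "conversation_id", "thread_id"},
    "context": {"context", "history", "conversation_history"},
}

def generate_mappings(func: Callable) -> Dict[str, str]:
    """Map the function's parameter names onto Rhesis's standard fields."""
    mappings = {}
    for param in inspect.signature(func).parameters.values():
        for standard_field, aliases in SYNONYMS.items():
            if param.name.lower() in aliases:
                mappings[standard_field] = param.name
                break
    return mappings  # anything left unmapped falls through to the LLM tier

def chat(user_message: str, conv_id: str = None):
    ...

print(generate_mappings(chat))  # {'input': 'user_message', 'session_id': 'conv_id'}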

Test execution at scale

Once we had SDK-to-backend connectivity working, we could establish a connection and verify that an endpoint was reachable. That worked great for a single test. But we needed to run hundreds of tests at once: evaluating LLM applications means testing multiple scenarios, edge cases, and conversational flows in parallel.

Running tests directly on the backend wasn't an option. The backend serves API requests and maintains WebSocket connections. Blocking it with test execution would kill performance.

We needed a separate component: Worker, a collection of Celery worker nodes that handle test execution asynchronously. You configure a test, the backend queues it, and workers pick it up and execute it in parallel. This architecture scales horizontally very well: if you need more test capacity, you just add more worker nodes.

But this created a fundamental problem. Worker nodes are completely separate processes, often running on different machines. They have no notion of the WebSocket connections that exist in the backend. Only the backend itself has those connections. So we needed a way for workers to communicate with the backend, which in turn triggers functions in the target application via the SDK.

This is where things got interesting, and where Redis entered the picture.

Architecture overview

The system we ended up with spans four layers working across process boundaries:

Architecture Overview

Component responsibilities

  • SDK: Manages WebSocket connection, executes decorated functions
  • Backend: Multiple Gunicorn worker processes maintain WebSocket connections and serve API requests (each with isolated memory)
  • Worker: Celery worker nodes that execute tests asynchronously (separate processes/machines)
  • Redis: Bridges process boundaries with connection state and pub/sub RPC

The Redis layer came later, after we discovered the hard way that workers couldn't access the backend's memory. It seems blatantly obvious now, but it only became clear late in the development cycle. The complexity was compounded by multiple backend containers in our cloud setup, each running multiple worker processes.

Design decision: WebSocket over HTTP

We debated this issue at length. The requirement was clear: trigger functions in user applications with minimal latency, potentially executing hundreds of test cases in rapid succession.

Why not HTTP polling?

We prototyped HTTP polling first. The application asks "do you have work for me?" every 100ms. Even at that aggressive rate, we're burning 10 requests per second per connected application, and a request can still sit for up to 100ms before execution even starts. It felt wasteful and slow.

Why WebSocket won

WebSocket inverts the model. The connection stays open, and we push requests when needed. Latency drops to network round-trip time, which is typically under 10ms. One connection at startup receives all work.

Trade-offs we accepted:

  • More complex connection management than stateless HTTP
  • Backend must track active connections
  • Network interruptions require reconnection logic with exponential backoff

Benefits we gained:

  • Real-time triggering matches production timing patterns
  • Proactive function registration without polling
  • No TLS/TCP handshake overhead per interaction

For a testing platform, precise control over execution timing outweighed the operational burden. We committed to WebSocket.
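
The reconnection logic we accepted as part of that trade-off is conceptually simple. Here is a minimal sketch using the websockets library; the URL, backoff values, and message handler are placeholders rather than the SDK's actual code:

import asyncio

import websockets  # pip install websockets

async def run_connection(url: str = "wss://example.invalid/ws") -> None:
    """Keep one persistent WebSocket open, reconnecting with exponential backoff."""
    delay = 1
    while True:
        try:
            async with websockets.connect(url) as ws:
                delay = 1  # reset the backoff once we're connected again
                async for message in ws:
                    print("received:", message)  # registration acks, execution requests, ...
        except (OSError, websockets.ConnectionClosed):
            await asyncio.sleep(delay)
            delay = min(delay * 2, 60)  # cap the backoff at 60 seconds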

The multi-process challenge

With Worker handling test execution separately from the backend, we hit the core distributed systems problem. The first implementation seemed straightforward: backend stores connections in memory, workers check that dictionary.

We deployed it, connected an SDK, triggered a test from a worker. Error: "SDK client is not currently connected."

Wait, what? The SDK was clearly connected. Direct API calls worked perfectly. What was going on?

Memory isolation

We spent quite some time debugging before it hit us, in a textbook 'duh' moment. The problem is fundamental to how operating systems work:

# Backend process memory
self._connections = {"project-a:production": <WebSocket>}

# Worker process memory (different address space)
self._connections = {}  # Completely empty

Worker nodes couldn't access that memory. The dictionary existed in a completely separate process's address space. Of course they couldn't see it.

Okay, Redis to the rescue. We added the connection state to Redis, deployed, tested again. Sometimes it worked. Sometimes it failed with the same error. Intermittent failures: every developer's favorite kind of bug.

The race condition gets worse

More debugging. We discovered another layer of complexity: the backend itself runs with multiple Gunicorn worker processes (typically 4+) for handling API load. These are separate from the Celery Worker nodes. They're the backend's own processes for serving HTTP requests and maintaining WebSocket connections. Each backend worker process maintains its own _connections dictionary.

In our cloud setup, we also run multiple backend containers for redundancy. So we have multiple containers, each with multiple worker processes, each with isolated memory.

The race condition was brutal:

What was happening:

  1. SDK connects to backend worker process #1 → stores connection in its local dict, marks active in Redis
  2. Test execution request arrives at backend worker process #2 (different process, same or different container)
  3. Backend worker process #2 checks its local dictionary → finds nothing → immediately publishes "no connection found" error
  4. Backend worker process #1 receives the same request via Redis pub/sub → finds connection → forwards to SDK successfully
  5. Test completes, but the requesting Celery worker already received the error from step 3 and gave up

The test was succeeding, but we were reporting failure. The logs showed both the error and the successful result, milliseconds apart.

Design decision: hybrid storage

Now we had a choice to make: where should we store the connection state? The memory isolation problem was compounded by multiple backend worker processes and multiple containers, so we needed a shared store such as Redis for cross-process visibility. But Redis adds latency to every lookup.

We ended up with a hybrid approach:

  1. In-memory Python dict → Fast local access within each backend worker process (0ms overhead)
  2. Redis keys → Cross-process visibility for coordination (2-5ms overhead)

Each backend worker process maintains its own dictionary for fast lookups. When handling WebSocket connections or direct API calls, it checks its local dict first. For Worker nodes checking connection status or coordinating RPC, they query Redis.
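
A minimal sketch of the connect path under this hybrid scheme (the key layout matches the RPC section below, but the actual backend code differs):

import redis.asyncio as redis

class ConnectionRegistry:
    """Sketch of the hybrid store: a local dict for speed, Redis for visibility."""

    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self._connections = {}                   # per-process, 0ms lookups
        self._redis = redis.from_url(redis_url)  # cross-process, 2-5ms lookups

    async def register(self, project_id: str, environment: str, websocket) -> None:
        key = f"{project_id}:{environment}"
        self._connections[key] = websocket
        # Mark the connection as active for every other process, with a 1-hour TTL
        await self._redis.set(f"ws:connection:{key}", "active", ex=3600)

    async def unregister(self, project_id: str, environment: str) -> None:
        key = f"{project_id}:{environment}"
        self._connections.pop(key, None)
        await self._redis.delete(f"ws:connection:{key}")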

The coordination fix

Backend worker processes now check Redis before publishing errors:

if key not in self._connections:
    # Check if another worker process has this connection
    redis_has_connection = await redis_manager.client.exists(
        f"ws:connection:{key}"
    ) > 0
    
    if redis_has_connection:
        return  # Another process owns it, let them handle it
    else:
        await publish_error(...)

This simple check fixed the race condition. If Redis says a connection exists, trust it and let the process that owns it handle the request. Only publish an error if Redis confirms the connection truly doesn't exist.

Alternative we considered

Move everything to Redis. Make it the single source of truth. But every connection lookup would hit Redis, adding 1-2ms latency and serialization overhead. For high-frequency API calls serving the REST API, that's unacceptable.

The trade-off

Synchronizing both stores during connect/disconnect adds complexity. But we optimize locally where possible (in-memory for same-process operations), and coordinate across processes only where necessary (Redis for cross-process communication). The hybrid approach balanced performance with distributed coordination across multiple containers and worker processes.

Redis Remote Procedure Call pattern

Once we solved the multi-process visibility problem, we had the infrastructure for workers to invoke SDK functions through the backend. Redis became our coordination layer, implementing the communication bridge between workers and the backend's WebSocket connections. We built a Remote Procedure Call (RPC) pattern using three data structures:

  • Connection status: ws:connection:{project_id}:{environment} → "active" (1 hour TTL)
  • Request channel: ws:rpc:requests (shared pub/sub)
  • Response channels: ws:rpc:response:{test_run_id} (per-request pub/sub)

How it works

Redis RPC Pattern

A worker verifies the connection exists in Redis, subscribes to a response channel unique to this request, then publishes the request to the shared channel. The backend runs a background task listening on ws:rpc:requests. When a message arrives, it looks up the WebSocket in its local dictionary and forwards the request through it.

The SDK executes the function and returns results through the WebSocket. The backend publishes to the response channel. The worker, still listening, receives the result and continues.
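
Stripped of error handling, the worker side of that round trip might look roughly like the sketch below. The channel and key names mirror the layout above, but this is an illustration, not the Worker's actual implementation:

import asyncio
import json
import uuid

import redis.asyncio as redis

async def invoke_via_rpc(project_id: str, environment: str, arguments: dict,
                         timeout: float = 30.0) -> dict:
    """Sketch: a worker invoking an SDK function through the backend via Redis."""
    r = redis.from_url("redis://localhost:6379")
    key = f"{project_id}:{environment}"

    # 1. Verify that some backend process holds a live WebSocket for this key
    if not await r.exists(f"ws:connection:{key}"):
        raise RuntimeError("SDK client is not currently connected")

    # 2. Subscribe to a response channel unique to this request
    test_run_id = str(uuid.uuid4())
    pubsub = r.pubsub()
    await pubsub.subscribe(f"ws:rpc:response:{test_run_id}")

    # 3. Publish the request on the shared channel; whichever backend process
    #    owns the connection forwards it to the SDK over the WebSocket
    await r.publish("ws:rpc:requests", json.dumps({
        "test_run_id": test_run_id,
        "connection_key": key,
        "arguments": arguments,
    }))

    # 4. Wait for the result, giving up after `timeout` seconds (30s by default)
    deadline = asyncio.get_running_loop().time() + timeout
    while asyncio.get_running_loop().time() < deadline:
        message = await pubsub.get_message(ignore_subscribe_messages=True, timeout=1.0)
        if message is not None:
            return json.loads(message["data"])
    raise TimeoutError(f"No response for test run {test_run_id} within {timeout}s")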

Performance: The entire round trip takes 50-200ms. Redis adds only 2-5ms of overhead; the rest is function execution time. Worker nodes time out after 30 seconds to handle slow functions or disconnected SDKs.

The beauty of the pub/sub model: workers don't need to know which backend process owns a connection. They just publish to Redis and trust that the right process will pick it up.

Usage example

Here's what connecting an application looks like now:

from typing import List, Optional

from rhesis.sdk import RhesisClient, collaborate

# 1. Initialize at startup
client = RhesisClient(
    api_key="rh-your-api-key",
    project_id="my-project-uuid", #you will find this in the Rhesis UI
    environment="production"
)

# 2. Decorate functions
@collaborate(
    name="chat",
    description="Chat with the insurance assistant",
)
def chat(
    message: str,
    session_id: Optional[str] = None,
    use_case: str = "insurance",
    conversation_history: Optional[List[dict]] = None,
):
    result = llm.generate(message, use_case, conversation_history)
    return {"output": result, "session_id": session_id}

# 3. Use normally
response = chat("What is term life insurance?")

In this example, chat is your application’s entry point for incoming user requests, i.e., how your application actually talks to the world.

Three steps. No UI configuration. No manual mappings. Compare this to the old process: open the UI, create an endpoint, write Jinja2 templates, write JSONPath expressions, debug mapping errors. We went from a multi-step, error-prone process to a few lines of code.

In the screenshot below, you can see the new chat endpoint registered with the connection type SDK, living side-by-side with REST-type endpoints. In the details page, you can see that the function parameters have been mapped automatically.

Endpoints in Rhesis Platform
Detail View of Endpoint in Rhesis Platform

A fortunate side effect: Observability

Building the connector for testing opened an unexpected door.

We realized the WebSocket channel we built could be used in more than one way. The same persistent connection that carries test requests could transport traces, metrics, and logs from production traffic. We're already serializing execution data, already maintaining the connection, already handling authentication. The infrastructure is sitting there.

The vision

Adopt OpenTelemetry semantic conventions directly. When a decorated function executes, whether triggered remotely by a test or called locally in production, it generates a span. That span flows through the WebSocket with execution metrics: time, token usage, cost. The backend correlates test results with production traces, showing how the same function behaves under testing versus real traffic.
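
For illustration, such a span could be little more than a dictionary serialized onto the existing channel. The attribute names below are placeholders loosely inspired by OpenTelemetry conventions, not a finalized schema:

import time
import uuid

# Illustrative span payload with placeholder attribute names
span = {
    "trace_id": uuid.uuid4().hex,
    "span_id": uuid.uuid4().hex[:16],
    "name": "chat",
    "start_time_unix_nano": time.time_ns(),
    "end_time_unix_nano": time.time_ns(),
    "attributes": {
        "rhesis.trigger": "test",        # "test" when triggered remotely, "production" when called locally
        "llm.usage.input_tokens": 128,   # placeholder metrics: latency, token usage, cost
        "llm.usage.output_tokens": 256,
        "llm.cost_usd": 0.0031,
    },
}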

The benefits

  • Reduced footprint: One connection, one auth mechanism, one reconnection strategy
  • Real-time observability: We see execution data as it happens
  • Immediate alerting: Live debugging during test runs
  • Shared infrastructure: No duplicate systems

We built this to solve endpoint registration and testing. But we accidentally created infrastructure for comprehensive instrumentation. The connector that started as a convenience for developers could become their observability pipeline too. Same channel, dual purpose. It feels elegant.

Conclusion

We set out to eliminate tedious endpoint configuration. What we built was a bidirectional connector that automatically maps function signatures, coordinates across distributed processes, and scales test execution.

The automatic mapping generation turned minutes of manual Jinja2 and JSONPath configuration into milliseconds of pattern matching with LLM fallback. The distributed coordination challenge (enabling Worker nodes to trigger functions through WebSocket connections in separate backend processes) forced us into Redis pub/sub RPC with hybrid storage. We optimize locally where possible, coordinate across processes only where necessary.

The patterns we implemented (lazy initialization, hybrid storage, pub/sub RPC) apply broadly to distributed systems where components must coordinate without shared memory.

What started as easier endpoint registration became infrastructure for bidirectional control. And it opened an unexpected door: the same channel can handle observability data, turning a testing connector into the foundation for a comprehensive instrumentation platform.
