DSPy in practice: what changes when the router is already an LLM but isn't yet compilable

The problem DSPy solves isn't the absence of AI in routing. It's the absence of a contract on that AI's output.

26 MAR 2026·7 min read·DSPy / LLM Routing / Pydantic / Migration

In the last posts I talked about data modeling in Redis Stack and about DSPy as a framework that treats prompts as compilable code.

Today I want to show how those two topics connect in practice, using e-commerce as the example domain and reproducing a real migration I did on a conversational assistant.

The point I want to make clear up front: the problem DSPy solves isn't the absence of AI in routing. It's the absence of a contract on that AI's output.

The starting point: Agno, tools with DTOs, Redis as text

The original assistant was built with the Agno framework. The architecture was elegant and worked: an agent with a detailed system prompt, a set of tools with typed DTOs, and Redis as state storage (key/value, text).

Each tool had its input DTO:

This works. The LLM reads the system prompt, understands which tool to use, builds the arguments, and the DTO validates on entry. For a catalog with controlled volume and predictable queries, this architecture delivers.

Where the Agno architecture starts hitting its limit

As the catalog and query variety grew, three problems emerged. None of them are bugs. They're structural limitations of routing based on a free-form prompt.

Problem 1: the LLM's output has no verifiable contract.

Agno lets the LLM decide which arguments to pass to the tool. The LLM is smart and usually gets it right. But "usually" isn't auditable. When the user asks "men's Nike sneakers up to 300 reais with free shipping" and the tool receives {"category": "sneakers", "price_max": 300} without free_shipping=true, you have no systematic way to detect this beyond manual log review.

Problem 2: large prompt + model update = unpredictable behavior.

When the provider updated the model version, some classifications changed silently. Queries that used to go to BuscarProdutosTool started going to RankingCategoryTool: no exception, no failing test. The behavior simply changed, and it only surfaced through user complaints.

Problem 3: improving routing is a manual trial-and-error process.

Adding a new use case (for example, search by product compatibility) means editing the system prompt, manually testing affected scenarios, and hoping the order of instructions in the prompt doesn't break the cases that already worked. There's no dataset, no metric, no replay of historical cases.

The root cause of all three problems is the same: routing was correct in intent, but opaque in execution.

What DSPy changes (and what stays)

Before showing code, it's important to say what stays the same.

The LLM is still the one routing. We don't go back to if/elif. We don't trade intelligence for rules. What changes is how that decision's output is specified, validated, and improved over time.

The central change is this:

The Signature: specifying the contract

The Signature isn't a prompt. There's no "answer like this", "be objective", "don't make things up". It's a declarative contract: this goes in, this comes out, these are the classification rules. DSPy assembles the actual prompt - including compiled few-shot demos - automatically.

RouterOutput: typed output before reaching the tool


In the Agno model, the LLM built the tool's kwargs directly. If it forgot free_shipping, the DTO received None and the search ignored the filter. No log, no detection.

With RouterOutput, the LLM produces JSON that goes through coercion before any tool is called. A missing free_shipping becomes an explicit None, which is auditable, observable in the Langfuse trace, and traceable in the recompilation dataset.
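
To make the "explicit None" point concrete, here is a dependency-free sketch. It uses a stdlib dataclass as a stand-in for the Pydantic model (so, unlike Pydantic, it does not validate types), it covers only the five fields named in this post, and the helper names from_llm_json and unset_fields are hypothetical.

```python
from dataclasses import asdict, dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class RouterOutput:
    """Stand-in for the Pydantic model; only fields named in the post."""
    scope: str = "geral"
    category: Optional[str] = None
    price_max: Optional[float] = None
    free_shipping: Optional[bool] = None
    sort_direction: Optional[str] = None


def from_llm_json(raw: dict) -> RouterOutput:
    """Keep known keys only; anything the LLM omitted stays an explicit None."""
    known = {f.name for f in fields(RouterOutput)}
    return RouterOutput(**{k: v for k, v in raw.items() if k in known})


def unset_fields(output: RouterOutput) -> List[str]:
    """List the fields the LLM left empty, so a trace can record them."""
    return [f.name for f in fields(output) if getattr(output, f.name) is None]


# The LLM forgot free_shipping for "... up to 300 reais with free shipping".
parsed = from_llm_json(
    {"scope": "busca_catalogo", "category": "sneakers", "price_max": 300}
)
trace_payload = {"output": asdict(parsed), "unset": unset_fields(parsed)}
```

The omission is now a visible key in trace_payload rather than an absent kwarg, which is what makes it possible to count, alert on, or feed back into the dataset.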

The asymmetric metric: where business priority becomes code

For the optimizer to work, you define a metric. The temptation is simple accuracy: correct fields / total fields.

The problem: that metric accepts an absurdity. An example that got scope wrong but got the other 10 fields right scores 10/11 ≈ 0.91 and is accepted as a demo by BootstrapFewShot. The model learns that wrong scope with right filters is acceptable.

It isn't. Wrong scope means wrong tool. Wrong tool means Redis receives a completely different query, and the user gets a silently incorrect response.

This asymmetry is intentional and needs to be documented:

scope is worth 2.50 because it's the routing decision
sort_direction is worth 0.50 because desc vs asc on a search has less impact than sending the query to the wrong scope

Because the metric scores any wrong-scope example as a failure, BootstrapFewShot rejects it as a demo, regardless of how many other fields were right. It's the optimizer acting as automated QA with the rules you defined.
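
A minimal sketch of a metric with this shape, under a few assumptions: only the 2.50 and 0.50 weights come from the text above, the other fields are assumed to weigh 1.0, the six-field subset and the 0.9 demo threshold are hypothetical, and SimpleNamespace stands in for DSPy's example and prediction objects. The (example, pred, trace=None) signature follows the usual DSPy metric convention, where a non-None trace means the optimizer is bootstrapping and expects a pass/fail answer.

```python
from types import SimpleNamespace

# scope and sort_direction weights are from the post; the rest are assumed.
WEIGHTS = {
    "scope": 2.5,
    "category": 1.0,
    "brand": 1.0,
    "price_max": 1.0,
    "free_shipping": 1.0,
    "sort_direction": 0.5,
}
DEMO_THRESHOLD = 0.9  # hypothetical bar for accepting a bootstrapped demo


def naive_accuracy(example, pred) -> float:
    """Correct fields / total fields: the metric the post warns against."""
    hits = sum(getattr(pred, f, None) == getattr(example, f, None) for f in WEIGHTS)
    return hits / len(WEIGHTS)


def routing_metric(example, pred, trace=None):
    """Weighted field accuracy, with wrong scope as a hard zero."""
    if getattr(pred, "scope", None) != example.scope:
        score = 0.0  # wrong scope means wrong tool: nothing else counts
    else:
        earned = sum(
            w for f, w in WEIGHTS.items()
            if getattr(pred, f, None) == getattr(example, f, None)
        )
        score = earned / sum(WEIGHTS.values())
    if trace is not None:  # bootstrapping: answer accept / reject
        return score >= DEMO_THRESHOLD
    return score


gold = SimpleNamespace(scope="busca_catalogo", category="sneakers", brand="Nike",
                       price_max=300, free_shipping=True, sort_direction="asc")
wrong_scope = SimpleNamespace(**{**vars(gold), "scope": "geral"})
wrong_sort = SimpleNamespace(**{**vars(gold), "sort_direction": "desc"})
```

With these six fields, naive_accuracy(gold, wrong_scope) is 5/6 while routing_metric gives 0.0 for the same pair. A wrong sort_direction alone costs 0.5 out of 7.0 and still passes the demo threshold.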

Output coercion: the layer the docs ignore

The LLM doesn't always deliver the same format. Sometimes clean JSON. Sometimes JSON inside a markdown fence. Sometimes a scope field with a value slightly off ("buscar_produtos" instead of "busca_catalogo").

In the Agno architecture, Pydantic caught this at the tool entry and returned an error for the LLM to retry. With DSPy, coercion happens before any tool is called:
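
A minimal sketch of a coercion layer along these lines, assuming the raw LLM output arrives as a string. The alias "buscar_produtos" and the scopes "busca_catalogo" and "geral" come from this post; the function names and the decision to keep filters when the scope is unknown are my assumptions for illustration.

```python
import json

FENCE = "`" * 3  # markdown code fence, built this way to keep the listing clean
VALID_SCOPES = {"busca_catalogo", "geral"}  # only the scopes named in the post
SCOPE_ALIASES = {"buscar_produtos": "busca_catalogo"}
FALLBACK_SCOPE = "geral"


def strip_fence(text: str) -> str:
    """Remove a surrounding markdown fence and optional language tag."""
    text = text.strip()
    if len(text) >= 2 * len(FENCE) and text.startswith(FENCE) and text.endswith(FENCE):
        body = text[len(FENCE):-len(FENCE)]
        first_line, newline, rest = body.partition("\n")
        if newline and (first_line.strip() == "" or first_line.strip().isalpha()):
            body = rest  # drop the "json" tag line
        return body.strip()
    return text


def coerce_router_output(raw: str) -> dict:
    """Always return a dict with a valid scope; never raise."""
    try:
        data = json.loads(strip_fence(raw))
    except json.JSONDecodeError:
        return {"scope": FALLBACK_SCOPE}
    if not isinstance(data, dict):
        return {"scope": FALLBACK_SCOPE}
    scope = str(data.get("scope") or "").strip().lower()
    scope = SCOPE_ALIASES.get(scope, scope)  # repair near-miss values
    data["scope"] = scope if scope in VALID_SCOPES else FALLBACK_SCOPE
    return data
```

The function has no failure path that reaches the caller: malformed output degrades to the generic scope instead of triggering a retry.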

The important architectural difference: in the Agno model, a parsing failure could lead the LLM to retry with reformulation, adding latency and consuming extra tokens. Here, the failure is absorbed in coercion and the graph continues with scope=geral. The worst case is a generic response, not an error visible to the user.

What changed in Redis alongside DSPy

The migration to DSPy came with the change from Redis-as-text to Redis Stack. The two changes are complementary.

In the previous model, the tool fetched products from the API, serialized the result as a JSON string, and stored it in Redis as a simple cache. The LLM received raw JSON and formatted the response.

With Redis Stack, the document is indexed at ingestion:

The RouterOutput that DSPy produces feeds directly into the QueryBuilder that builds the RediSearch query. The scope decides which index. The filters become predicates. The projection varies with intent (Slim for listing, Fat via JSON.GET for detail).
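
As a sketch of that handoff, the function below turns a router output dict into FT.SEARCH arguments. The index name, the field names price and free_shipping as indexed attributes, and the choice to index the boolean as a TAG with values "true"/"false" are all assumptions; projection (Slim vs Fat) is left out to keep it short.

```python
import re
from typing import List

# Hypothetical mapping: the scope decides which index is queried.
INDEX_BY_SCOPE = {"busca_catalogo": "idx:products"}


def escape_tag(value: str) -> str:
    """Backslash-escape punctuation and spaces inside a TAG value."""
    return re.sub(r"([^A-Za-z0-9_])", r"\\\1", value)


def build_search_args(router_output: dict) -> List[str]:
    """Translate typed filters into RediSearch predicates."""
    index = INDEX_BY_SCOPE[router_output["scope"]]
    predicates = []
    if router_output.get("category"):
        predicates.append("@category:{" + escape_tag(router_output["category"]) + "}")
    if router_output.get("price_max") is not None:
        predicates.append("@price:[-inf " + str(router_output["price_max"]) + "]")
    if router_output.get("free_shipping") is not None:
        flag = "true" if router_output["free_shipping"] else "false"
        predicates.append("@free_shipping:{" + flag + "}")
    query = " ".join(predicates) or "*"  # "*" matches every document
    args = ["FT.SEARCH", index, query]
    direction = router_output.get("sort_direction")
    if direction in ("asc", "desc"):
        args += ["SORTBY", "price", direction.upper()]
    return args


args = build_search_args({
    "scope": "busca_catalogo", "category": "sneakers",
    "price_max": 300, "free_shipping": True, "sort_direction": "asc",
})
```

Note that a None filter produces no predicate at all, which is exactly why a silently dropped free_shipping changes the result set without raising anything.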

The complete chain:

What LangGraph gained from this

The router_node in LangGraph became declarative:
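
A minimal sketch of what that routing function can look like, written as plain Python so it runs without LangGraph installed. The state shape and the node names catalog_search and general_answer are hypothetical.

```python
from typing import Any, Dict, TypedDict


class AssistantState(TypedDict, total=False):
    """Hypothetical graph state; router_output is filled by the DSPy router."""
    query: str
    router_output: Dict[str, Any]


# Hypothetical node names, keyed by the scopes named in the post.
NODE_BY_SCOPE = {
    "busca_catalogo": "catalog_search",
    "geral": "general_answer",
}


def route_by_scope(state: AssistantState) -> str:
    """Pick the next node purely from the scope already stored in state."""
    scope = state.get("router_output", {}).get("scope", "geral")
    return NODE_BY_SCOPE.get(scope, NODE_BY_SCOPE["geral"])
```

In LangGraph, a function like this would typically be passed to add_conditional_edges on the StateGraph, with the router node as the source.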

The graph doesn't make business decisions. It reads a field from state. DSPy made the decision, with a verifiable contract, a metric that captures business priority, and a reproducible improvement process.

When the provider updated the model, the response wasn't to edit 80 lines of system prompt and hope. It was to run the recompilation script with the new model as target and validate the metric on the validation set before deploying.
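
The "validate before deploying" step can be sketched as a small gate over the NDJSON dataset. This is an illustration under assumptions: the record shape ({"query": ..., "expected": {...}}), the function names, and the min_exact_match parameter are hypothetical, and predict stands for whatever callable wraps the recompiled router.

```python
import json
from typing import Callable, Dict, Iterable, List


def load_ndjson(lines: Iterable[str]) -> List[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def evaluate_router(predict: Callable[[str], dict], examples: List[dict]) -> Dict[str, float]:
    """Count scope errors separately from full matches on expected fields."""
    if not examples:
        raise ValueError("validation set is empty")
    scope_errors = 0
    exact_matches = 0
    for example in examples:
        pred = predict(example["query"])
        expected = example["expected"]
        if pred.get("scope") != expected["scope"]:
            scope_errors += 1
        elif all(pred.get(key) == value for key, value in expected.items()):
            exact_matches += 1
    return {
        "exact_match": exact_matches / len(examples),
        "scope_errors": scope_errors,
    }


def can_deploy(report: Dict[str, float], min_exact_match: float) -> bool:
    """Zero tolerance for scope errors, plus a floor on overall quality."""
    return report["scope_errors"] == 0 and report["exact_match"] >= min_exact_match
```

Running this against the old and the new model version on the same validation set gives a before/after comparison that a prompt edit alone never provides.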

What got more complex (but is worth it)

This model has more pieces than the original Agno:

Declarative Signature instead of system prompt
Pydantic RouterOutput model instead of free-form kwargs passed to the tool
Defensive coercion layer
Asymmetric metric with documented weights
NDJSON dataset with supervised examples
Versioned recompilation script

That's real complexity. There's no point minimizing it.

The trade that justifies the complexity: router behavior went from "I believe the prompt is correct" to "I have evidence the model classifies 94.7% of validation-set queries correctly, with zero tolerance for scope errors".

For an assistant with low volume and stable use cases, Agno with a good prompt suffices. For an assistant that grows in volume, swaps models periodically, and needs to trace exactly where routing failed, DSPy solves what the prompt can no longer solve on its own.