Building a production conversational assistant · Part 7

Routing isn't a single problem

Three routers, three different problems: DSPy, custom Semantic Router, and Aurélio AI

Before building the custom one I evaluated an open-source library that almost made it into the project. This is the comparison I wish I'd read before making those decisions.

1 APR 2026·10 min read·LLM Routing / Semantic Router / DSPy / Aurélio AI

In previous posts I showed how DSPy replaced Agno/LangGraph as an assistant's cognitive router, putting an output contract, a metric, and an improvement process on top of the LLM.

What I didn't show is that, in the same system, there's another router with a completely different architecture. And that, before building the custom one, I evaluated an open-source library that almost made it into the project.

This post is the comparison I wish I'd read before making those decisions.

The starting point: routing isn't a single problem

When you say intent router, you can be talking about very different things:

Problem A: The user typed "I want to cancel my order". Is it support, return, pre-shipment cancellation, or out of scope? Decision among a few macro candidates, with per-intent thresholds and ambiguity logic.

Problem B: The user typed "men's Nike sneakers up to $300 with free shipping and good rating". It's a catalog search. But with which tool? With which parameters? Scope + extraction of 10+ structured filters from a free sentence.

Problem C: You're starting a project, you need to route across 8 intents, and you don't want to build anything from scratch. You want something working in 20 minutes.

Each of those problems can have a different router as the answer.

Router 1: Custom embedding-based Semantic Router

What it is

A router that embeds the user query, computes cosine similarity against matrices of positive examples per intent, penalizes with negative examples, and applies a decision pipeline with configurable thresholds.

How it works in practice

Each intent gets a score from its positive examples, and its negative examples pull that score down. This solves Problem A: macro classification across intents with clear semantics. No LLM call, and latency stays at 20-50ms.
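A minimal sketch of that scoring and decision step, assuming toy vectors in place of real embeddings. The function names, the negative_weight penalty, the ambiguity_margin, and the "ambiguous" label are illustrative assumptions, not the production pipeline.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_intent(query: Vector, positives: List[Vector], negatives: List[Vector],
                 negative_weight: float = 0.5) -> float:
    """Best positive match, penalized by the best negative match (assumed formula)."""
    best_pos = max((cosine(query, p) for p in positives), default=0.0)
    best_neg = max((cosine(query, n) for n in negatives), default=0.0)
    return best_pos - negative_weight * max(best_neg, 0.0)


def route(query: Vector,
          intents: Dict[str, Dict[str, List[Vector]]],
          thresholds: Dict[str, float],
          fallback: str = "faq",
          ambiguity_margin: float = 0.05) -> Tuple[str, List[Tuple[float, str]]]:
    """Return (decision, all scored candidates sorted best-first)."""
    scored = sorted(
        ((score_intent(query, ex["pos"], ex["neg"]), name) for name, ex in intents.items()),
        reverse=True,
    )
    if not scored:
        return fallback, scored
    top_score, top_name = scored[0]
    # Per-intent threshold: below it, the query falls into the fallback.
    if top_score < thresholds.get(top_name, 0.5):
        return fallback, scored
    # Ambiguity logic: the two best candidates are too close to call.
    if len(scored) > 1 and top_score - scored[1][0] < ambiguity_margin:
        return "ambiguous", scored
    return top_name, scored
```

In a real router the vectors would come from the encoder and the positives would be stored as a matrix per intent, so the similarity becomes one matrix product instead of a Python loop.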

The pairwise reranker: where domain knowledge lives

For pairs of intents that are frequently confused, a heuristic reranker decides based on text patterns. The pairwise reranker is domain knowledge encoded as a rule. The embedding doesn't distinguish "I want to see my order status" from "I want to see men's sneakers" - both can be close to busca_catalogo. The rule resolves it deterministically and auditably.
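As an illustration, a pairwise rule table can be as small as the sketch below. The order_tracking intent name, the rule name, and the regex are hypothetical; only busca_catalogo comes from the real system.

```python
import re
from typing import Dict, FrozenSet, List, Optional, Pattern, Tuple

Rule = Tuple[str, Pattern[str], str]  # (rule name, text pattern, winning intent)

# Keyed by the unordered pair of intents that the embedding tends to confuse.
PAIRWISE_RULES: Dict[FrozenSet[str], List[Rule]] = {
    frozenset({"busca_catalogo", "order_tracking"}): [
        ("mentions_order", re.compile(r"\b(order|tracking|delivery)\b", re.I), "order_tracking"),
    ],
}


def rerank_pair(text: str, first: str, second: str) -> Tuple[str, Optional[str]]:
    """Break a tie between the top two intents.

    Returns (winner, rule_name). rule_name is None when no rule fired,
    in which case the embedding's first choice is kept.
    """
    for rule_name, pattern, winner in PAIRWISE_RULES.get(frozenset({first, second}), []):
        if pattern.search(text):
            return winner, rule_name
    return first, None
```

Returning the rule name alongside the winner is what makes the decision auditable: the log can say which rule overrode the embedding.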

CFG_PARAMS in Redis: hot reload without deploy

When a routing behavior needs adjustment in production (a new term that starts confusing intents, a threshold that needs calibration), you update CFG_PARAMS in Redis. No new deploy, no new compilation.
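A sketch of the reading side of that hot reload, under assumptions: the loader is any callable that returns the raw JSON string (with redis-py it could be a lambda wrapping client.get("CFG_PARAMS")), and the TTL, the class name, and the config keys are made up for the example.

```python
import json
import time
from typing import Any, Callable, Dict, Optional


class HotConfig:
    """Caches a JSON config and re-reads it at most once per TTL window."""

    def __init__(self, load_raw: Callable[[], Optional[str]], defaults: Dict[str, Any],
                 ttl_seconds: float = 5.0, clock: Callable[[], float] = time.monotonic):
        self._load_raw = load_raw
        self._defaults = dict(defaults)
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached: Dict[str, Any] = dict(defaults)
        self._loaded_at: Optional[float] = None

    def get(self) -> Dict[str, Any]:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._loaded_at = now
            try:
                raw = self._load_raw()
                if raw is not None:
                    parsed = json.loads(raw)
                    if isinstance(parsed, dict):
                        # Stored values override defaults; missing keys keep defaults.
                        self._cached = {**self._defaults, **parsed}
            except Exception:
                # Store unreachable or payload malformed: keep the last good config
                # so a bad write never takes routing down.
                pass
        return self._cached
```

The design choice worth noting is the failure mode: a broken value in the store leaves the router on its previous configuration instead of raising on the request path.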

The work nobody documents: threshold calibration

The custom semantic router has a problem that only shows up in production: thresholds are set manually, and the choice of value has no direct mathematical backing.

How do you arrive at the right values? Trial and error with a set of test queries, watching where the router starts accepting queries it shouldn't (threshold too low) or rejecting legitimate queries, which then fall into the FAQ fallback (threshold too high).
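That trial and error can at least be made systematic. The sketch below assumes you already have, for one intent, a list of (score, should_accept) pairs from your test queries; the function name and the sample numbers are illustrative.

```python
from typing import List, Tuple


def sweep_threshold(samples: List[Tuple[float, bool]],
                    candidates: List[float]) -> List[Tuple[float, int, int]]:
    """For each candidate threshold, count both kinds of error.

    samples: (similarity score for this intent, whether the query truly belongs to it)
    returns: (threshold, false_accepts, false_rejects) per candidate
    """
    rows = []
    for t in candidates:
        false_accepts = sum(1 for score, ok in samples if score >= t and not ok)
        false_rejects = sum(1 for score, ok in samples if score < t and ok)
        rows.append((t, false_accepts, false_rejects))
    return rows


samples = [(0.9, True), (0.8, True), (0.75, False), (0.6, True), (0.4, False)]
for threshold, fa, fr in sweep_threshold(samples, [0.5, 0.7, 0.85]):
    print(f"threshold={threshold:.2f} false_accepts={fa} false_rejects={fr}")
```

The table it prints doesn't pick the value for you. It only shows the trade-off at each cutoff, which is the part that is usually done by eye.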

That's manual calibration. And it has a cost that gets worse over time.

When the encoder changes - new embedding model, new Azure OpenAI version, provider migration - the score distribution changes. Thresholds that worked with text-embedding-3-small don't work with text-embedding-3-large. You have to recalibrate from scratch.

When the utterance catalog grows, adding more positive examples per intent and more semantic diversity, the distribution changes again.

CFG_PARAMS in Redis solves the operation of adjusting without deploying. It doesn't solve the discovery of the right value.

When to use

Macro classification across well-defined intents (5-20 intents)
Latency is critical (less than 50ms per request)
You need hot reload of configuration without deploy
The domain has pairs of intents that get confused in known, predictable ways
You have positive and negative examples per intent, but not enough labeled data to compile an LLM-based router

Router 2: DSPy

What it is

An LLM-first router that uses a declarative Signature to classify intent and extract structured parameters in a single inference, compiled with supervised examples and an asymmetric evaluation metric.

What concretely sets it apart

The same input query produces a completely different kind of output. DSPy doesn't just classify: it extracts every parameter the search tool needs to execute the query in Redis Stack.

Doing this with embeddings is infeasible. You can't extract price_max=300.0 from a cosine vector.
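To make the difference concrete, here is the shape of output the example query from Problem B would need. This is a plain dataclass standing in for the contract, not the DSPy Signature itself, and every field name except scope's value busca_catalogo and price_max is a hypothetical choice for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RouterOutput:
    """Illustrative output contract: one scope decision plus structured filters."""
    scope: str
    tool: str
    query: str
    brand: Optional[str] = None
    gender: Optional[str] = None
    price_max: Optional[float] = None
    free_shipping: Optional[bool] = None
    min_rating: Optional[float] = None  # "good rating" needs a business rule to become a number


# "men's Nike sneakers up to $300 with free shipping and good rating"
expected = RouterOutput(
    scope="busca_catalogo",
    tool="search_products",  # hypothetical tool name
    query="sneakers",
    brand="Nike",
    gender="male",
    price_max=300.0,
    free_shipping=True,
)

print(asdict(expected))
```

An embedding router would stop at the scope field. Everything below it is extraction, which is why this layer needs a model that reads the sentence.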

The asymmetric metric: where business priority becomes code

With a metric like this, BootstrapFewShot rejects any demo with the wrong scope, regardless of how many other fields were right. That's infeasible to implement in an embedding-based semantic router: cosine similarity has no notion of scope as a zero-tolerance criterion.
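A sketch of what such an asymmetric metric can look like. It follows the DSPy convention of metric(example, prediction, trace=None), where a non-None trace signals bootstrapping and the return value decides whether a demo is kept. The field list and the BOOTSTRAP_MIN_SCORE cutoff are assumptions for the example, not the real metric.

```python
from types import SimpleNamespace

# Hypothetical parameter fields scored after the scope check.
PARAM_FIELDS = ("tool", "brand", "price_max", "free_shipping")
BOOTSTRAP_MIN_SCORE = 0.75  # assumed cutoff for accepting a demo


def asymmetric_metric(example, pred, trace=None):
    """Scope is zero-tolerance; the remaining fields earn partial credit."""
    if getattr(pred, "scope", None) != example.scope:
        # Wrong scope: nothing else counts, however many fields matched.
        return False if trace is not None else 0.0
    matched = sum(
        1 for field in PARAM_FIELDS
        if getattr(pred, field, None) == getattr(example, field, None)
    )
    score = matched / len(PARAM_FIELDS)
    if trace is not None:
        # Bootstrapping mode: return a boolean accept/reject for the demo.
        return score >= BOOTSTRAP_MIN_SCORE
    return score


gold = SimpleNamespace(scope="busca_catalogo", tool="search", brand="Nike",
                       price_max=300.0, free_shipping=True)
wrong_scope = SimpleNamespace(scope="suporte", tool="search", brand="Nike",
                              price_max=300.0, free_shipping=True)
partial = SimpleNamespace(scope="busca_catalogo", tool="search", brand=None,
                          price_max=300.0, free_shipping=True)
```

The asymmetry is the early return: a prediction with four correct fields and the wrong scope scores lower than one with the right scope and a missing brand.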

How DSPy solves the calibration problem

In DSPy, the threshold concept doesn't exist the same way, because the decision isn't "is the score above X?". The decision is "what's the best RouterOutput given the text, history, and compiled demos?".

What works as calibration is the metric itself with BootstrapFewShot. The optimizer, by selecting which demos to include in the compiled prompt, is implicitly calibrating the router's behavior to the domain - without you defining a threshold explicitly. If you add new examples to the dataset and recompile, the model recalibrates with the new data. The limitation: recompiling has LLM cost. It's not an operation you do as hot reload in production. It's offline, controlled, with validation before deploy.

Coercion: protection against LLM unpredictability

The semantic router never needs this. Embedding + dot product always returns a float. An LLM can return markdown, malformed JSON, or free text. DSPy handles this via coercion rules in the compiled prompt.
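To show the kind of mess being defended against, here is a defensive parser written by hand. It is not how DSPy does it internally; the function names, the two fields handled, and the out_of_scope fallback are assumptions for illustration.

```python
import json
import re
from typing import Any, Dict, Iterable, Optional

# Matches a markdown code fence (three backticks), optionally tagged "json".
_FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.S | re.I)


def extract_json_object(raw: str) -> Optional[Dict[str, Any]]:
    """Pull the first JSON object out of fenced, prefixed, or plain model output."""
    match = _FENCE.search(raw)
    candidate = match.group(1) if match else raw
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(candidate[start:end + 1])
    except ValueError:
        return None
    return parsed if isinstance(parsed, dict) else None


def coerce_price(value: Any) -> Optional[float]:
    """Accept 300, 300.0, "300", "$300" or "1,200.50"; anything else becomes None."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        match = re.search(r"\d+(?:\.\d+)?", value.replace(",", ""))
        return float(match.group()) if match else None
    return None


def coerce_router_output(raw: str, allowed_scopes: Iterable[str],
                         fallback_scope: str = "out_of_scope") -> Dict[str, Any]:
    """Never raises: unparseable output degrades to the fallback scope."""
    data = extract_json_object(raw) or {}
    scope = data.get("scope")
    if not isinstance(scope, str) or scope not in set(allowed_scopes):
        scope = fallback_scope
    return {"scope": scope, "price_max": coerce_price(data.get("price_max"))}
```

The point of the sketch is the contract at the boundary: whatever the model returns, the caller gets a dict with a valid scope and typed fields.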

When to use

Routing requires structured parameter extraction beyond intent classification
You have a supervised dataset (50+ examples already help)
You need portability across models (swap GPT-4o for Claude without rewriting)
Router output needs a verifiable, versionable contract
2-3s latency per call is acceptable

Router 3: Aurélio AI semantic-router (evaluated, not implemented)

Before building the custom router, I evaluated the open-source semantic-router library from Aurélio AI, a group that includes part of the library's original team. The API is clean and the mental model is the same: utterances per route, configurable encoder, cosine similarity. The library works. What made me build the custom one were three gaps specific to my use case:

No configurable pairwise reranker: for pairs of intents that get confused in the domain (busca_catalogo vs detalhe_produto when the user mentions a specific product), there's no structured mechanism to inject tie-breaking logic. Dynamic Routes (with LLM) are the alternative, but they add latency I didn't want on the critical path.
No hot reload of configuration: to calibrate thresholds, add FAQ force terms, or adjust ambiguous intent behavior in production, I'd need a new deploy. CFG_PARAMS in Redis was a project requirement.
Reduced visibility into the decision pipeline: top-k with scores for all candidates, the pairwise reranker with its rule, the disambiguation_rule in the structured log... these items matter for debugging degradation in production. The library returns RouteChoice with similarity_score. Enough for many cases, not for the auditability level I needed.
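For reference, this is roughly the kind of structured log record I mean in the third point. The field names other than disambiguation_rule are assumptions for the sketch, and a real record would carry more context (request id, encoder version, thresholds in effect).

```python
import json
from typing import List, Optional, Tuple


def decision_log(query: str, scored: List[Tuple[str, float]], chosen: str,
                 disambiguation_rule: Optional[str], top_k: int = 3) -> str:
    """Serialize one routing decision as a single JSON log line."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
    record = {
        "query": query,
        "chosen": chosen,
        # Scores for every top-k candidate, not only the winner.
        "top_k": [{"intent": name, "score": round(score, 4)} for name, score in ranked],
        # Which pairwise rule, if any, overrode the embedding ranking.
        "disambiguation_rule": disambiguation_rule,
    }
    return json.dumps(record, ensure_ascii=False, sort_keys=True)
```

With the runner-up scores in the log, a rising fallback rate can be traced to a specific intent pair drifting closer together instead of being guessed at.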

What the library solves better: automatic threshold calibration

Where Aurélio AI has a concrete advantage over the custom one is exactly the calibration problem above. The library has an automatic fitting method via labeled dataset. Fitting uses similarity scores themselves to find the optimal cutoff per intent, no manual trial and error. When the encoder changes, you run fitting again with the same dataset and the thresholds recalibrate automatically.

The limitation: fitting is only as good as the dataset. If it doesn't cover the domain's edge cases, the thresholds will be optimal for what you tested and silently bad for what you didn't.
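The idea behind fitting can be sketched in a few lines. This is a simplified stand-in that maximizes accuracy per intent over candidate cutoffs, not the library's actual algorithm, and the function names and sample data are illustrative.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, bool]  # (similarity score for the intent, query truly belongs to it)


def fit_threshold(samples: List[Sample]) -> float:
    """Pick the cutoff that classifies the most labeled samples correctly."""
    if not samples:
        raise ValueError("need at least one labeled sample")
    scores = sorted({score for score, _ in samples})
    # Candidates: accept everything, each midpoint between scores, reject everything.
    candidates = (
        [scores[0]]
        + [(a + b) / 2 for a, b in zip(scores, scores[1:])]
        + [scores[-1] + 1e-9]
    )

    def correct(threshold: float) -> int:
        return sum(1 for score, label in samples if (score >= threshold) == label)

    # On ties, prefer the higher (more conservative) threshold.
    return max(candidates, key=lambda t: (correct(t), t))


def fit_thresholds(dataset: Dict[str, List[Sample]]) -> Dict[str, float]:
    """Fit one threshold per intent from a labeled dataset."""
    return {intent: fit_threshold(samples) for intent, samples in dataset.items()}
```

The limitation described above is visible in the code: the candidates are built only from scores present in the dataset, so a region of the score range with no labeled samples has no influence on the result.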

For my case, that gain didn't compensate for the gaps in the other three points. For a project with no hot-reload requirement and semantically well-separated intents, it can be enough.

When it makes sense

Quick adoption with simple, semantically well-separated intents
Low volume and few confusable intent pairs
Automatic threshold calibration via dataset is a higher priority than pipeline control
Small team without capacity to maintain a custom routing pipeline
Dynamic Routes are sufficient for edge cases

Calibration comparison

The threshold calibration problem appears in different forms across the three routers. The custom semantic router puts the calibration work on the engineer. Aurélio AI automates it but still depends on training-set quality. DSPy eliminates it as an explicit problem, but turns it into another: maintaining a supervised dataset and a recompilation process. None of the three is maintenance-free. The difference is where the work lives.

How the two coexist in production

In the assistant, the two routers exist in different layers:

Layer 1 - Custom semantic router in the supervisor: decides whether the message is catalog search, product detail, support, order tracking, or out of scope. Macro classification, no LLM, extremely low latency.

Layer 2 - DSPy router inside the search subgraph: when Layer 1 routed to busca_catalogo, DSPy decides which tool to call and with which parameters. This is where "men's Nike sneakers up to $300 with free shipping" becomes structured parameters for search execution. The split isn't by preference, it's by problem. Layer 1 needs speed and hot reload. Layer 2 needs structured extraction with a verifiable contract.
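The layering can be pictured as the sketch below, with both routers passed in as plain callables. The function name and the return shape are assumptions; the real system wires this through the supervisor and the search subgraph.

```python
from typing import Any, Callable, Dict, Optional


def handle_message(text: str,
                   macro_route: Callable[[str], str],
                   extract_params: Callable[[str], Dict[str, Any]]) -> Dict[str, Any]:
    """Layer 1 always runs; Layer 2 (the LLM call) runs only for catalog search."""
    intent = macro_route(text)  # Layer 1: embedding-based, no LLM
    params: Optional[Dict[str, Any]] = None
    if intent == "busca_catalogo":
        params = extract_params(text)  # Layer 2: structured extraction
    return {"intent": intent, "params": params}
```

A side effect of the split is cost control: messages that Layer 1 sends to support or order tracking never pay for the LLM call.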

Trade-offs the table doesn't capture

Silent degradation: the custom semantic router degrades when utterances become outdated relative to actual user vocabulary. You detect it through rising fallback rates. DSPy degrades when the provider's model is updated; you detect it by running the validation set before deploy. Aurélio AI degrades the same way as the custom one, but you have fit() to recalibrate.
Real operational cost: the custom semantic router pays for one embedding per request (cheap and predictable). DSPy pays for embedding + LLM, but extracting parameters in the same inference eliminates the subsequent LLM call to parse the query - net cost is sometimes comparable. Aurélio AI has the same cost as the custom one outside Dynamic Routes.
Knowledge transfer: the custom semantic router has many lines of decision pipeline. Each line is a decision with a reason. When someone new joins the project, that context has to be transferred explicitly. DSPy has a compiled JSON and a dataset; the "why" is in the examples and weights. Aurélio AI has public documentation.

Summary

Custom semantic router: speed, full pipeline control, and hot reload. The price is manual threshold calibration and high maintenance cost.
DSPy: when the routing decision needs structured parameter extraction, with verifiable contract, model portability, and implicit calibration via compilation.
Aurélio AI: when you want the semantic router mental model without building the pipeline from scratch, with automatic calibration via fit(). Evaluate whether the absence of hot reload and pairwise reranker is acceptable for your use case before choosing.