dspy.Refine: runtime self-correction without recompiling the model
DSPy outside offline mode: generate, evaluate against a reward function, and if it fails, try again - before critique_node steps in.
In the DSPy posts I showed how BootstrapFewShot compiles a router that classifies intent and extracts structured parameters. That's DSPy in offline mode: you compile once, version the JSON, and use it in production.
There's another way to use DSPy that doesn't involve compilation: dspy.Refine. It's runtime self-correction: the module generates a response, a reward function scores it, and if the score doesn't pass, the module tries again.
This post is about how the assistant's response generation module uses dspy.Refine to ensure output conformance without depending on the prompt alone.
The problem Refine solves
The generation module receives the context returned by tools (catalog data, product details) and produces the final response for the user.
With a simple dspy.Predict, the flow is: LLM receives context → generates response → delivers.
If the response violates a domain rule, such as using a buy imperative, omitting a mandatory disclaimer, or citing data that isn't in the context, critique_node will intervene afterward. That adds latency and may still result in a fallback message if the violation is severe.
With dspy.Refine, the violation is detected and corrected inside the generation module itself, before reaching critique_node. The critique still exists as the final guardrail, but it receives responses that already passed through a self-correction layer.
The reward function: compliance as a runtime metric
The reward function is the heart of Refine. It evaluates the generated response and returns a score between 0.0 and 1.0. If the score is below the configured threshold, Refine generates a new response.
The reward function checks four rules, and each one contributes 0.25 to the score. With threshold=1.0, a single failed rule is enough to drop the score below the threshold, so Refine generates a new response, and DSPy automatically injects feedback about which rule failed.
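A minimal sketch of a reward function in this shape, written as a plain function that takes the call arguments and the prediction, which is the form dspy.Refine expects for reward_fn. The four rules, the regex, the disclaimer text, the length limit, and the field names `context` and `response` are all assumptions for illustration; the post names three kinds of violation but not the exact checks.

```python
import re
from types import SimpleNamespace

# Hypothetical rule inputs, chosen only for illustration
BUY_IMPERATIVES = re.compile(r"\b(buy now|invest now|you should buy)\b", re.IGNORECASE)
DISCLAIMER = "This is not an investment recommendation."
MAX_CHARS = 1200  # assumed fourth rule: keep the response short


def _numbers(text):
    """Extract numeric tokens such as 12, 12.5 or 110%."""
    return set(re.findall(r"\d+(?:[.,]\d+)?%?", text))


def compliance_reward(args, pred):
    """Score a response in [0.0, 1.0]; each passing rule adds 0.25."""
    response = pred.response
    context = args.get("context", "")
    rules = [
        not BUY_IMPERATIVES.search(response),     # no buy imperative
        DISCLAIMER in response,                    # mandatory disclaimer present
        _numbers(response) <= _numbers(context),   # every figure appears in the context
        len(response) <= MAX_CHARS,                # assumed length limit
    ]
    return 0.25 * sum(rules)


# SimpleNamespace stands in for the prediction object here
args = {"context": "CDB X yields 110% of CDI."}
ok = SimpleNamespace(response="CDB X yields 110% of CDI. " + DISCLAIMER)
print(compliance_reward(args, ok))  # 1.0
```

A function like this would be passed as reward_fn when constructing dspy.Refine, alongside N and threshold. Because every rule is a cheap local check, scoring adds almost nothing to latency compared with the LLM call it guards.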
The module: Refine over a Predictor with attempt counter
The attempt counter exists for two reasons. The first is Langfuse: with advisor_refine_attempt and advisor_total_refine_attempts in the trace metadata, you know exactly in what proportion of requests Refine needed more than one call. The second is alerting: if total_attempts > 1 shows up in more than 20% of traces, the Signature's prompt needs adjustment.
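One way to get such a counter, sketched here and not necessarily how the assistant's module does it: wrap the reward function and count its calls, on the assumption that Refine calls the reward function once per attempt. The class name `AttemptCounter` is hypothetical, and so is the reading of `advisor_refine_attempt` as the 1-based index of the best-scoring attempt.

```python
class AttemptCounter:
    """Wraps a reward function and records one score per Refine attempt."""

    def __init__(self, reward_fn):
        self.reward_fn = reward_fn
        self.scores = []

    def __call__(self, args, pred):
        # Assumption: the reward function is called exactly once per attempt
        score = self.reward_fn(args, pred)
        self.scores.append(score)
        return score

    def trace_metadata(self):
        # Key names come from the post; attaching them to the trace is left to the caller
        if not self.scores:
            return {"advisor_refine_attempt": 0, "advisor_total_refine_attempts": 0}
        best_index = self.scores.index(max(self.scores))  # first highest score
        return {
            "advisor_refine_attempt": best_index + 1,
            "advisor_total_refine_attempts": len(self.scores),
        }
```

The counter instance would be passed as reward_fn in place of the bare function. Creating a fresh counter per request keeps counts from leaking between concurrent requests.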
The latency trade-off that needs monitoring
dspy.Refine(N=3, threshold=1.0) can make up to 3 generation calls per request, plus the feedback call DSPy makes after each failed attempt.
In practice, most requests pass on the first attempt, because the Signature prompt was already written to avoid the most common violations. But there are cases where the context returned by tools contains data that induces the LLM to generate phrases that fail one of the rules.
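To put a rough number on that trade-off, here is a back-of-the-envelope sketch of the expected number of generation calls. It assumes each attempt passes independently with the same probability, which is optimistic: a context that induces a violation once will likely induce it again. The function name and the example pass rate are illustrative, not measured values.

```python
def expected_generation_calls(p_pass, n=3):
    """Expected generation calls when each attempt passes with probability p_pass."""
    # Attempt k+1 only runs if the previous k attempts all failed
    return sum((1 - p_pass) ** k for k in range(n))


# With a 95% first-attempt pass rate and N=3, the average stays close to one call
print(round(expected_generation_calls(0.95), 4))  # 1.0525
```

The average cost stays low as long as the first-attempt pass rate is high. The tail is what hurts, because a request that needs all three attempts pays roughly three times the generation latency.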
Monitoring in Langfuse filters by advisor_total_refine_attempts > 1:
- If that filter returns less than 5% of traces: Refine is working as a safety net, not habitual correction - all good.
- If it returns between 5% and 20%: investigate which rules fail most often. Likely the Signature needs more explicit examples for those rules.
- If it returns more than 20%: Refine is becoming habitual correction instead of a safety net. That's a sign the Signature or prompt is misaligned with the data coming from tools.
The answer to high rates isn't to increase N - it's to improve the prompt so the first attempt passes, and use Refine only for the residual.
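The bands above can be turned into a small check. This sketch assumes the advisor_total_refine_attempts values have already been exported from Langfuse as a list of integers, one per trace; the function name and the returned labels are hypothetical.

```python
def refine_health(total_attempts_per_trace):
    """Classify the share of traces where Refine needed more than one attempt."""
    if not total_attempts_per_trace:
        raise ValueError("no traces to evaluate")
    retried = sum(1 for n in total_attempts_per_trace if n > 1)
    rate = retried / len(total_attempts_per_trace)
    if rate < 0.05:
        return rate, "safety net"
    if rate <= 0.20:
        return rate, "investigate failing rules"
    return rate, "habitual correction"


# 3 retried traces out of 100
print(refine_health([1] * 97 + [2] * 3))  # (0.03, 'safety net')
```

Running this over a sliding window of recent traces, not over all history, makes a regression show up soon after a prompt or tool change.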
Refine vs critique_node: different layers, not redundant
A natural question: if dspy.Refine exists in generation, why is critique_node still there?
Refine evaluates the generated response against local rules - those in the reward function. critique_node goes further:
- Detects hallucinations by comparing the response to actual context returned by tools
- Applies regex for violations Refine doesn't cover (e.g. inappropriate FGC disclaimer for certain products)
- Has its own Circuit Breaker for DSPy infrastructure failures
Refine is the first barrier, inside the generation module. critique_node is the final barrier, outside it. They're complementary layers, not redundant.
Refine improves the critique pass rate, and critique guarantees nothing slips through even when Refine fails or is bypassed.
Next week: QueryBuilder - how to build FT.SEARCH queries from a Pydantic model, with escaping of special characters and fuzzy matching with tokenization rules.