Observability in a LangGraph graph: what Langfuse sees that the log doesn't

Logs cover what happened inside each node. They don't answer 'did the fallback rate climb in the last 30 minutes?' For that, you need Langfuse.

8 APR 2026·4 min read·Observability / Langfuse / LangGraph / DSPy

In previous posts I showed how DSPy classifies intent and how the custom embedding-based router decides macro intent. Both share one trait: they can degrade silently.

DSPy degrades when the provider updates the model. The semantic router degrades when user vocabulary changes. Neither raises an exception when this happens. You discover it through worse response quality, or through user complaints.

Observability is what changes that. This post covers instrumenting a LangGraph graph with Langfuse so you can see exactly where each decision happened and detect degradation before it reaches the user.

What structured logging doesn't solve

The structured logger covers what happened inside each node. It doesn't answer questions like:

Which scope did DSPy route to in the last 500 queries? What's the distribution?
Did the fallback rate to geral climb in the last 30 minutes?
How many responses did critique_node intervene in today? By which violation type?
In how many requests did dspy.Refine need more than one attempt?

These questions require an aggregated view across multiple executions, not the log of a single execution. That's where Langfuse comes in.

Base instrumentation: @observe on nodes

The simplest entry point is the @observe decorator, applied to each graph node with input and output capture turned off.
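A minimal sketch of that decorator usage, under a few assumptions: the import path is the one used by version 3 of the Langfuse Python SDK (older versions expose the decorator under langfuse.decorators), the node body and the "catalog_search" scope name are hypothetical, and a pass-through stand-in replaces the decorator when the SDK is not installed so the example still runs.

```python
from __future__ import annotations

from typing import Any, Callable, Optional

try:
    # Langfuse Python SDK v3 exposes the decorator at the package root.
    from langfuse import observe
except ImportError:
    # Stand-in for environments without the SDK: it traces nothing and
    # simply returns the wrapped function unchanged.
    def observe(*, name: Optional[str] = None, capture_input: bool = True,
                capture_output: bool = True) -> Callable:
        def wrap(fn: Callable) -> Callable:
            return fn
        return wrap


# capture_input/capture_output are disabled so the graph state
# (conversation history, session identifiers) never reaches the trace.
@observe(name="router_node", capture_input=False, capture_output=False)
def router_node(state: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical node: returns only the routing decision as a state update."""
    query = state["query"].lower()
    scope = "catalog_search" if "catalog" in query else "geral"
    return {"scope": scope}


if __name__ == "__main__":
    print(router_node({"query": "Show me the catalog"}))
```

The node still receives and returns ordinary state; only what the span records changes.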

capture_input=False and capture_output=False aren't paranoia. In any system with user data, capturing the full graph state in Langfuse means capturing conversation history, session identifiers, and potentially personal data. What you want in the trace isn't the state - it's the decision metadata.

update_trace_metadata: injecting context into the active trace

Inside any node, update_trace_metadata accumulates metadata in the current request's trace.
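The accumulation can be sketched with the standard library alone. This is an illustration, not the post's implementation: the keyword-argument signature of update_trace_metadata, the start_trace and current_trace_metadata helpers, the node bodies, and every field name are assumptions. The call that forwards the fields to Langfuse is left out because it differs between SDK versions.

```python
from __future__ import annotations

import contextvars
from typing import Any

# One metadata dict per request. A ContextVar keeps concurrent requests
# (threads or asyncio tasks) from writing into each other's trace.
_trace_metadata = contextvars.ContextVar("trace_metadata", default=None)


def start_trace() -> None:
    """Hypothetical helper: open an empty metadata dict for this request."""
    _trace_metadata.set({})


def update_trace_metadata(**fields: Any) -> None:
    """Merge decision metadata into the active request's trace."""
    current = _trace_metadata.get()
    if current is None:
        return  # no active trace: tracing must never break a node
    current.update(fields)
    # A real implementation would also forward `fields` to the Langfuse SDK.


def current_trace_metadata() -> dict[str, Any]:
    """Return a copy of everything accumulated so far."""
    return dict(_trace_metadata.get() or {})


def anaphora_node(state: dict[str, Any]) -> dict[str, Any]:
    update_trace_metadata(anaphora_resolved=True)
    return {}


def router_node(state: dict[str, Any]) -> dict[str, Any]:
    update_trace_metadata(scope="catalog_search", fallback=False)
    return {}


if __name__ == "__main__":
    start_trace()
    anaphora_node({})
    router_node({})
    print(current_trace_metadata())
```

Only decision metadata passes through the helper; the state itself is never handed to it.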

Each node contributes its metadata to the same trace. In Langfuse, you see the full trace of a request with all those fields aggregated, without cross-referencing logs from different workers.

Tags and user_id: segmentation without personal data

Tags and the user identifier are set in the supervisor, when it assembles the graph configuration.
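A sketch of what that configuration might look like. The langfuse_user_id key comes from this post; the langfuse_tags key is assumed to follow the same metadata convention of the Langfuse callback integration. The function name, the tag format, and the example values are hypothetical.

```python
from __future__ import annotations

from typing import Any, Optional


def build_graph_config(
    internal_user_id: str,
    channel: str,
    environment: str,
    degraded: bool,
    callbacks: Optional[list] = None,
) -> dict[str, Any]:
    """Assemble the config dict passed to the graph invocation.

    `internal_user_id` must be an opaque internal identifier,
    never a national ID, email, or phone number.
    """
    tags = [f"channel:{channel}", f"env:{environment}"]
    if degraded:
        tags.append("degraded")
    return {
        # The Langfuse callback handler would be passed in here.
        "callbacks": list(callbacks or []),
        "metadata": {
            "langfuse_user_id": internal_user_id,
            "langfuse_tags": tags,  # assumed key name, see lead-in
        },
    }


if __name__ == "__main__":
    print(build_graph_config("u_8f3a", "web", "prod", degraded=True))
```

Keeping the tag vocabulary small (channel, environment, degraded) keeps the dashboard filters usable.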

The langfuse_user_id is an internal identifier - never a national ID, email, or phone number. Tags let you filter traces by channel, environment, and degraded mode in the dashboard without exposing personal data.

RouterMetricsCollector: in-process metrics with Redis backend

Langfuse tracks individual executions. For aggregated real-time metrics - fallback rate, scope distribution, anaphora hits - the system has its own collector.
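The in-process half of such a collector could look like the sketch below. Only the class name and the counter names anaphora_hit_count and coercion_fallback_count come from this post; the method names and the "scope:" key prefix are assumptions.

```python
from __future__ import annotations

import threading
from collections import Counter


class RouterMetricsCollector:
    """In-process counters for routing decisions (illustrative sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Counter = Counter()

    def record(self, scope: str, anaphora_hit: bool = False,
               coercion_fallback: bool = False) -> None:
        """Record the outcome of one routing decision."""
        with self._lock:
            self._counts["total"] += 1
            self._counts[f"scope:{scope}"] += 1
            if anaphora_hit:
                self._counts["anaphora_hit_count"] += 1
            if coercion_fallback:
                self._counts["coercion_fallback_count"] += 1

    def fallback_rate(self) -> float:
        """Share of decisions that fell back to scope 'geral'."""
        with self._lock:
            total = self._counts["total"]
            return self._counts["scope:geral"] / total if total else 0.0

    def scope_distribution(self) -> dict[str, int]:
        """Decision count per scope."""
        with self._lock:
            return {key.split(":", 1)[1]: value
                    for key, value in self._counts.items()
                    if key.startswith("scope:")}


if __name__ == "__main__":
    collector = RouterMetricsCollector()
    collector.record("geral")
    collector.record("catalog_search", anaphora_hit=True)
    print(collector.fallback_rate(), collector.scope_distribution())
```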

The Redis backend uses HINCRBY, which suits multi-worker deploys: each process keeps its own in-process counter, but all of them write to the same HASH.

The increment is fire-and-forget: it doesn't block router_node's hot path. If Redis is unavailable, the in-process counter still works, and the fallback alert still fires.
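One way to get that fire-and-forget behaviour is to push the HINCRBY call to a background thread and swallow its failures. In this sketch the class name, the hash key, and the method names are hypothetical; the client is any object with a redis-py style hincrby(name, key, amount) method.

```python
from __future__ import annotations

import logging
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("router_metrics")


class RedisCounterBackend:
    """Local counter plus best-effort HINCRBY on a shared hash (sketch)."""

    def __init__(self, client, hash_key: str = "router:metrics") -> None:
        self._client = client
        self._hash_key = hash_key
        self._local: Counter = Counter()
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=1)

    def increment(self, field: str, amount: int = 1) -> None:
        # The in-process counter is updated first and cannot fail on Redis.
        with self._lock:
            self._local[field] += amount
        # submit() returns immediately, so the hot path never waits on Redis.
        self._pool.submit(self._push, field, amount)

    def _push(self, field: str, amount: int) -> None:
        try:
            self._client.hincrby(self._hash_key, field, amount)
        except Exception:
            # Redis being down must not affect routing or local alerts.
            logger.warning("metrics push failed for %s", field, exc_info=True)

    def local_count(self, field: str) -> int:
        with self._lock:
            return self._local[field]

    def close(self) -> None:
        """Wait for pending pushes, e.g. on worker shutdown."""
        self._pool.shutdown(wait=True)
```

Because HINCRBY is atomic on the server, workers need no coordination beyond agreeing on the hash key.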

What to monitor to detect DSPy degradation

In router_node, the collector records the outcome after each inference.

The three indicators that matter most:

Fallback rate to geral: when DSPy can't classify the query, it returns scope="geral". A rate above 25% signals that compiled demos are outdated relative to user vocabulary. Fix: expand the dataset and recompile.
Anaphora resolution rate: if anaphora_hit_count/total drops abruptly, the regex patterns of the anaphora resolver stopped matching user messages. Fix: review the patterns.
Forced coercion rate: if coercion_fallback_count rises, the LLM started returning output formats the coercion layer can't parse. Fix: review the output of dspy.inspect_history() and revise the Signature.
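The three indicators above can be checked by comparing the current window of counters against the previous one. In this sketch the 25% fallback threshold is the one stated in the list; the drop and rise ratios for the other two indicators, and all names, are illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass

FALLBACK_RATE_THRESHOLD = 0.25  # from the text above


@dataclass
class RouterWindow:
    """Counters observed over one time window."""
    total: int
    fallback_count: int
    anaphora_hit_count: int
    coercion_fallback_count: int


def _rate(count: int, total: int) -> float:
    return count / total if total else 0.0


def degradation_alerts(
    current: RouterWindow,
    previous: RouterWindow,
    anaphora_drop_ratio: float = 0.5,  # assumed: alert if rate halves
    coercion_rise_ratio: float = 2.0,  # assumed: alert if rate doubles
) -> list[str]:
    """Return the names of the indicators that look degraded."""
    alerts = []

    if _rate(current.fallback_count, current.total) > FALLBACK_RATE_THRESHOLD:
        alerts.append("fallback_rate")

    prev_anaphora = _rate(previous.anaphora_hit_count, previous.total)
    cur_anaphora = _rate(current.anaphora_hit_count, current.total)
    if prev_anaphora > 0 and cur_anaphora < prev_anaphora * anaphora_drop_ratio:
        alerts.append("anaphora_rate_drop")

    prev_coercion = _rate(previous.coercion_fallback_count, previous.total)
    cur_coercion = _rate(current.coercion_fallback_count, current.total)
    if cur_coercion > 0 and cur_coercion > prev_coercion * coercion_rise_ratio:
        alerts.append("coercion_rate_rise")

    return alerts


if __name__ == "__main__":
    previous = RouterWindow(1000, 100, 200, 5)
    current = RouterWindow(1000, 300, 50, 20)
    print(degradation_alerts(current, previous))
```

Comparing rates rather than raw counts keeps the alerts stable when traffic volume changes between windows.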

What Langfuse shows that the log doesn't

With update_trace_metadata on every node, Langfuse aggregates every node's decision metadata into a single trace.

In 5 seconds of execution you know: the anaphora was resolved, DSPy routed to catalog search, Refine needed 2 attempts, and critique injected a missing disclaimer.
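To make that reading concrete, the merged metadata of such a request can be pictured as one flat dict. The field names and value spellings below are hypothetical; only the four facts they encode come from the paragraph above.

```python
from __future__ import annotations

from typing import Any


def summarize_trace(metadata: dict[str, Any]) -> str:
    """Render merged trace metadata as a one-line summary."""
    parts = []
    if metadata.get("anaphora_resolved"):
        parts.append("anaphora resolved")
    if "scope" in metadata:
        parts.append(f"routed to {metadata['scope']}")
    attempts = metadata.get("refine_attempts")
    if attempts is not None:
        parts.append(f"Refine attempts: {attempts}")
    violation = metadata.get("critique_violation")
    if violation:
        parts.append(f"critique intervened: {violation}")
    return "; ".join(parts)


# Hypothetical merged metadata for the request described above.
trace_metadata = {
    "anaphora_resolved": True,
    "scope": "catalog_search",
    "refine_attempts": 2,
    "critique_violation": "missing_disclaimer",
}

if __name__ == "__main__":
    print(summarize_trace(trace_metadata))
```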

Without that trace, you have 5 logs in different files, with no direct correlation, no timeline.

Important items

capture_input=False on every span that touches user messages. capture_output=False on every span that returns graph state.
What goes to Langfuse is decision metadata - scope, selected tool, latency, compliance flags - not the data itself.
User data belongs in the legal archive. Decision metadata belongs in the observability system. Mixing the two creates privacy issues and inflates Langfuse cost without adding debugging value.

Next week: how the generation module uses runtime self-correction with a reward function, and the latency trade-off it creates.