DSPy, the framework that treats prompts as compilable code, not as strings

Instead of writing prompts, you program declarative modules and let the framework compile optimized prompts - based on data, metrics, and the model you're using.

14 MAR 2026·6 min read·DSPy / Prompt Engineering / LLM / Stanford NLP

You know the drill: a model version is about to be deprecated, or you have to switch providers, and suddenly every prompt needs adjusting before it works again. Easing that pain is one of DSPy's purposes.

The problem DSPy solves

Most LLM applications today depend on manual prompt engineering. You write a long string, test, tweak a word, test again, add an example or two, pray, and keep adjusting until it works. Then the model changes versions, or you switch providers, or the use case grows beyond the scenarios you tested, and the cycle starts over.

DSPy (dspy.ai) proposes a shift: instead of writing prompts, you program declarative modules and let the framework compile optimized prompts automatically, based on data, metrics, and the model you're using.

The analogy the Stanford NLP team itself uses: DSPy is to LLMs what PyTorch is to neural networks. You define the architecture (modules), define the loss (metric), and the compiler (optimizer) figures out the best weights - which in the case of LLMs are instructions and few-shot examples.

Who created it and where it lives

DSPy grew out of research that began at the Stanford NLP Lab in February 2022. It evolved from the DSP project (Demonstrate-Search-Predict) and took its current shape in October 2023. The official docs are at dspy.ai, the source is at github.com/stanfordnlp/dspy, and there is an active community on Discord.

The framework is open-source, has more than 250 contributors, and has already inspired a wave of academic work and derived tools. If you work with LLM pipelines in production, it's worth investing time understanding the concepts, even if you don't adopt the whole framework.

The 4 fundamental concepts

1. Signatures: the I/O contract

A Signature defines what the module does, not how. It's the declarative interface between your code and the LLM.

In its simplest form:

"question -> answer"

Or with types and descriptions:

The key point: you don't write the prompt. The Signature describes the contract, and DSPy assembles the prompt automatically - including instructions, formatting, and few-shot examples if compiled.

The Signature's docstring is more important than it looks. It becomes part of the prompt sent to the LLM. Explicit rules there directly influence classification.

2. Modules: the execution strategy

Modules define how the LLM should process the Signature. The main ones:

dspy.Predict: direct call, no intermediate reasoning;
dspy.ChainOfThought: forces the model to reason step by step before answering;
dspy.ReAct: an agent that can use tools (search, code, APIs) in a loop;
dspy.Retrieve: retrieval from a configured retriever (for example a vector store), integrated into the pipeline.

You compose modules in regular Python classes:

This is code. Testable, debuggable, versionable. Not a 200+ line string pasted into a template.

3. Optimizers: the prompt compiler

This is where DSPy differs most from manual prompt engineering. Optimizers (formerly called Teleprompters) compile your program by automatically generating the best instructions and few-shot examples for each module.

The main ones:

BootstrapFewShot: uses your own program as the teacher. It runs trainset examples, validates with your metric, and keeps the passing ones as few-shot demos in the compiled prompt;
BootstrapFewShotWithRandomSearch: same as above, but tests multiple combinations of demos and selects the best on the validation set;
MIPROv2: the most sophisticated. Generates instructions and few-shot examples jointly, using Bayesian Optimization to search for the best combination. Data-aware and demonstration-aware;
COPRO: focuses exclusively on optimizing instructions (no few-shot), using hill-climbing;
BootstrapFinetune: goes beyond the prompt - generates datasets and fine-tunes the model's weights.

In practice, the flow is:

The compiled program is saved as a JSON file. At serving time the module loads that JSON and uses the optimized demos on every call, with no recompiling per request.

4. Metrics: the loss function

For the optimizer to work, you need a metric that says whether a given output is good or not. It can be boolean (True/False) or numeric (0.0 to 1.0).

The optimizer uses this metric to decide which demos to keep and which instructions to generate. The more granular and representative the metric, the better the result of compilation.
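A minimal sketch of a granular metric, assuming a hypothetical classifier with intent, category, and filters fields. The weights are made up for illustration. The (example, pred, trace=None) shape follows the DSPy metric convention, and SimpleNamespace stands in for real examples and predictions:

```python
from types import SimpleNamespace

# Hypothetical integer weights: getting the intent right matters most.
FIELD_WEIGHTS = {"intent": 5, "category": 3, "filters": 2}


def weighted_field_metric(example, pred, trace=None):
    """Score a prediction by the weighted share of fields matching the label."""
    matched = sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if getattr(example, field, None) == getattr(pred, field, None)
    )
    score = matched / sum(FIELD_WEIGHTS.values())
    if trace is not None:
        # While bootstrapping, be strict: only fully correct runs pass,
        # so only they can become few-shot demos.
        return score == 1.0
    return score


label = SimpleNamespace(intent="search", category="shoes", filters={"size": "42"})
guess = SimpleNamespace(intent="search", category="boots", filters={"size": "42"})
print(weighted_field_metric(label, guess))  # intent and filters match: 0.7
```

Returning a float during evaluation and a strict boolean during bootstrapping gives the optimizer fine-grained scores without letting partially correct runs become demos.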

What you can build with DSPy

Structured classifiers: receive free text, return validated Pydantic objects (intent, category, filters). No regex, no fragile parser.
Optimized RAG pipelines: the official docs report a gain of roughly 10% in SemanticF1 when optimizing a RAG pipeline over StackExchange data with MIPROv2.
Agents with tools: dspy.ReAct orchestrates tool calls in a loop, with automatic optimization of the agent's instructions.
Guardrails and compliance: modules that audit the output of other modules, with Signatures specific to detecting violations.
Cognitive routers: classify ambiguous queries and route to the right tool/pipeline, with structured Pydantic-validated output.

When it's worth it and when it isn't

Worth it when:

You have multiple LLM modules in a pipeline (not a simple chat);
You need structured, validated output (classification, extraction, routing);
You want portability between models (swap GPT for Claude without rewriting prompts);
You have training/validation data (even small - 50 examples already help);
You need reproducibility (the compiled JSON is a fixed, versionable artifact: loading the same file yields the same demos and instructions).

Not worth it when:

It's a simple Q&A chatbot;
You have zero evaluation data;
The framework's learning cost isn't justified by the project size;
You need byte-by-byte control over the prompt. DSPy generates the prompt for you, so if you need full control it gets in the way more than it helps.

Practical caveats the docs don't emphasize

Compiled JSON is fragile: if you rename a Signature field and don't recompile, the module loads old demos with incompatible keys. Result: silently wrong classification, no visible error.
Trainset diversity matters more than volume: if your 100 training examples cover 80% of the same query pattern, the optimizer will overfit. Rare patterns need explicit representation.
The Signature docstring is load-bearing: DSPy injects the docstring into the prompt. Vague rules generate vague classifications. Explicit rules with examples directly in the docstring influence LLM behavior.
dspy.inspect_history() is your best friend. When the module errs, the first step is to see the actual prompt sent to the LLM. Without that, you're debugging in the dark.
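Compiling costs money: each compilation makes roughly N LLM calls (N = trainset size × attempts). For example, 100 training examples with 3 attempts each is on the order of 300 calls. With GPT-4o, the bill can range from negligible to significant, depending on trainset size, prompt length, and optimizer settings.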
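To guard against the stale-demo problem from the first caveat, one option is a small check that runs before a compiled file is put into service. This is a minimal sketch, assuming the demos are available as plain dicts. The function name is hypothetical, and the default ignore list is an assumption about bookkeeping keys that bootstrapped demos may carry:

```python
def find_stale_demo_keys(demos, expected_fields, ignore=("augmented",)):
    """Return {demo_index: unexpected_keys} for demos that don't fit the fields."""
    allowed = set(expected_fields) | set(ignore)
    stale = {}
    for index, demo in enumerate(demos):
        unexpected = sorted(set(demo) - allowed)
        if unexpected:
            stale[index] = unexpected
    return stale


# Hypothetical case: the output field was renamed from "category" to "label",
# but the second demo was compiled before the rename.
demos = [
    {"ticket": "I was charged twice", "label": "billing"},
    {"ticket": "Parcel never arrived", "category": "shipping", "augmented": True},
]
print(find_stale_demo_keys(demos, expected_fields={"ticket", "label"}))
# → {1: ['category']}
```

Failing the deploy when this returns a non-empty dict turns a silent misclassification into a visible error.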

Resources to start

Official docs: dspy.ai
Source: github.com/stanfordnlp/dspy
Original paper: "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines" (Khattab et al., Stanford NLP)
Cheatsheet: dspy.ai/cheatsheet - quick reference of all modules and optimizers

In the next post I'll show how I used DSPy in practice to build a cognitive router for a conversational assistant, with the example domain switched from investments to e-commerce. It classifies ambiguous user queries into structured scopes with validated Pydantic output, and is compiled with BootstrapFewShot and a field-weighted metric.