2025-09-14 | dsl, engineering, opinion

Why prompts deserve a DSL, not f-strings

F-strings work until they don't. Once a prompt has params, defaults, techniques, and provider constraints, it stops being a string and starts being a program. Give it a grammar.

There is a moment in every LLM project — usually around month three — when somebody opens a file called prompts.py and realises it is no longer a module. It is a swamp. There are nested f-strings inside f-strings. There is a function whose entire body is one return statement of conditional string concatenation. There is a comment that says # DO NOT REMOVE THIS NEWLINE and nobody dares to test the theory.

This is the natural endpoint of treating prompts as strings. The fix is not better discipline. The fix is to admit that prompts are programs, and programs deserve a grammar.

The argument against f-strings

Start with what an f-string actually gives you. It interpolates values into a template. That is the entire feature. It does not validate that the values exist. It does not type-check them. It does not version the template. It does not distinguish the prompt from the technique applied to the prompt. It does not have an opinion about which provider you are talking to.

For a hello-world script, that is all you need. For one prompt and one model, the f-string is more than enough. The trouble starts when the prompt grows responsibilities. You add a default for one parameter. You inline a few-shot example. You decide that under some conditions the chain-of-thought should fire and under others it should not. You realise the same prompt needs to run against Anthropic when OpenAI is down. By the time you finish, the f-string has accreted a control structure around it. That control structure lives in Python or TypeScript, mixed with your business logic, untested, and undiffable in any meaningful way.
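A minimal sketch of what that accretion tends to look like. Everything here is hypothetical (the function name, its parameters, the "fallback" provider label); it is an illustration of the pattern, not code from any real project.

```python
def build_review_prompt(code, language=None, use_cot=False, provider="primary"):
    """Hypothetical prompt builder: most of the 'prompt' is now control flow."""
    # A default, expressed as an inline conditional.
    lang = language or "whatever language it appears to be"
    prompt = f"Review the following code, written in {lang}.\n"

    # A technique, expressed as conditional string concatenation.
    if use_cot:
        prompt += "Think step by step: correctness first, then security.\n"

    # A provider-specific tweak, expressed as one more branch.
    if provider == "fallback":
        prompt += "Answer in plain text only.\n"

    # Nothing checks that `code` was supplied: None is interpolated as text.
    return prompt + f"\n{code}\n"


# The f-string does not complain about a missing value.
broken = build_review_prompt(None)
print("None" in broken)
```

The function works, which is exactly the problem: the default, the technique, and the provider rule are all real decisions, and none of them is visible anywhere except in the branches.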

What you have built is a small interpreter for a prompting language, written ad-hoc in the host language, with no name and no documentation. Every team builds this interpreter eventually. Every team gets it wrong in different places.

What a DSL actually buys

A DSL is not a vibe. It is a small, focused grammar with a parser, an AST, and a runtime that walks the AST. When you write a prompt in a DSL, you stop describing “what the string looks like” and start describing “what the prompt is doing.” Those are not the same thing.

Concretely, a prompt DSL lets you name the shapes you have been informally pretending exist:

A params block, with types, defaults, and required-or-not markers. The parser refuses to ship a prompt that references an unbound name.
A body block, with template literals that interpolate the params. The literal is text; the interpolation is a typed expression.
A technique block, where Chain-of-Thought or Few-Shot or Tree-of-Thoughts is a node in the AST, not a copy-pasted recipe. You can ask the AST “which technique is this prompt using” and get an answer.
A constraints block, where maxTokens and temperature are first-class fields, not magic numbers passed to whatever SDK you happened to use.
A meta block, where the name and the version live, so a CI job can fail a PR that ships a prompt without a version bump.
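The blocks above can be sketched as a tiny AST. This is an assumption-laden illustration in Python, not the grammar or runtime of any particular DSL: the class names, the `{name}` placeholder syntax, and the `render` and `unbound_names` helpers are all invented for the example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Param:
    name: str
    type: str = "string"
    default: object = None
    required: bool = True


@dataclass
class Prompt:
    name: str                  # meta block
    version: str               # meta block
    params: list               # params block
    body: str                  # body block, using {name} placeholders (assumed syntax)
    technique: str = "none"    # technique block, e.g. "chainOfThought"
    constraints: dict = field(default_factory=dict)  # e.g. maxTokens, temperature


PLACEHOLDER = re.compile(r"\{(\w+)\}")


def unbound_names(prompt):
    """Names the body interpolates but the params block never declares."""
    declared = {p.name for p in prompt.params}
    return sorted(set(PLACEHOLDER.findall(prompt.body)) - declared)


def render(prompt, **values):
    """Refuse unbound names, apply defaults, then interpolate."""
    missing = unbound_names(prompt)
    if missing:
        raise ValueError(f"unbound names in body: {missing}")
    bound = {}
    for p in prompt.params:
        if p.name in values:
            bound[p.name] = values[p.name]
        elif not p.required:
            bound[p.name] = p.default
        else:
            raise ValueError(f"missing required param: {p.name}")
    return PLACEHOLDER.sub(lambda m: str(bound[m.group(1)]), prompt.body)


review = Prompt(
    name="code-review",
    version="1.2.0",
    params=[Param("code"), Param("language", required=False, default="Python")],
    body="Review this {language} code:\n{code}",
    technique="chainOfThought",
    constraints={"maxTokens": 800, "temperature": 0.2},
)
print(render(review, code="x = 1"))
```

The point of the sketch is that "which technique is this prompt using" is now `review.technique`, and a missing parameter is an error raised before anything reaches a model.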

None of this is exotic. It is the same separation of concerns you already enforce in your application code, applied to the strings you have been pretending are not code.

Reviewability is the killer feature

Most arguments for a DSL focus on runtime: faster, safer, more portable. Those are real. But the argument that wins inside an engineering org is reviewability.

When a prompt is an f-string, a PR review looks like this: “you changed three words in a 40-line template literal, and I cannot tell what those words do.” When a prompt is a structured document, the same PR review looks like this: “you tightened the Security step of the Chain-of-Thought, lowered the temperature from 0.4 to 0.2, and added a default for the optional language param.” The second review can be done in a hurry. The first cannot be done at all.
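As a rough sketch of why the structured version reviews better, here is a field-level diff over two prompt documents held as plain nested dicts. The document layout, the `diff_prompt` and `needs_version_bump` helpers, and the sample values are assumptions made for the example.

```python
def diff_prompt(old, new, path=""):
    """Yield readable field-level changes between two prompt documents."""
    for key in sorted(set(old) | set(new)):
        where = f"{path}.{key}" if path else key
        if key not in old:
            yield f"added {where} = {new[key]!r}"
        elif key not in new:
            yield f"removed {where}"
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            yield from diff_prompt(old[key], new[key], where)
        elif old[key] != new[key]:
            yield f"changed {where}: {old[key]!r} -> {new[key]!r}"


def needs_version_bump(old, new):
    """True when something other than the version changed but the version did not."""
    changes = [c for c in diff_prompt(old, new)
               if not c.startswith("changed meta.version")]
    return bool(changes) and old["meta"]["version"] == new["meta"]["version"]


old = {
    "meta": {"name": "code-review", "version": "1.2.0"},
    "params": {"language": {"required": False}},
    "technique": {"chainOfThought": {"Security": "Look for security issues."}},
    "constraints": {"temperature": 0.4},
}
new = {
    "meta": {"name": "code-review", "version": "1.3.0"},
    "params": {"language": {"required": False, "default": "Python"}},
    "technique": {"chainOfThought": {"Security": "Look for injection and unsafe deserialization."}},
    "constraints": {"temperature": 0.2},
}

changes = list(diff_prompt(old, new))
for line in changes:
    print(line)
```

A CI job could call something like `needs_version_bump` on the base and head versions of each prompt and fail the PR when it returns true.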

The same applies to evals. Eval harnesses want to enumerate prompts, group them by technique, swap providers, and report deltas. They cannot do any of that against a soup of conditional string concatenation. They can do all of it against an AST.
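A small sketch of what "enumerate, group, swap" means once prompts are data. The prompt records, provider labels, and helper names below are hypothetical; a real harness would load them from parsed prompt files.

```python
from collections import defaultdict

# Hypothetical prompt records, as an eval harness might see them after parsing.
prompts = [
    {"name": "code-review", "technique": "chainOfThought"},
    {"name": "summarise", "technique": "fewShot"},
    {"name": "triage", "technique": "chainOfThought"},
]


def group_by_technique(prompts):
    """Map each technique to the names of the prompts that use it."""
    groups = defaultdict(list)
    for p in prompts:
        groups[p["technique"]].append(p["name"])
    return dict(groups)


def eval_matrix(prompts, providers):
    """Every (prompt, provider) pair the harness would need to run."""
    return [(p["name"], provider) for p in prompts for provider in providers]


print(group_by_technique(prompts))
print(len(eval_matrix(prompts, ["provider-a", "provider-b"])))
```

Neither function is clever. They are only possible because the technique is a field rather than a sentence buried in a string.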

“But I already have a templating engine”

You do. Jinja2 is excellent. Mustache is fine. Handlebars works. But templating engines solve a different problem. They render text from a model. They do not have opinions about params, techniques, constraints, or providers. You can build all of that on top of a templating engine — and you will, badly, because every team’s “thin wrapper on top of Jinja” ends up different.

A prompt DSL is a templating engine plus a vocabulary. The vocabulary is the load-bearing part. Once everybody in the team agrees that “Chain-of-Thought” is a block called chainOfThought with step("name") { ... } children, the prompt becomes a thing you can talk about over a video call without sharing a screen.

What about YAML?

YAML is the other answer to “prompts deserve structure”, and it is not wrong. YAML gives you the same separation of concerns as a DSL, with the advantage that every CI runner on Earth already knows how to parse it. The disadvantage is that YAML hates expressions. The moment your default value is a small piece of logic, or your step text needs an interpolation, YAML starts forcing you into magic strings that get evaluated later — which is, again, the f-string problem in a different costume.

The pragmatic answer is to support both, with one AST underneath. Author in the DSL when you care about readability and expressions. Emit YAML when you want to validate prompts in a CI job that does not want a JavaScript runtime. Convert in either direction with a single command. That is the bet promptel is making, and it has been a quietly correct bet so far.
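The property that makes "one AST underneath" work is that conversion round-trips without loss. A minimal sketch of that check follows, with two loud assumptions: JSON stands in for YAML because Python's standard library has no YAML parser, and `PromptAst`, `to_document`, and `from_document` are invented names, not promptel's API.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PromptAst:
    name: str
    version: str
    body: str
    constraints: dict = field(default_factory=dict)


def to_document(ast):
    """Serialise the tree to a structured text document (JSON here, as a YAML stand-in)."""
    return json.dumps(asdict(ast), indent=2, sort_keys=True)


def from_document(text):
    """Rebuild the tree from the structured document."""
    return PromptAst(**json.loads(text))


ast = PromptAst(
    name="code-review",
    version="1.2.0",
    body="Review this code:\n{code}",
    constraints={"temperature": 0.2},
)

document = to_document(ast)
# The invariant a converter has to hold: nothing is lost in either direction.
assert from_document(document) == ast
print(document)
```

Any second surface syntax, DSL or otherwise, would need the same pair of functions and the same equality check.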

What you give up

A DSL has a learning curve, even a small one. Anybody on your team who can read JavaScript can read promptel in about ten minutes, but they do have to spend those ten minutes. The DSL also has to be implemented somewhere — a parser, a runtime, a CLI — which is a maintenance surface you did not have before. If you adopt one, you are betting that the team behind it stays alive long enough for the bet to pay off.

In exchange, you get a prompt that can be reviewed in a PR, diffed in git, validated in CI, swapped between providers without surgery, and evaluated without scraping a Python module for string literals. That trade has only ever made sense in one direction.

The honest summary

Prompts are programs that happen to be made of words. Treat them like the strings they look like, and you will end up writing a worse version of every tool you have ever used for programs — version control, typing, review, evals, tests, observability. Treat them like programs, and the existing tools start to fit.

The cost of a DSL is small. The cost of pretending you don’t need one compounds with every prompt you ship.
