If DSPy is so great, why isn't anyone using it?

Posted by sbpayne 4 hours ago

If DSPy is so great, why isn't anyone using it?(skylarbpayne.com)

172 points | 104 comments

deaux 4 hours ago|

I don't see it at all.

> Typed I/O for every LLM call. Use Pydantic. Define what goes in and out.

Sure, not related to DSPy though, and completely tablestakes. Also not sure why the whole article assumes the only language in the world is Python.

> Separate prompts from code. Forces you to think about prompts as distinct things.

There's really no reason prompts must live in a file with a .md or .json or .txt extension rather than .py/.ts/.go/.., except if you indeed work at a company that decided it's a good idea to let random people change prod runtime behavior. If someone can think of a scenario where this is actually a good idea, feel free to enlighten me. I don't see how it's any more advisable than editing code in prod while it's running.

> Composable units. Every LLM call should be testable, mockable, chainable.

> Abstract model calls. Make swapping GPT-4 for Claude a one-line change.

And LiteLLM or `ai` (Vercel), the actually most used packages, aren't? You're comparing downloads with Langchain, probably the worst package to gain popularity of the last decade. It was just first to market, then after a short while most realized it's horrifically architected, and now it's just coasting on former name recognition while everyone who needs to get shit done uses something lighter like the above two.

> Eval infrastructure early. Day one. How will you know if a change helped?

Sure, to an extent. Outside of programming, most things where LLMs deliver actual value are very nondeterministic with no right answer. That's exactly what they offer. Plenty of which an LLM can't judge the quality of. Having basic evals is useful, but you can quickly run into their development taking more time than it's worth.

But above all.. the comments on this post immediately make clear that the biggest differentiator of DSPy is the prompt optimization. Yet this article doesn't mention that at all? Weird.

alexjplant 2 hours ago||

> Sure, not related to DSPy though, and completely tablestakes.

I agree but you'd be surprised at how many people will argue against static typing with a straight face. It's happened to me on at least three occasions that I can count and each time the usual suspects were trotted out: "it's quicker", "you should have tests to validate anyhow", "YOLO polymorphism is amazing", "Google writes Python so it's OK", etc.

It must be cultural as it always seems to be a specific subset of Python and ECMAScript devs making these arguments. I'm glad that type hints and Typescript are gaining traction as I fall firmly on the other side of this debate. The proliferation of LLM coding workflows has likely accelerated adoption since types provide such valuable local context to the models.

callbacked 1 hour ago|||

> not sure why the whole article assumes the only language in the world is Python

https://github.com/ax-llm/ax (if you're in the typescript world)

andyg_blog 3 hours ago|||

>the whole article assumes the only language in the world is Python.

This was my take as well.

My company recently started using Dspy, but you know what? We had to stand up an entire new repo in Python for it, because the vast majority of our code is not Python.

sbpayne 3 hours ago|||

I think this is an important point! I am actually a big fan of doing what works in the language(s) you're already using.

For example: I don't use Dspy at work! And I'm working in a primarily dotnet stack, so we definitely don't use Dspy... But still, I see the same patterns seeping through that I think are important to understand.

And then there's a question of "how do we implement these patterns idiomatically and ergonomically in our codebase/language?"

redwood 1 hour ago||

Out of curiosity, what are you finding success with in dotnet land? My observation is that it's not clear when Semantic Kernel is recommended versus one of multiple other MSFT newly-branded creations

sbpayne 1 hour ago||

we have been using Agent Framework. I also have been eyeing LlmTornado. Personally, I find it hard in dotnet as a whole to implement the kind of abstractions I want in order to make AI stuff ergonomic.

I've been fiddling around with many prototypes to try to figure out the right way to do this, but it feels challenging; I'm not yet familiar enough with how to do this ergonomically and idiomatically in dotnet haha

BoorishBears 2 hours ago|||

Why did you do that instead of using Liquid templates?

sbpayne 3 hours ago|||

I think all of these things are table-stakes; yet I see that they are implemented/supported poorly across many companies. All I'm saying is there are some patterns here that are important, and it makes sense to enter into building AI systems understanding them (whether or not you use Dspy) :)

PaulHoule 2 hours ago|||

I can say for 10 years I have been looking at general purpose frameworks like Dspy and even wrote one at work and they tend to be pretty bad, especially the one I wrote.

I agree with all the points that they list but I fear if I looked closely at the code and how they did it I wouldn't stop cringing until I looked away. Frameworks like this tend to point out 10 concerns that you should be concerned about but aren't and make users learn a lot of new stuff to bend their work around your framework but they rarely get a clear understanding of what the concerns are, where exactly the value comes from the framework, etc.

That is, if you are trying to sell something you can do a lot better with something crazy and one-third-baked like OpenClaw, which will make your local Apple Store sell out of minis, than anything that rationally explains "you are going to have to invent all the stuff that is in this framework that looks like incomprehensible bloat to you right now." I mean, it is rational, it is true, but I can say empirically as a person-who-sells-things that it doesn't sell, in fact if you wanted me to make a magic charm that looks like it would sell things and make sure you don't sell anything it would be that.

sbpayne 1 hour ago||

yeah the point I want to get across is less "you should use Dspy" and more "understand Dspy, so you are intentionally implementing the capabilities you need"

Implementations are generally always going to be messy; and still I feel like not all the messiness is incidental. A lot of it is accidental :)

persedes 3 hours ago|||

Dspy's advertising aside, imho it is a library only for optimizing an existing workflow/prompt and not for the use cases described there. Similar to how I would not write "production" code with sklearn :)

They themselves are turning into wrapper code for other libraries (e.g. the LLM abstraction which litellm handles for them).

Can also add:

Option 3: Use instructor + litellm (probably pydantic AI, but have not tried that yet)

Edit: As others pointed out their optimizing algorithms are very good (GEPA is great and lets you easily visualize / track the changes it makes to the prompt)

prpl 3 hours ago||

The sklearn comparison, to me, mirrors the insane amount of engineering that exists/existed to bring Jupyter notebooks to something more prod-worthy and reproducible. There’s always going to be re-engineering of these things; you don’t need to use the same tools for all use cases

persedes 2 hours ago||

Hmm not quite what I meant. Sklearn has its place in every ML toolbox, I'll use it to experiment and train my model. However for deploying it, I can e.g. just grab the weights of the model and run it with numpy in production without needing the heavy dependencies that sklearn adds.
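
A minimal sketch of the "grab the weights" idea, assuming a binary logistic-regression model whose parameters were exported after training (in scikit-learn a fitted `LogisticRegression` exposes them as `coef_` and `intercept_`). The `WEIGHTS` and `BIAS` values and the function names are made up for illustration, and plain Python stands in for numpy here so the snippet runs with no dependencies; with numpy the scoring line would be a single dot product.

```python
import math

# Hypothetical exported parameters, e.g. dumped to JSON after training.
# In scikit-learn they would come from model.coef_ and model.intercept_.
WEIGHTS = [1.5, -2.0]
BIAS = 0.25


def predict_proba(features):
    """Probability of the positive class for one feature vector."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, threshold=0.5):
    """Hard 0/1 label obtained by thresholding the probability."""
    return int(predict_proba(features) >= threshold)


print(predict([1.0, 0.0]))
print(predict([0.0, 1.0]))
```

The serving path then needs only the exported numbers and a few lines of arithmetic, which is the dependency saving described above.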

hedgehog 3 hours ago||

In my experience the behavior variation between models and providers is different enough that the "one-line swap" idea is only true for the simplest cases. I agree the prompt lifecycle is the same as code though. The compromise I'm at currently is to use text templates checked in with the rest of the code (Handlebars but it doesn't really matter) and enforce some structure with a wrapper that takes as inputs the template name + context data + output schema + target model, and internally papers over the behavioral differences I'm ok with ignoring.

I'm curious what other practitioners are doing.
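
One way the wrapper described above could look, as a rough sketch rather than anyone's actual implementation. Everything here is an assumption for illustration: the `TEMPLATES` and `MODEL_PREFIX` tables, the `OutputSchema` class, the `run_prompt` signature, and the stubbed `fake_call` transport. Python's `string.Template` stands in for Handlebars so the example has no dependencies.

```python
import json
from dataclasses import dataclass
from string import Template
from typing import Callable

# Hypothetical templates; in practice these would be files checked in with
# the code (string.Template is a stand-in for Handlebars).
TEMPLATES = {
    "extract_company": Template(
        "Extract the company name from: $text\n"
        "Reply with JSON containing the keys: $keys"
    ),
}

# Hypothetical per-model quirks the wrapper chooses to paper over.
MODEL_PREFIX = {
    "model-a": "",
    "model-b": "Return only raw JSON, no code fences.\n",
}


@dataclass(frozen=True)
class OutputSchema:
    """Minimal output schema: required keys mapped to Python types."""
    fields: dict

    def validate(self, raw: str) -> dict:
        data = json.loads(raw)
        for key, expected in self.fields.items():
            if not isinstance(data.get(key), expected):
                raise ValueError(f"field {key!r} missing or not {expected.__name__}")
        return data


def run_prompt(
    template_name: str,
    context: dict,
    schema: OutputSchema,
    model: str,
    call_model: Callable[[str, str], str],
) -> dict:
    """Render a checked-in template, call the model, validate the reply."""
    prompt = MODEL_PREFIX[model] + TEMPLATES[template_name].substitute(
        context, keys=", ".join(schema.fields)
    )
    return schema.validate(call_model(model, prompt))


# Stub transport so the sketch runs offline; a real one would call a provider SDK.
def fake_call(model: str, prompt: str) -> str:
    return json.dumps({"company": "Acme Corp"})


schema = OutputSchema({"company": str})
result = run_prompt(
    "extract_company", {"text": "Acme Corp raised $5M."}, schema, "model-b", fake_call
)
print(result)
```

Keeping the model-specific differences in one table makes it explicit which behavioral gaps are being ignored, and callers only ever see validated dictionaries.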

dbreunig 3 hours ago||

Model testing and swapping is one of the surprises people really appreciate DSPy for.

You're right: prompts are overfit to models. You can't just change the provider or target and know that you're giving it a fair shake. But if you have eval data and have been using a prompt optimizer with DSPy, you can try models with the one-line change followed by rerunning the prompt optimizer.

Dropbox just published a case study where they talk about this:

> At the same time, this experiment reinforced another benefit of the approach: iteration speed. Although gemma-3-12b was ultimately too weak for our highest-quality production judge paths, DSPy allowed us to reach that conclusion quickly and with measurable evidence. Instead of prolonged debate or manual trial and error, we could test the model directly against our evaluation framework and make a confident decision.

https://dropbox.tech/machine-learning/optimizing-dropbox-das...
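
The "swap the model, then rerun the optimizer" loop can be illustrated with a toy sketch. This is not DSPy's API: the stub models, `EVAL_SET`, `CANDIDATES`, and the `optimize` function are all hypothetical, and the "optimizer" is just an exhaustive search over two instruction strings scored by exact match. Real prompt optimizers are far more sophisticated, but the shape of the workflow is the same.

```python
from typing import Callable

# Toy eval set of (input text, expected answer) pairs. Hypothetical data.
EVAL_SET = [
    ("Acme Corp raised $5M.", "Acme Corp"),
    ("Shares of Globex fell.", "Globex"),
]

# Candidate instructions the toy optimizer searches over.
CANDIDATES = [
    "Extract the company name.",
    "Extract the company name. Answer with the name only.",
]

Model = Callable[[str, str], str]  # (instruction, text) -> answer


def _find_company(text: str) -> str:
    for name in ("Acme Corp", "Globex"):
        if name in text:
            return name
    return ""


def model_a(instruction: str, text: str) -> str:
    # Stub: always answers tersely, whatever the instruction says.
    return _find_company(text)


def model_b(instruction: str, text: str) -> str:
    # Stub: chatty unless explicitly told to answer with the name only.
    name = _find_company(text)
    return name if "name only" in instruction else f"The company is {name}."


def exact_match(model: Model, instruction: str) -> float:
    """Fraction of eval examples the model answers exactly right."""
    hits = sum(model(instruction, text) == want for text, want in EVAL_SET)
    return hits / len(EVAL_SET)


def optimize(model: Model) -> str:
    """Pick the best-scoring instruction; max() keeps the first on ties."""
    return max(CANDIDATES, key=lambda c: exact_match(model, c))


MODELS = {"model-a": model_a, "model-b": model_b}

target = "model-b"  # the "one-line change"
best = optimize(MODELS[target])
print(target, "->", best, exact_match(MODELS[target], best))
```

Here the prompt that is good enough for `model-a` scores zero on `model-b` until the search is rerun, which is the overfitting point made above: the eval data is what makes the swap a fair comparison.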

hedgehog 9 minutes ago|||

It's not just about fitting prompts to models, it's things like how web search works, how structured outputs are handled, various knobs like level of reasoning effort, etc. I don't think the DSPy approach is bad but it doesn't really solve those issues.

persedes 2 hours ago|||

funnily enough the model switching is mostly thanks to litellm which dspy wraps around.

nkozyra 4 hours ago||

> f"Extract the company name from: {text}"

I think one thing that's lost in all of the LLM tooling is that it's LLM-or-nothing and people have lost knowledge of other ML approaches that actually work just fine, like entity recognition.

I understand it's easier to just throw every problem at an LLM but there are things where off-the-shelf ML/NLP products work just as well without the latency or expense.

roadside_picnic 2 hours ago||

> like entity recognition

As someone who has done traditional NLP work as at least part of my job for the last 15 years, LLMs do offer a vastly superior NER solution over any previous NLP options.

I agree with your overall statement, that frequently people rush to grab an LLM when superior options already exist (classification is a big example, especially when the power of embeddings can be leveraged), but NER is absolutely a case where LLMs are the superior option (unless you have latency/cost requirements to force you to choose an inferior quality as the trade off, but your default should be an LLM today).
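
As an illustration of the classification-with-embeddings point, here is a minimal nearest-centroid sketch. The labels and the 3-dimensional vectors are made-up stand-ins for real embeddings, which would come from an embedding model and have hundreds of dimensions; the function names are hypothetical too.

```python
import math

# Toy 3-d vectors standing in for real text embeddings (hypothetical values).
TRAIN = {
    "billing": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "outage": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


CENTROIDS = {label: centroid(vecs) for label, vecs in TRAIN.items()}


def classify(embedding):
    """Return the label whose centroid is most similar to the embedding."""
    return max(CENTROIDS, key=lambda label: cosine(embedding, CENTROIDS[label]))


print(classify([0.85, 0.15, 0.05]))
print(classify([0.1, 0.9, 0.2]))
```

Once the embeddings exist, classification is a handful of arithmetic operations per request, with none of the latency or per-token cost of a generative call.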

mark_l_watson 29 minutes ago||

I agree! I used 'symbolic AI' for NLP starting in the early 1980s. Everything back then was so brittle, and very labor intensive.

sbpayne 4 hours ago|||

Oh 100%! There are many problems (including this one!) that probably aren't best suited for an LLM. I was just trying to pick a really simple example that most people would follow.

rao-v 3 hours ago|||

Is there a non-transformer based entity extraction solution that's not brittle? My understanding is that the cutting edge in entity extraction (e.g. spaCy) is just small BERT models, which rock for certain things, but don't have the world knowledge to handle typos / misspellings etc.

swyx 1 hour ago|||

but then u run into edge cases with indirect references and entity recognition models arent smart enough to deal with them, and bitter lesson hits you again.

sbpayne 1 hour ago||

the bitter lesson comes for us all, unfortunately!

Legend2440 1 hour ago||

I don't think you realize how bad NLP was prior to transformers. Oldschool entity recognition was extremely brittle to the point that it basically didn't work.

CV too for that matter, object recognition before deep learning required a white background and consistent angles. Remember this XKCD from only 2014? https://xkcd.com/1425/

nkozyra 1 hour ago||

CV is a space where I would 100% agree with you. But - edge cases notwithstanding - there's not so much of a dropoff with NER that I would first go to an LLM.

memothon 4 hours ago||

I think the real problem with using DSPy is that many of the problems people are trying to solve with LLMs (agents, chat) don't have an obvious path to evaluate. You have to really think carefully on how to build up a training and evaluation dataset that you can throw to DSPy to get it to optimize.

This takes a ton of upfront work and careful thinking. As soon as you move the goalposts of what you're trying to achieve you also have to update the training and evaluation dataset to cover that new use case.

This can actually get in the way of moving fast. Often teams are not trying to optimize their prompts but even trying to figure out what the set of questions and right answers should be!

sbpayne 4 hours ago|

Yeah, I think Dspy often does not really show its benefit until you have a good 'automated metric', which can be difficult to get to.

I think the unfortunate part is: the way it encourages you to structure your code is good for other reasons that might not be an 'acute' pain. And over time, it seems inevitable you'll end up building something that looks like it.

memothon 4 hours ago||

Yeah I agree with this. I will try to use it in earnest on my next project.

That metric is the key piece. I don't know the right way to build an automated metric for a lot of the systems I want to build that will stand the test of time.

sbpayne 3 hours ago||

To be clear: I don't know that I would recommend using it, exactly. I would just make sure you understand the lessons so you see how it best makes sense to apply to your project :)

LudwigNagasena 1 hour ago||

The article starts with the comparison of DSPy and LangChain monthly downloads and then wastes time comparing DSPy to hand-rolling basic infra, which is quite trivial in every barely mature setup.

I conjecture that the core value proposition of DSPy is its optimizer? Yet the article doesn't really touch it in any important way. How does it work? How would I integrate it into my production? Is it even worth it for usual use-cases? Adding a retry is not a problem, creating and maintaining an AI control plane is. LangChain provides services for observability, online and offline evaluation, prompt engineering, deployment, you name it.

sbpayne 1 hour ago|

You can see many people saying this in the comments :). I personally think this misses the core of what Dspy "is".

Dspy encourages you to write your code in a way that better enables optimization, yes (and provides direct abstractions for that). But this isn't in a sense unique to Dspy: you can get these same benefits by applying the right patterns.

I just find people constantly implementing these patterns without realizing it, and think they could benefit from understanding Dspy a bit better to make better implementations :)

stephantul 4 hours ago||

Mannnn, here I thought this was going to be an informative article! But it’s just a commercial for the author’s consulting business.

sbpayne 4 hours ago||

Oops! That's actually out of date from a prior template I had. I don't actually consult at the moment :). Removing!

halb 3 hours ago||

The author itself is probably ai-generated. The contact section in the blog is just placeholder values. I think the age of informative articles is gone

CharlieDigital 3 hours ago|||

I work with the author; the author is definitely not AI generated.

sbpayne 3 hours ago|||

This is definitely a mistake! What contact section are you referring to? The only references to contact I see in this post now are at the end where I linked to my X/LinkedIn profiles but those links look right to me?

TheTaytay 4 hours ago||

I tried it in the past, one time “in earnest.” But when I discovered that none of my actual optimized prompts were extractable, I got cold feet and went a different route. The idea of needing to fully commit to a framework scares me. The idea of having a computer optimize a prompt as a compilation step makes a lot of sense, but treating the underlying output prompt as an opaque blob doesn’t. Some of my use cases were JUST far enough off the beaten path that dspy was confusing, which didn’t help. And lastly, I felt like committing to dspy meant that I would be shutting the door on any other framework or tool or prompting approach down the road.

I think I might have just misunderstood how to use it.

sbpayne 4 hours ago|

I don't know that you misunderstood. This is one of my biggest gripes with Dspy as well. I think it takes the "prompt is a parameter" concept a bit too far.

I highly recommend checking out this community plugin from Maxime, it helps "bridge the gap": https://github.com/dspy-community/dspy-template-adapter

benh2477 13 minutes ago||

The adoption gap feels real. My experience is that developers don't trust AI outputs enough to build production workflows around them yet — the missing piece isn't better prompting frameworks, it's confidence signals that tell you when to trust the output.

alex7o 14 minutes ago||

I have used baml before and that worked super well for me multiple times so I don't see a problem with that.

tech_hutch 15 minutes ago||

I read the title as "If DarkSydePhil-y is so great, why isn't anyone using it?"

giorgioz 3 hours ago|

Loved the article because I exactly hit the stages all up till the 5th! Thank you for making me see the whole picture and journey!

I think a problem with DSPy is that they don't know the concept of THE WHOLE PRODUCT: https://en.wikipedia.org/wiki/Whole_product

Look at https://mastra.ai/ and https://www.copilotkit.ai/ to see how more inviting their pages look. A company is not selling only the product itself but all the other things around the product = THE WHOLE PRODUCT

A similar concept in developer tools is the docs are the product

Also I'm a fullstack javascript engineer and I don't use Python. Docs usually have a switch for the language at the top. Stripe.com is famous for its docs and Developer Experience: https://docs.stripe.com/search#examples It's great to study other great products to get inspiration and copy the best traits that are relevant to your product as well.

sbpayne 3 hours ago|

The "whole product" idea here makes a lot of sense to me. I think this is often a big barrier to adoption for sure!
