
Posted by sbpayne 9 hours ago

If DSPy is so great, why isn't anyone using it? (skylarbpayne.com)
196 points | 110 comments
Silamoth 5 hours ago|
Am I the only one disappointed this was about some LLM slop and not digital signal processing? DSP is a well-established technical acronym, so I expected to hear about a new Python DSP library. Oh well.
LoganDark 8 hours ago||
This article seemingly misses any explanation of what DSPy even is or why it's supposedly so complicated and unfamiliar. Supposedly it solves the problems illustrated in the article, but it isn't explained how.
sbpayne 8 hours ago|
Great feedback! I took for granted that people reading would be familiar with what Dspy is. I'll try to add this in tonight to introduce folks better. Thank you!
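Since the thread notes the article never explains what DSPy is: a stdlib-only sketch of the core idea DSPy is built around. This is NOT DSPy's real API (there you would subclass `dspy.Signature` and wrap it in a module like `dspy.Predict`); it just illustrates how a declarative signature of input/output fields can be compiled into a prompt instead of hand-writing prompt strings.

```python
# Hand-rolled illustration of the "declarative signature" idea behind
# DSPy -- not DSPy's actual API, just the underlying pattern.
from dataclasses import dataclass, field

@dataclass
class Signature:
    instructions: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)

    def to_prompt(self, **kwargs: str) -> str:
        """Render the signature plus concrete inputs into a prompt string."""
        lines = [self.instructions, ""]
        for name in self.inputs:
            lines.append(f"{name}: {kwargs[name]}")
        for name in self.outputs:
            lines.append(f"{name}:")  # the model fills these in
        return "\n".join(lines)

qa = Signature(
    instructions="Answer the question concisely.",
    inputs=["question"],
    outputs=["answer"],
)
prompt = qa.to_prompt(question="What is DSPy?")
print(prompt)
```

The point of the pattern: because the prompt is generated from a structured description rather than written by hand, an optimizer can rewrite instructions or add examples without anyone editing prompt strings.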
simopa 8 hours ago||
"Great engineers write bad AI code" made my day ;)
sbpayne 8 hours ago|
hahaha this has just been my entire last few years of experience :)
tilt 8 hours ago||
Curious what you think of https://github.com/pipevals/pipevals (author)
sbpayne 8 hours ago|
I have never heard of this! I took a quick look. I think I'm definitely not in the right audience for a tool like this, as I am more comfortable just writing code. But I think putting a UI over things like this _forces_ the underlying system to be more declarative...

So in practice I imagine you get at a lot of the same ideas / benefits!

dzonga 8 hours ago||
at /u/ sbpayne - very useful info and pricing page as well.

useful for upcoming consultants to learn how to price services too.

sbpayne 8 hours ago|
Highly recommend following @jxnl on X for consulting / positioning / pricing
msp26 7 hours ago||
> Data extraction tasks are amongst the easiest to evaluate because there’s a known “right” answer.

Wrong. There can be a lot of subjectivity and pretending that some golden answer exists does more harm and narrows down the scope of what you can build.

My other main problem with data extraction tasks, and why I'm not satisfied with any of the existing eval tools, is that the schemas I write can change drastically as my understanding of the problem increases. And nothing really seems to handle that well; I mostly just resort to reading diffs of what happens when I change something and reading the input/output data very closely. Marimo is fantastic for anything visual like this btw.

Also there is a difference between: the problem in reality → the business model → your db/application schema → the schema you send to the LLM. And to actually improve your schema/prompt you have to be mindful of the entire problem stack and how you might separate things that are handled through post processing rather than by the LLM directly.
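The "read the diffs when the schema changes" workflow above can be sketched with just the stdlib: re-run two revisions of the extraction step on a fixed sample and diff the JSON. `extract_v1`/`extract_v2` are hypothetical stand-ins for two revisions of a real pipeline.

```python
# Minimal schema-change diffing: run old and new extraction on the same
# input and show a unified diff of the structured outputs.
import difflib
import json

def extract_v1(text: str) -> dict:
    # old schema: a flat "name" field
    return {"name": text.split(",")[0]}

def extract_v2(text: str) -> dict:
    # new schema: name split into parts as understanding of the problem grew
    first, last = text.split(",")[0].split()
    return {"name": {"first": first, "last": last}}

def diff_outputs(sample: str) -> str:
    old = json.dumps(extract_v1(sample), indent=2, sort_keys=True)
    new = json.dumps(extract_v2(sample), indent=2, sort_keys=True)
    return "\n".join(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="schema_v1", tofile="schema_v2", lineterm=""))

report = diff_outputs("Ada Lovelace, mathematician")
print(report)
```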

> Abstract model calls. Make swapping GPT-4 for Claude a one-line change.

And in practice random limitations like structured output API schema limits between providers can make this non-trivial. God I hate the Gemini API.
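The "one-line model swap" and its failure mode can both be sketched. A thin adapter keeps the call site provider-agnostic, but provider-specific schema limits (the kind the comment complains about) still leak through. The client classes here are hypothetical stubs, not real SDK wrappers.

```python
# Sketch of abstracting model calls behind a common interface, and why
# structured-output schema differences keep the swap from being trivial.
from typing import Protocol

class ModelClient(Protocol):
    def complete(self, prompt: str, schema: dict) -> dict: ...

class OpenAIStub:
    def complete(self, prompt: str, schema: dict) -> dict:
        return {"provider": "openai", "fields": sorted(schema)}

class GeminiStub:
    # Illustrative provider-specific limitation: reject a schema feature
    # this (stub) API does not support.
    def complete(self, prompt: str, schema: dict) -> dict:
        if "additionalProperties" in schema:
            raise ValueError("schema feature not supported by this provider")
        return {"provider": "gemini", "fields": sorted(schema)}

def run(client: ModelClient, prompt: str, schema: dict) -> dict:
    # The call site is provider-agnostic; swapping models is one line here,
    # as long as both providers accept the schema you send.
    return client.complete(prompt, schema)

out = run(OpenAIStub(), "extract fields", {"name": {}, "age": {}})
print(out["provider"])
```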

sbpayne 7 hours ago||
This is very true! I could have been more careful/precise in how I worded this. I was really trying to just get across that it's in a sense easier than some tasks that can be much more open ended.

I'll think about how to word this better, thanks for the feedback!

sethkim 7 hours ago|||
This is extremely true. In fact, from what we see, many (if not most) of the problems being solved with LLMs do not have ground-truth values; even hand-labeled data tends to be mostly subjective.
rco8786 7 hours ago||
I think they're just saying that data extraction tasks are easy to evaluate because for a given input text/file you can specify the exact structured output you expect from it.
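That reading of the claim fits a simple eval loop: for each input, compare the model's structured output against a reference answer with exact equality. `extract` here is a hypothetical stand-in for the real model call.

```python
# Exact-match eval for a structured extraction task: the "right answer"
# is just a dict to compare against, input by input.
def extract(text: str) -> dict:
    # stand-in: a real system would call an LLM with a schema here
    amount, currency = text.split()
    return {"amount": float(amount), "currency": currency}

golden = [
    ("12.50 USD", {"amount": 12.5, "currency": "USD"}),
    ("3 EUR", {"amount": 3.0, "currency": "EUR"}),
]

score = sum(extract(text) == expected for text, expected in golden) / len(golden)
print(score)
```

Of course, as the parent thread points out, this only works when a golden answer genuinely exists; for subjective extractions the exact-match comparison is exactly what breaks down.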
AIorNot 6 hours ago||
I kind of like BAML (https://boundaryml.com/); been using it in production.

Edit: read the article - it's really good. That cycle of AI engineering progression is spot on.

TZubiri 8 hours ago||
> Stage 2: “Can we tweak the prompt without deploying?”

Are we playing philosophy here? If you move some part of the code out of the repo and into a database, then changing that database is still part of the deployment, but now your versioning has an identity crisis. Just put your prompts in your git repo and say no when someone requests that an anti-pattern be implemented.

sbpayne 8 hours ago|
I think the core challenge here is that being able to (in "development") quickly change the prompt or other parameters and re-run the system to see how it changes is really valuable for making a tight iteration loop.

It's annoying/difficult in practice if this is strictly in code. I don't think a database is necessarily the way to go, but it's just a common pattern I see. And I really strongly believe this is more of a need for a "development time override" than the primary way to deploy to production, to be clear.
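Both positions in this subthread can coexist in one small pattern: the prompt lives in the git repo as the source of truth, and an environment variable overrides it locally for fast iteration, never in production. Paths and variable names below are illustrative.

```python
# Prompts-in-git with a development-time override via environment variable.
import os
import tempfile
from pathlib import Path

def load_prompt(name: str, prompt_dir: Path) -> str:
    # Dev-time override: handy for tight iteration loops, not for prod.
    override = os.environ.get(f"PROMPT_OVERRIDE_{name.upper()}")
    if override is not None:
        return override
    # Default: the version-controlled prompt file is the source of truth.
    return (prompt_dir / f"{name}.txt").read_text()

# usage sketch
with tempfile.TemporaryDirectory() as d:
    prompts = Path(d)
    (prompts / "summarize.txt").write_text("Summarize the input.")
    assert load_prompt("summarize", prompts) == "Summarize the input."
    os.environ["PROMPT_OVERRIDE_SUMMARIZE"] = "Summarize in one word."
    assert load_prompt("summarize", prompts) == "Summarize in one word."
```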

markab21 8 hours ago||
I think the entire premise that prompting is the surface area for optimizing the application is fundamentally the wrong framing, in the same way that in 1998 the wrong framing was that better CPAN would save CGI. It's solving the wrong problems; it's only the current limitations in context and model intelligence that make a tool like DSPy seem necessary.

The only thing I'd grab DSPy for at this point is to automate the edges of the agentic pipeline that could be improved with RL patterns. But if that is true, you're really shortchanging yourself by handing your domain to DSPy. You should be building your own RL learning loops.

My experience: If you find yourself reaching for a tool like Dspy, you might be sitting on a scenario where reinforcement learning approaches would help even further up the stack than your prompts, and you're probably missing where the real optimization win is. (Think bigger)

sbpayne 8 hours ago|
Yeah, I find it hard to recommend Dspy. At the same time, I can't escape the observation that many companies are re-implementing a lot of parts of it. So I think it's important to at least learn from what Dspy is :)
villgax 8 hours ago|
Nobody uses it except for maybe the Weaviate developer advocates running those Jupyter cells.