Posted by vismit2000 14 hours ago
> Rule 5. Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
Always preferred Perlis' version, that might be slightly over-used in functional programming to justify all kinds of hijinks, but with some nuance works out really well in practice:
> 9. It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures.
>I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
-- Linus Torvalds
> -- Linus Torvalds
What about programmers
- for whom the code is a data structure?
- who formulate their data structures in a way (e.g. in a very powerful type system) such that all the data structures are code?
- who invent a completely novel way of thinking about computer programs such that in this paradigm both code and data structures are just trivial special cases of some mind-blowing concept ζ of which there exist other special cases that are useful to write powerful programs, but these special cases are completely alien from anything that could be called "code" or "data (structures)", i.e. these programmers don't think/worry about code or data structures, but about ζ?
This kind of exploration can be a really positive use case for AI I think, like show me a sketch of this design vs that design and let's compare them together.
My recommendation is to truly learn a functional language and apply it to a real world product. Then you’ll learn how to think about data, in its pure state, and how it is transformed to get from point A to point B. These lessons will make for much cleaner design that will be applicable to imperative languages as well.
Or learn C where you do not have the luxury of using high-level crutches.
Not sure if SoTA codegen models are capable of navigating design space and coming up with optimal solutions. Like for cybersecurity, may be specialized models (like DeepMind's Sec-Gemini), if there are any, might?
I reckon, a programmer who already has learnt about / explored the design space, will be able to prompt more pointedly and evaluate the output qualitatively.
> sometimes a barrier to getting started for me
Plenty great books on the topic (:
Algorithms + Data Structures = Programs (1976), https://en.wikipedia.org/wiki/Algorithms_%2B_Data_Structures...
"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious." -- Fred Brooks, The Mythical Man Month (1975)
This is what I've observed with using AI on relatively small (~1000 line) programs. When I add a requirement that requires a different data structure, Claude will happily move to the new optimal data structure, and rewrite literally everything accordingly.
I've heard that it gets dicier when you have source files that are 30K-40K lines and programs that are in the million+ line range. My reports have reported that Gemini falls down badly in this case, because the source file blows the context window. But even then, they've also reported that you can make progress by asking Gemini to come up with the new design, and then asking it to come up with a list of modules that depend upon the old structure, and then asking it to write a shim layer module-by-module to have the old code use the new data structure, and then have it replace the old data structure with the new one, and then have it remove the shim layer and rewrite the code of each module to natively use the new data structure. Basically, babysit it through the same refactoring that an experienced programmer would use to do a large-scale refactoring in a million+ line codebase, but have the AI rewrite modules in 5 minutes that would take a programmer 5 weeks.
You don't need to be able to pass a leet code interview, but you should know about big O complexity, you should be able to work out if a linked list is better than an array, you should be able to program a trie, and you should be at least aware of concepts like cache coherence / locality. You don't need to be an expert, but these are realities of the way software and hardware work. They're also not super complex to gain a working knowledge of, and various LLMs are probably a really good way to gain that knowledge.
Bill Gates, for example, always advocated for thinking through the entire program design and data structures before writing any code, emphasizing that structure is crucial to success.
While developing Altair BASIC, his choice of data structures and algorithms enabled him to fit the code into just 4 kilobytes.
Microsoft is another story.
This is the worst sort of documentation; technically true but quite unenlightening. It is, in the parlance of the Fred Brooks quote mentioned in a sibling comment, neither the "flowchart" nor the "tables"; it is simply a brute enumeration of code.
To which the fix is, ask for the right thing. Ask for it to analyze the key data structures (tables) and provide you the flow through the program (the flowchart). It'll do it no problem. Might be inaccurate, as is a hazard with all documentation, but it makes as good a try at this style of documentation as "conventional" documentation.
Honestly one of the biggest problems I have with AI coding and documentation is just that the training set is filled to the brim with mediocrity and the defaults are inferior like this on numerous fronts. Also relevant to this conversation is that AI tends to code the same way it documents and it won't have either clear flow charts or tables unless you carefully prompt for them. It's pretty good at doing it when you ask, but if you don't ask you're gonna get a mess.
(And I find, at least in my contexts, using opus, you can't seem to prompt it to "use good data structures" in advance, it just writes scripting code like it always does and like that part of the prompt wasn't there. You pretty much have to come back in after its first cut and tell it what data structures to create. Then it's really good at the rest. YMMV, as is the way of AI.)
My interpretation of his point of view is that what you need is a process/interpreter/live object that 'explains' the data.
https://news.ycombinator.com/item?id=11945722
EDIT: He writes more about it in Quora. In brief, he says it is 'meaning', not 'data' that is central to programming.
One part of it has interesting new resonance in the era of agentic LLMs:
alankay on June 21, 2016 | root | parent | next [–]
This is why "the objects of the future" have to be ambassadors that can negotiate with other objects they've never seen. Think about this as one of the consequences of massive scaling ...
Nowdays rather than the methods associated with data objects, we are dealing with "context" and "prompts".
I should probably be thinking more in this direction.
Perlin: stringly typed logic is great!
I don't take it like that. A map could be the right data structure for something people typically reach for classes to do, and then you get a whole bunch of functions that can already operate on a map-like thing for free.
If you take a look at the standard library and the data structures of Clojure you'd see this approach taken to a somewhat extreme amount.
> Busywork code is not important. Data is important. And data is not difficult. It's only data. If you have too much, filter it. If it's not what you want, map it. Focus on the data; leave the busywork behind.
If I have learned one thing in my 30-40 years spent writing code, it is this.
You never know how requirements are going to change over the next 5 years, and pure structures are always the most flexible to work with.
You still have to worry about someone using kg when you use g, but you avoid a large class of problems and make your logic easier.
> 2. Functions delay binding; data structures induce binding. Moral: Structure data late in the programming process.
https://ocw.mit.edu/courses/6-001-structure-and-interpretati...
which I found very helpful in (finally) managing to get through that entire text (and do all the exercises).
What I am getting at is that when one has such gigantic data structure there is no separation of concerns.
Anytime one has access to a database one has access to one large global data structure that one can access from anywhere is a program.
This same concept goes for one's global state in one's game if one is making a game.
You keep saying you believe it, but that is literally what a database is, game state manipulation, string manipulation, iterator algorithms, list comprehensions, range algorithms, image manipulations, etc. These are all instances where you use the same data structures over and over with as many algorithms and functions and you need.
The key difficulty is identifying what these are is far from obvious upfront, and so often an index appears adjacent to a table that represents what the table should have been in the first place.
This is why I fundamentally find SQL too conservative and outdated. There are obvious patterns for cross-cutting concerns that would mitigate things like this but enterprise SQL products like Oracle and MS are awful at providing ways to do these reusable cross-cutting concerns consistently.
> Good programmers worry about data structures and their relationships.
> -- Linus Torvalds
I was specifically thinking about the "relationship" issues. The worst messes to fix are the ones where the programmer didn't consider how to relate the objects together - which relationships need to be direct PK bindings, which can be indirect, which things have to be cached vs calculated live, which things are the cache (vs the master copy), what the cardinality of each relationship is, which relationships are semantically ownerships vs peers, which data is part of the system itself vs configuration data vs live, how you handle changes to the data, (event sourcing vs changelogging vs vs append-only vs yolo update), etc.
Not quite "data structures" I admit but absolutely thinking hard about the relationship between all the data you have.
SQL doesn't frame all of these questions out for you but it's good getting you to start thinking about them in a way you might not otherwise.
That's great
When someone says "I want a programming language in which I need only say what I wish done," give him a lollipop.Pike is right.
I would guess Pike is simply wise enough not to get involved in such arguments.
“If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident.”
And:
“Use simple algorithms as well as simple data structures.”
A data structure general enough to solve enough problems to be meaningful will either be poorly suited to some problems or have complex algorithms for those problems, or both.
There are reasons we don’t all use graph databases or triple stores, and rely on abstractions over our byte arrays.
Let's say you're working for the DMV on a program for driver's licenses. The idea is to use one structure for driver's license data, as opposed to using one structure for new driver's licenses, a different one for renewals, and yet a third for expired ones, and a fourth one for name changes.
It is not saying that you should use byte arrays for driver's license records, so that you can use the same data structure for driver's license data and missile tracks. Generalize within your program, not across all possible programs running on all computers.
You do not write programs with one map of id to thing as you are suggesting here.
I'm a big fan of Data Oriented Design. Once you conceptualize how data is stored and transformed in your program, it just has to be reflected in data structures that make it possible.
Modern design approaches tend to focus on choosing a right abstraction like columnar/row layout, caching etc. They mostly fail to optimally work with the data. Optimal in this case meaning getting most of all underlying hardware capabilities, for example reading large and preferably continuous blocks of data from magnetic storage, parallel data processing, keeping intermediate results in CPU caches, utilizing all physical SSD queues.
It's all use-case and priority-specific, and I think the more varied your experience and more tools you have in the tool belt, the better off you can be to bring the right solution to bear. Of course, then you think you have the right solution in mind (lets say using partitions in postgres for something) but you find the ORM your service is using doesn't support it, then what is "best" becomes not only problem-specific but also tool-specific. Finally, even if you have the best solution and your existing ecosystem supports it but the rest of the engineering staff you have is unfamiliar with it, it may again no longer be "best".
this ladder of problem-fit, ecosystem-fit, staffing-fit is something I have grappled with in my career.
LLMs are only so-so at any of the above (even when including the agent as "staff".)
Of course with experience, you start to feel when the straight forward suboptimal code will cause massive performance issued. In this case it's critical to take action up front to avoid the mess. Its called software architecture, I guess.
If you can't tell in advance what is performance critical, then consider everything to be performance critical.
I would then go against rule #3 "Fancy algorithms are slow when n is small, and n is usually small". n is usually small, except when it isn't, and as per rule #1, you may not know that ahead of time. Assuming n is going to be small is how you get accidentally quadratic behavior, such as the infamous GTA bug. So, assume n is going to be big unless you are sure it won't be. Understand that your users may use your software in ways you don't expect.
Note that if you really want high performance, you should properly characterize your "n" so that you can use the appropriate technique, it is hard because you need to know all your use cases and their implications in advance. Assuming n will be big is the easy way!
About rule #4, fancy algorithms are often not harder to implement, most of the times, it means using the right library.
About rule #2 (measure), yes, you absolutely should, but it doesn't mean you shouldn't consider performance before you measure. It would be like saying that you shouldn't worry about introducing bugs before testing. You should do your best to make your code fast and correct before you start measuring and testing.
What I agree with is that you shouldn't introduce speed hacks unless you know what you are doing. Most of performance come from giving it consideration on every step. Avoiding a copy here, using a hash map instead of a linear search there, etc... If you have to resort to a hack, it may be because you didn't consider performance early enough. For example, if took care of making a function fast enough, you may not have to cache results later on.
As for #5, I agree completely. Data is the most important. It applies to performance too, especially on modern hardware. To give you an very simplified idea, RAM access is about 100x slower than running a CPU instruction, it means you can get massive speed improvement by making your memory footprint smaller and using cache-friendly data structures.
As for rule 2: first you measure.
edit: s/data/data structure/
Good software can handle crap data.
Knuth later attributed it to Hoare, but Hoare said he had no recollection of it and suggested it might have been Dijkstra.
Rule 5 aged the best. "Data dominates" is the lesson every senior engineer eventually learns the hard way.
I learnt about rule-5 through experience before I had heard it was a rule.
I used to do tech due diligence for acquisition of companies. I had a very short time, about a day. I hit upon a great time saver idea of asking them to show their DB schema and explain it. It turned out to be surprisingly effective. Once I understood the scheme most of the architecture explained itself.
Now I apply the same principle while designing a system.
Getting competent at it, however, is no joke and takes time.
Number 5 is timeless and relevant at all scales, especially as code iterations have gotten faster and faster, data is all the more relevant. Numbers 4 and 3 have shifted a bit since data sizes and performance have ballooned, algorithm overhead isn't quite as big a concern, but the simplicity argument is relevant as ever. Numbers 2 and 1 while still true (Amdahl's law is a mathematical truth after all), are also clearly a product of their time and the hard constraints programmers had to deal with at the time as well as the shallowness of the stack. Still good wisdom, though I think on the whole the majority of programmers are less concerned about performance than they should be, especially compared to 50 years ago.
There are a lot of systems where useless work and other inefficiencies are spread all over the place. Even though I think garbage collection is underrated (e.g. Rustifarians will agree with me in 15 years) it's a good example because of the nonlocality that profilers miss or misunderstand.
You can make great prop bets around "I'll rewrite your Array-of-Structures code to Structure-of-Arrays code and it will get much faster"
https://en.wikipedia.org/wiki/AoS_and_SoA
because SoA usually is much more cache friendly and AoS makes the memory hierarchy perform poorly in a way profilers can't see. The more time somebody spends looking at profilers and more they quote Rule 1 the more they get blindsided by it.
On #5, I think most people tend to just lean on RDBMS databases for a lot of data access patterns. I think it helps to have some fundamental understandings in terms of where/how/why you can optimize databases as well as where it make sense to consider non-relational (no-sql) databases too. A poorly structured database can crawl under a relatively small number of users.