Posted by yagizdegirmenci 10/26/2024
https://commoncog.com/becoming-data-driven-first-principles/
https://commoncog.com/the-amazon-weekly-business-review/
(It took that long because of (a) an NDA, and (b) the time it takes to put the ideas into practice, understand them, and then teach them to other business operators!)
The ideas presented in this particular essay really trace back to W. Edwards Deming, Donald Wheeler, and Brian Joiner (who created Minitab; ‘Joiner’s Rule’, the variant of Goodhart’s Law cited in the link above, is attributed to him).
Most of these ideas were developed in manufacturing in the post-WW2 period. The Amazon-style WBR merely adapts them for the tech industry.
I hope you will enjoy these essays — and better yet, put them to practice. Multiple executives have told me the series of posts have completely changed the way they see and run their businesses.
Business leaders like to project success and promise growth there is no evidence they will or can achieve, and then put it on workers to deliver. When there's no way to achieve the outcome other than to cheat the numbers, the workers will cheat (and will have to).
At some point businesses stopped treating outperforming the previous year's quarter as over-delivering, and made it an expectation, regardless of what is actually doable.
It then discusses ways that the factory might cheat to get higher numbers.
But it doesn't even mention what I suspect the most likely outcome is: they achieve the target by sacrificing something else that isn't measured, such as quality of the product (perhaps by shipping defective widgets that should have been discarded, or working faster which results in more defects, or cutting out parts of the process, etc.), or safety of the workers, or making the workers work longer hours, etc.
As "Goodhart's law" is used here, in contrast, the focus is on side effects of a policy. The goal in this situation is not to make the target useless, as it is if you're doing central bank policy correctly.
They need something they can check easily so the team can get back to work. It's hard to find metrics that are both meaningful to the business and track with the work being asked of the team.
You can look at revenue and decide "hey, we have a problem here" and go research what's causing the problem. That's a perfectly valid use for a KPI.
You can make a change via something like the Toyota process, saying "we will improve X", making the change, and tracking X to see whether you must revert or keep it. That is another perfectly valid use for a KPI.
What you can't do is use them to judge people.
It's easy to fake one metric; it's harder to consistently game around 100 of them.
(But then they're probably no longer KPIs, as whoever looks at the data needs to recognise that details and nuance are important.)
Do you have enough KPIs that you can be sure that these targets also serve as useful metrics for the org as a whole? Do you randomize the assignment every quarter?
As I talk through this ... have you considered keeping some "hidden KPIs"?
Can't say what the deep idea in this case is per se (haha; maybe the other commenter can shed light on that part), but I guess if you have enough KPIs to be able to rotate them, you have yourself a perpetual motion machine of the same nature as the one some genius carried down from the mountain on stone tablets: that we can sustain maximum velocity ad infinitum by splitting our work into two-week chunks and calling them "sprints"... why haven't marathoners thought of this? (h/t Rich Hickey, the source of that amazing joke that I butcher here)
maybe consciousness itself is nothing more than the brain managing to optimize all of its KPIs at the same time.
A 90-day target is questionable, but regularly changing the metrics is a good way to keep people from gaming them.
Here's the thing: there's no fixing Goodhart's Law. You just can't measure anything directly; even measuring with a ruler is a proxy for a metre, without infinite precision. This gets much harder as the environment changes under you and metrics' utility changes with time.
That said, much of the advice is good: making it hard to hack and giving people flexibility. It's a bit obvious that flexibility is needed if you're interpreting Goodhart's as "every measure is a proxy", "no measure is perfectly aligned", or "every measure can be hacked".
I might be wrong but I feel like WBR treats variation (looking at the measure and saying "it has changed") as a trigger point for investigation rather than conclusion.
In that case, let's say you do something silly and measure lines of code committed. Let's also say you told everyone, it will factor into performance reviews, and the company is known for stack ranking.
You introduce the LOC measure. All employees watch it like a hawk. While working they add useless blocks of code and so on.
LOC committed goes up and looks significant on the XmR chart.
Option 1: grab champagne, pay exec bonus, congratulate yourself.
Option 2: investigate
Option 2 is better of course. But it is such a mindset shift. Option 2 lets you see whether Goodhart's Law kicked in or not. It lets you actually learn.
(a) All processes have some natural variation, and for as long as outputs fall in the range of natural process variation, we are looking at the same process.
(b) Some processes apparently exhibit outputs outside of their natural variation. When this happens, something specific has occurred, and it is worth trying to find out what.
In the second case, there are many possible reasons for exceptional outputs:
- Measurement error,
- Failure of the process,
- Two interleaved processes masquerading as one,
- A process improvement has permanently shifted the level of the output,
- etc.
SPC tells us that we should not waste effort on investigating natural variation, and should not make blind assumptions about exceptional variation.
It says outliers are the most valuable signals we have, because they tell us we are not only looking at what we thought we were, but at something ... else as well.
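A rough sketch of how an XmR chart draws that line between natural and exceptional variation, using Wheeler's standard 2.66 scaling constant (the weekly LOC numbers here are invented for illustration):

```python
def xmr_limits(values):
    """Natural process limits for an XmR (individuals) chart."""
    mean = sum(values) / len(values)
    # Moving ranges: absolute differences between consecutive points.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    avg_mr = sum(moving_ranges) / len(moving_ranges)
    # 2.66 is Wheeler's scaling constant for individual values.
    return mean - 2.66 * avg_mr, mean, mean + 2.66 * avg_mr

# Invented weekly LOC counts; the last week spikes.
weekly_loc = [410, 395, 420, 405, 415, 400, 980]
lower, mean, upper = xmr_limits(weekly_loc)
exceptional = [v for v in weekly_loc if v < lower or v > upper]
print(exceptional)  # -> [980]: a signal worth investigating, not celebrating
```

Points inside the limits are routine variation and not worth chasing; the point outside them is the "something specific has occurred" case above. (In practice you'd also screen unusually large moving ranges before averaging; this sketch skips that.)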
If companies knew how to make it difficult to distort the system/data, don't you think they would have done it already? This feels like telling a person learning a new language that they should try to sound more fluent.
* Create a finance department that's independent in both their reporting and ability to confirm metrics reported by other departments
* Provide a periodic meeting (for executives/managers) that reviews all metrics and allows them to be altered if need be
* Don't try to provide a small number of measurable metrics or a "north star" single metric
The idea being that a review meeting over 500+ metrics gives a better potential model. Further, even though 500+ metrics is a lot to review, each should be reviewed briefly, with most of them being "no change, move on"; this allows managers to get a holistic feel for the model and identify metrics that are, or are becoming, outliers (positively or negatively).
The independent finance department discourages the reporting of bad data, and, coupled with the WBR and its subsequent empowerment, provides facilities to change the system.
The three main points (making it difficult to distort the system, making it difficult to distort the data, and providing facilities for change) all need to be implemented to have an effect. Providing only the "punishment" (making it difficult to distort the system/data) without any facility for change puts too much pressure on without any relief.
> that **false** proxies are game-able.
You say this as if there are measures that aren't proxies. Tbh I can't think of a single one, even a trivial one. All measures are proxies and all measures are gameable. If you're uncertain, host a prize and you'll see how creative people get.
https://en.wikipedia.org/wiki/Net_present_value#Disadvantage...
Second, you don't think it's possible to hide costs, exaggerate revenue, ignore risks, and/or optimize short-term revenue at the expense of long-term? If you think something isn't hackable, you just aren't looking hard enough.
But I'm suspicious of the claim that it can't be hacked. I've done a lot of experimental physics, and I can tell you that you can hack a metric as simple and obvious as measuring something with a ruler. That's because it's still a proxy: your ruler is an approximation of a metre, and if you look at all the rulers, calipers, tape measures, etc. you have, you will find that they are not exactly identical, though likely fairly close. And people happily round, or are very willing to overlook errors/mistakes, when the result makes sense or is nice. That's a pretty trivial system, and it still gets hacked.
With more abstract things it's far easier to hack, specifically by not digging deep enough. When your metrics are aggregations of other metrics (as is the case in your example), you have to look at every single metric and understand how it proxies what you're really after. Keeping with economics, GDP might be a great example. It is often used to say how "rich" a country is, but that means very little in and of itself. It's generally true that it's easier to increase this number when you have many wealthy companies or individuals, but from this alone you shouldn't be able to distinguish two countries of equal size where all the wealth is held by a single person versus where wealth is equally distributed among all people.
The point is that there are always baked-in priors, baked-in assumptions. If you want to find how to hack a metric, hunt down all the assumptions (these will not be in a list you can look up, unfortunately) and find where those assumptions break. A famous math example is the Banach-Tarski paradox: all the required assumptions (including the axiom of choice) appear straightforward and obvious. But the thing is, as long as you have an axiomatic system (you do), somewhere those assumptions break down. Finding them isn't always easy, but hey, give it scale and Goodhart's Law will do its magic.
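To make the baked-in-assumptions point concrete, here's a toy NPV calculation (cashflows and discount rates invented for illustration): the sign of the result flips with the assumed rate, which is exactly the kind of prior a motivated analyst can tune.

```python
def npv(rate, cashflows):
    """Net present value; cashflows[t] arrives at period t."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

# Invest 1000 now, get 300 back in each of the next five periods.
flows = [-1000, 300, 300, 300, 300, 300]
print(round(npv(0.05, flows), 2))  # -> 298.84 (project looks good at 5%)
print(round(npv(0.20, flows), 2))  # -> -102.82 (same project, bad at 20%)
```

Same cashflows, opposite conclusions; whoever chooses the discount rate, or decides which costs and risks make it into `flows` at all, has already half-decided the answer.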
Exactly (btw. very nice way to put it)
> stop wasting time on tracking false proxies
Sometimes a proxy is much cheaper. (Medical analogy of limited depth: instead of doing surgery to see things in person, one might opt to check some ratios in the blood first.)
This would not count as a false proxy, however. The problem in software is that it is very hard to construct meaningful proxy metrics. Most of the time they end up being tangential to value.
I agree in principle, just want to add a bit of nuance.
Let's take the famous “lines of code” metric.
It would be counterproductive to reward it (as proxy of productivity). But it is a good metric to know.
For the same reason why it’s good to know the weight of ships you produce.
The value in tracking false proxies like lines of code, accrues to the tracker, not the customer, business or anyone else. The tracker is able to extract value from the business, essentially by tricking it (and themselves) into believing that such metrics are valuable. It isn't a good use of time in my opinion, but probably a low stress / chill occupation if that is your objective.
In theory you can return to the metrics later for shorter intervals.
From a programming standpoint, and off the top of my head, I would include TDD, code coverage, and anything that comes out of a root cause analysis.
I tell junior devs who ask that they should spend a little more time on every task than they think necessary, trying to raise their game. When doing a simple task you should practice all of your best intentions and new intentions, to build the muscle memory.
I don't know how to track TDD, but for me, code coverage is an example of the same old false proxies that people used to track in the '00s.
Before creating a metric and policing it, make sure you can rigorously defend its relationship to NPV. If you can't do this, find something else to track.
Can't it? Amazon may be an exception, but most of the time running without numbers or quantitative goals seems to work better than having them.