Be careful what you measure because you will optimize it
Humans are very competitive. If you give them a number to rate themselves with, they will try to increase it.
If you find an Open Source project that is not well optimized on a certain metric you care about, there is only one thing you need to do to make it better: write a benchmark for it. Ideally one showing that the project is worse on that metric than similar projects. You're almost guaranteed that improvements will follow in short order.
Most people are probably familiar with the quote:
“Measurement is the first step that leads to control and eventually to improvement. If you can’t measure something, you can’t understand it. If you can’t understand it, you can’t control it. If you can’t control it, you can’t improve it.”
― H. James Harrington
It's an important insight, and people have internalized it and wish to apply it... even where it can't be applied.
Things go wrong when people refuse to accept that there are many important things we simply cannot “measure”. Complex things are generally not measurable – they do not reduce to a finite set of numbers.
An example I want to point out specifically is human productivity in complex domains (like SWE). This is not to say that there are no differences in productivity between people, or that we cannot observe, compare, and judge them. No. I only want to say that human productivity in complex domains cannot be reduced to a couple of numbers. And people ignore this at their peril.
I bet a lot of developers are familiar with the story of futile attempts to measure developer productivity by the number of lines of code produced.
But I don't think most of us fully appreciate just how important the general principles behind it are. Because if we did, we would notice it being done, to our peril, over and over, everywhere.
The problem with measuring lines of code produced is that it does work! We can measure it, we can understand it, we can control it, and we can improve it! Yes! It's just that improving the number of lines of code produced was not our goal in the first place.
If you give people a measurement, they WILL optimize it! What was desired was to increase productivity, but since that is not measurable, a proxy for productivity was used instead, and then the proxy was optimized at the expense of productivity itself.
And what's particularly pernicious and dreadful about it is that any metric that you bring to people's attention that looks like a ranking or a benchmark will be treated as such, at the expense of the real goals.
In the example of writing a benchmark for an Open Source project – rating well in benchmarks was never an explicit goal of that project. No CEO yelled at people, “Why do we look bad on this benchmark?!” It is purely human nature that makes us want to bump any social-status-like numbers.
Why do you think almost every big company annoys its customers with automated phone calls after you've had the “pleasure” of using their support line? “On a scale from 1 to 5, do you think the information presented to you was clear...” and so on, trying to waste 10 minutes of your life. It is absurd, so why do companies do it? Because, facing the impossibility of measuring something as complex as overall customer satisfaction, they have decided to measure a proxy, and now they are focused on measuring and optimizing the one number they can get, to the detriment of customer satisfaction itself.
Why do all companies insist on sending spammy emails to their customers, even though they know people hate having their inboxes spammed? It's not just that they don't give a shit about their customers. It's primarily that “emails opened” and “users clicking a link” are easy to measure, so it is too tempting to use them as a proxy for overall marketing effectiveness (which, as a complex thing, can't be measured directly) and tirelessly optimize them until customers hate you.
So, the final test for you, dear reader.
- What's going to happen when a software team starts assigning “story points” and tracking “velocity” to measure (and maybe improve) productivity? Will productivity go up or down?
- Since “more informative commit messages are generally longer” by necessity, what is going to happen if you start automatically tracking the length of commit messages and rewarding people who write longer commit messages on average? Will the quality of commit messages go up or down?
- “Experienced and skilled software engineers deliver novel and important technologies and products.” What will happen if you tie delivering a certain number of such “big impact projects” to your promotion system?
I hope it's completely unnecessary to say that, in order:
- Productivity will go down, as the team gets busy estimating, tracking, and creating stories for the most minute things for their own sake instead of doing the actual work.
- The quality of commit messages will go down, as they get filled with pointless fluff instead of useful information.
- You end up with 10 incompatible messengers, each lackluster, and other short-lived and deteriorating products.
So what to do?
Be honest and mindful about what you can measure directly and what you cannot, and do not mistake a measurement by proxy for the real thing.
Oftentimes there are plenty of good direct measures available, just not at the organizational level you are looking at. For example, in car manufacturing – a complex process – the aggregate number of cars produced every day, defect ratios, etc. are direct measures that are great to optimize for! That's probably why car companies don't try to measure productivity by counting the number of screws used every day, or something similarly stupid and misleading. Otherwise, we would surely be driving cars with an absurd number of screws in them.
Avoid measuring proxies and other non-goals, even just for “informational purposes”. Even if you explicitly say they are not benchmarks, merely raising them to wider awareness makes them optimization targets.
Only ever put in front of your people the numbers you directly want to optimize, even if they are not the numbers they are directly responsible for. Product numbers, company numbers – the important numbers that you can measure and care to improve.
Where no direct measurements are possible, just use common sense and rely on human judgment. A manager in a software shop should be able to form a reasonable sense of individual productivity, among other things, directly through observation and through feedback from peers. Human judgment can sometimes be wrong, but it is a much safer bet than blind and gameable numbers.
I hope I have mostly wasted your time, because you already know all this and immediately recognize it as an instance of the more general “incentives matter” principle – and I have merely pointed out that benchmarks, rankings, and other social-status-like games are very powerful incentives, which you also already know!