More development statistics

The LWN's development statistics are being published at end of each release cycle for as long as I can remember (Linux 6.3 stats). Thinking back, I can divide the stages of my career based on my relationship with those stats. Fandom; aspiring; success; cynicism; professionalism (showing the stats to my manager). The last one gave me the most pause.

Developers will agree (I think) that patch count is not a great metric for the value of the work. Yet, most of my managers had a distinct spark in their eye when I shared the fact that some random API refactoring landed me in the top 10.

Understanding the value of independently published statistics and putting in the necessary work to calculate them release after release is one of many things we should be thankful for to LWN.

Local stats

With that in mind it's only logical to explore calculating local subsystem statistics. Global kernel statistics can only go so far. The top 20 can only, by definition, highlight the work of 20 people, and we have thousands of developers working on each release. The networking list alone sees around 700 people participate in discussions for each release.

Another relatively recent development which opens up opportunities is the creation of the lore archive. Specifically how easy it is now to download and process any mailing list's history. LWN stats are generated primarily based on git logs. Without going into too much of a sidebar – if we care about the kernel community not how much code various corporations can ship into the kernel – mailing list data mining is a better approach than git data mining. Global mailing list stats would be a challenge but subsystems are usually tied to a single list.

netdev stats

During the 6.1 merge window I could no longer resist the temptation and I threw some Python and the lore archive of netdev into a blender. My initial goal was to highlight the work of people who review patches, rather than only ship code, or bombard the mailing list with trivial patches of varying quality. I compiled stats for the last 4 release cycles (6.1, 6.2, 6.3, and 6.4), each with more data and metrics. Kernel developers are, outside of matters relating to their code, generally quiet beasts so I haven't received a ton of feedback. If we trust the statistics themselves, however — the review tags on patches applied directly by networking maintainers have increased from around 30% to an unbelievable 65%.

We've also seen a significant decrease in the number of trivial patches sent by semi-automated bots (possibly to game the git-based stats). It may be a result of other push back against such efforts, so I can't take all the full credit :)

Random example

I should probably give some more example stats. The individual and company stats generated for netdev are likely not that interesting to a reader outside of netdev, but perhaps the “developer tenure” stats will be. I calculated those to see whether we have a healthy number of new members.

Time since first commit in the git history for reviewers
 0- 3mo   |   2 | *
 3- 6mo   |   3 | **
6mo-1yr   |   9 | *******
 1- 2yr   |  23 | ******************
 2- 4yr   |  33 | ##########################
 4- 6yr   |  43 | ##################################
 6- 8yr   |  36 | #############################
 8-10yr   |  40 | ################################
10-12yr   |  31 | #########################
12-14yr   |  33 | ##########################
14-16yr   |  31 | #########################
16-18yr   |  46 | #####################################
18-20yr   |  49 | #######################################

Time since first commit in the git history for authors
 0- 3mo   |  40 | **************************
 3- 6mo   |  15 | **********
6mo-1yr   |  23 | ***************
 1- 2yr   |  49 | ********************************
 2- 4yr   |  47 | ###############################
 4- 6yr   |  50 | #################################
 6- 8yr   |  31 | ####################
 8-10yr   |  33 | #####################
10-12yr   |  19 | ############
12-14yr   |  25 | ################
14-16yr   |  22 | ##############
16-18yr   |  32 | #####################
18-20yr   |  31 | ####################

As I shared on the list – the “recent” buckets are sparse for reviewers and more filled for authors, as expected. What I haven't said is that if one steps away from the screen to look at the general shape of the histograms, however, things are not perfect. The author and the reviewer histograms seem to skew in the opposite directions. I'll leave to the reader pondering what the perfect shape of such a graph should be for a project, I have my hunch. Regardless, I'm hoping we can learn something by tracking its changes over time.

Fin

To summarize – I think that spending a day in each release cycle to hack on/generate development stats for the community is a good investment of maintainer's time. They let us show appreciation, check our own biases and by carefully selecting the metrics – encourage good behavior. My hacky code is available on GitHub, FWIW, but using mine may go against the benefits of locality? LWN's code is also available publicly (search for gitdm, IIRC).