Who broke the build?

At GTAC 2013 (the Google Test Automation Conference), two Googlers described a system they are building to figure out who broke the build.

https://www.youtube.com/watch?v=SZLuBYlq3OM

I highly recommend you watch the talk as it's only 15 minutes long and has some practical implications for anyone who uses CI. 

To summarize some of their ideas:

  • What you do when a build breaks depends on what kind of test breaks the build.
    • If it's a unit test, you are in luck because you can run them within minutes for each of the changes.*
    • If it's a "medium" test (e.g. running 8 minutes or less), you can binary search over the range of changes: re-run the test at the midpoint, then recurse into the half that contains the failure until you find the change that broke it.
    • If it's a "large" test, then you are out of luck because some of these tests can take hours to run and it's infeasible to run them over and over again. This is when an engineer has to manually investigate the changes and figure out who broke the build.
  • The solution... is to use heuristics: essentially rules of thumb that work most of the time. They score each CL, and the CL with the highest score is the most "suspected" of having broken the build. The neat part is that they show data on how accurately their system ranked the change that actually broke the build, and it was pretty darn accurate (I think the culprit landed in roughly the top 1% of suspects in most cases). That means it saved Googlers from looking at about 99% of the changes when manually identifying who broke the build.
  • The two heuristic patterns that they implemented, although they mentioned there are potentially others:
    • Looking at the "amount" of changes. This is pretty straightforward: the more a CL changes, the more potential it has to introduce a regression. It's a simple heuristic but it seems to be effective.
    • Looking at the dependency tree. The closer a change was to the core, the less they suspected it of breaking the build, for two reasons: 1) people who worked on core libraries that were depended on throughout Google were more likely to be careful and had stricter code review processes, and 2) if a key dependency was broken, it was highly likely to be discovered by another team at Google first, since it was so widely depended upon.
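
To make the scoring idea concrete, here is a minimal sketch of how the two heuristics could be combined into a ranking. Everything in it is my own assumption rather than anything shown in the talk: the Change fields, the distance_to_test measure (hops in the dependency graph from the failing test to the changed code, so a larger value means closer to the core), and especially the formula, which just rewards big changes and penalizes core ones.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Change:
    """A hypothetical record for one CL in the suspect range."""
    cl_id: str
    lines_changed: int      # heuristic 1: the "amount" of change
    distance_to_test: int   # heuristic 2: dependency hops from the failing test


def suspicion_score(change: Change) -> float:
    # Made-up formula: larger changes score higher, and changes deeper in
    # the dependency tree (closer to the core) score lower.
    return change.lines_changed / (1 + change.distance_to_test)


def rank_suspects(changes: Iterable[Change]) -> List[Change]:
    # Most suspicious CL first.
    return sorted(changes, key=suspicion_score, reverse=True)


suspects = rank_suspects([
    Change("cl1", lines_changed=10, distance_to_test=0),
    Change("cl2", lines_changed=200, distance_to_test=9),
    Change("cl3", lines_changed=50, distance_to_test=1),
])
print([c.cl_id for c in suspects])
```

With these invented numbers, the 200-line change to a core library (cl2) ends up ranked below the 50-line change sitting one hop from the test (cl3), which is the trade-off the second heuristic describes.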

*Sidenote: In their talk they use the term CL (changelist), which seems to be a concept from the Perforce family of SCM systems. In Git, the closest equivalent is a commit (basically a set of changes); however, I think it's different because in an IDE like IntelliJ you can create a changelist without making a commit.
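
For the "medium" test case, the binary search over changes can be sketched roughly as below. This is an illustration under my own assumptions, not the speakers' implementation: passes_at is a hypothetical callback that runs the test with everything up to and including the given change applied, the test is assumed to pass before the first change and fail at the last one, and once the build is broken it is assumed to stay broken.

```python
from typing import Callable, Sequence


def first_breaking_change(
    changes: Sequence[str],
    passes_at: Callable[[str], bool],
) -> str:
    """Return the earliest change at which the test fails."""
    lo, hi = 0, len(changes) - 1  # invariant: the culprit is in changes[lo..hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if passes_at(changes[mid]):
            lo = mid + 1  # still green at mid, so the culprit came later
        else:
            hi = mid      # already red at mid, so the culprit is mid or earlier
    return changes[lo]


# Toy example: eight changes, with the break introduced by the sixth one.
changes = [f"cl{i}" for i in range(1, 9)]
runs = []


def passes_at(cl: str) -> bool:
    runs.append(cl)            # record each (simulated) test run
    return int(cl[2:]) < 6


culprit = first_breaking_change(changes, passes_at)
print(culprit, len(runs))
```

The point is the cost: the number of test runs grows with the logarithm of the number of changes, which is tolerable for an 8-minute test but not for one that takes hours.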