At work (Groq) we recently upgraded the version of GHC that we use from 8.10 to 9.6, along with many of the Haskell packages we depend on. I wrote about the changes to our code that this required in “Upgrading from GHC 8.10 to GHC 9.6 – an experience report”. That article touches on the risks associated with being forced to make many changes at once: if a problem occurs it can be hard to diagnose and it can be hard to fix. After that article was written I discovered that such a problem, albeit a benign one, had occurred. The present article explains the situation.
We use Nix’s nixpkgs to provide GHC and Haskell packages. Consequently, we upgraded GHC and our Haskell packages at exactly the same time by upgrading to a later version of nixpkgs, specifically all in the same commit to our source code repository. I’m not familiar with Nix or nixpkgs so I can’t explain exactly why we followed the approach of upgrading GHC and all packages at the same time, but it was deemed to be the most straightforward of a variety of different options. Regardless, that means we experienced a “flag day”. Paraphrasing Wikipedia’s definition and elaboration:
A flag day is a change which requires a complete conversion of a sizable body of software. The change is large and expensive, and—in the event of failure—similarly difficult and expensive to reverse.
The situation may arise if there are limitations on backward compatibility and forward compatibility among system components, which then requires that updates be performed almost simultaneously (during a “flag day cutover”). This contrasts with the method of gradually phased-in upgrades, which avoids the disruption of service caused by en masse upgrades.
In the language of my previous article, a “flag day cutover” is required when incorporating a large number of “breaking fixes”. By contrast, “forward compatible mitigations” can be made as “gradually phased-in upgrades”.
We experienced a “failure” of a sort after our flag day, albeit a benign one: the failure of a golden test, that is, a test that checks for exact equality of the output of a particular function or program. Golden tests often over-specify behaviour, and that was the case here too: the output artifact under test was a program for Groq’s LPU (Language Processing Unit), and although the binary contents of the program file differed before and after flag day, the program produced the same results, so the test was too strict.
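For concreteness, here is a minimal sketch of what such a test can look like, using the tasty-golden package (not necessarily what we use at work; assembleProgram is a hypothetical stand-in for the pipeline stage that produces the program file):

    import Test.Tasty (defaultMain)
    import Test.Tasty.Golden (goldenVsString)
    import qualified Data.ByteString.Lazy as LBS

    -- Hypothetical stand-in for the pipeline stage whose output
    -- artifact is under test.
    assembleProgram :: IO LBS.ByteString
    assembleProgram = pure (LBS.pack [0x7f, 0x45, 0x4c, 0x46])

    main :: IO ()
    main = defaultMain $
      -- Fails if the output differs byte-for-byte from the contents
      -- of the golden file, even when the difference is harmless.
      goldenVsString "assembler output" "golden/program.bin" assembleProgram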
So, although there was a test failure, it was not actually indicative of a bug. There was an easy fix: update the golden test. But I was curious: what caused the difference in behaviour? It was very difficult to determine because the behaviour change occurred across a flag day boundary! If we had been able to upgrade GHC and each package independently we could have made those upgrades as separate commits to our repository and subsequently I could have discovered which package caused the test failure by bisecting through the repository history.
On the other hand, the situation could have been much worse: if the upgrade of one of our dependencies had caused a bug then we would have been in a difficult position! There would be no way to test the upgrades individually to determine which caused it (or indeed, which of our breaking fixes accidentally introduced it).
My impression is that upgrading a compiler and dependencies using predetermined package snapshots where all packages change at once, whilst an extremely convenient way of avoiding dependency conflicts, is very awkward when it forces you to incur a flag day. Is there perhaps a way of getting the best of both worlds: a predetermined package snapshot that contains two versions of each package, which can be upgraded independently of each other? I would be interested to know.
It doesn’t really matter what the exact cause was, because the change was ultimately of no consequence. However, I wanted to know for the sake of my own interest.
Because I couldn’t bisect through the repository history I resorted to eyeballs and intuition. After comparing the new output to the golden test expected output I concluded that the difference must be due to a different iteration order in a particular stage of our assembler pipeline. Looking at the data structures involved in that stage, the only one which I felt could undergo a change in iteration order was HashMap.
The iteration order of HashMap is determined by the hashes of the keys. Had the Hashable instance of the key type changed? Indeed yes. In hashable’s changelog, the version 1.4.3.0 entry contains “Change hashInt to mix bits more”, and our key hashes ultimately depended on hashInt.
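The dependence of iteration order on hashes is easy to observe: the order in which toList or keys yields entries is neither the insertion order nor the key order, so it can change when the hash function changes. A minimal sketch:

    import qualified Data.HashMap.Strict as HM

    main :: IO ()
    main = do
      let m = HM.fromList [(k, ()) | k <- [1 .. 10 :: Int]]
      -- The order of the printed keys is determined by their hashes,
      -- so upgrading hashable can reorder this output even though the
      -- map's contents are unchanged.
      print (HM.keys m)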
To see this in practice, hashWithSalt 1 (13 :: Int) is 16777630 under hashable version 1.3.0.0 (which we were using before flag day) and -6919028725695267684 under version 1.4.3.0 (which we are now using after flag day).
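That comparison can be reproduced with a one-line program, compiled once against each version of hashable (a sketch; the two results are the values quoted above, the second assuming a 64-bit platform):

    import Data.Hashable (hashWithSalt)

    main :: IO ()
    main = print (hashWithSalt 1 (13 :: Int))
    -- Prints 16777630 under hashable-1.3.0.0
    -- and -6919028725695267684 under hashable-1.4.3.0.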
To be clear: I am not blaming hashable. Firstly, there was no bug in the first place, rather the failure of an overly-strict golden test. Secondly, hashable states very clearly that
Hashable does not have a fixed standard. This allows it to improve over time. Because it does not have a fixed standard, different computers or computers on different versions of the code will observe different hash values.
I am not completely certain that this is the cause of the issue because I can’t test the change to hashable in isolation! But I am fairly confident.