The H2 Wiki


blogpost

A Withering Glance: Map merging in Haskell

The problem

The module Data.Map is a widely-used data structure for storing key-value pairs in great generality (well, reasonable generality: for nearly all purposes the type of keys has to have an instance of Ord). As such, it is the rough equivalent of Python’s dicts; it is a good persistent, functional alternative to hash tables for many purposes.

Part of its popularity, no doubt, comes from the versatility with which they can be combined key-by-key. This is most familiar in the classic set-theoretic operations of union, intersection, and difference, but more bespoke operations are frequently useful. For example, a data structure representing counts of different kinds of objects might be implemented internally as Map k Int, and then unionWith (+) is a combination function which adds everything up. For example, here’s how to work out how many pets of various sorts two people have between them:

λ> import qualified Data.Map as M
λ> myPets = M.fromList [("cats", 3), ("dogs", 2)]
λ> yourPets = M.fromList [("cats", 2), ("hamsters", 1)]
λ> M.unionWith (+) myPets yourPets
fromList [("cats",5),("dogs",2),("hamsters",1)]

Map merging in the past

Up until the mid-2010s, versions of the containers library had an efficient merge operation with a fairly lengthy type:

mergeWithKey :: Ord k => (k -> a -> b -> Maybe c) -> (Map k a -> Map k c) -> (Map k b -> Map k c) -> Map k a -> Map k b -> Map k c

The last three types are the two input maps and the output map, and the documentation had this to say about the first three inputs:

When calling mergeWithKey combine only1 only2:

  • if a key is present in both maps, it is passed with both corresponding values to the combine function. Depending on the result, the key is either present in the result with specified value, or is left out;
  • a nonempty subtree present only in the first map is passed to only1 and the output is added to the result;
  • a nonempty subtree present only in the second map is passed to only2 and the output is added to the result.

The documentation had this warning about the only1 and only2 tactics (emphasis as in the original):

The only1 and only2 methods… must return a map with a subset (possibly empty) of the keys of the given map

This restriction is not just there to reflect the semantics, but is a safety requirement. Since the results of these tactics are inserted as complete chunks into the corresponding part of the resulting data structure, a bad choice of these methods (such as M.mapKeys (+ 1000000)) could result in an invalid map: a binary tree with keys on the left that are greater than keys on the right.

There are further possibilities for breaking things that are so perverse that the writers of the library may not have even thought of them. For example, one could take only1 to be the function

\m -> fmap (const (M.size m)) m

which replaces every value of m by the number of entries in m. This is bad because the answers depend on the size of the subtree it’s handed, which in turn depends on how the shapes of the trees happen to match up. In a compliant implementation, every value of the output should depend only on the key and the value of the input.

Another irritating feature is the specific nature of it: the tactics for a merge have Data.Map built in to them, and hence only work for Data.Map; this problem is discussed more later.

Map merging in the present

This interface changed some years ago, from version 0.5.8.1 of containers, released in August 2016.

The signature of the function, now simply known as merge, became

merge :: Ord k => SimpleWhenMissing k a c	   -- What to do with keys in m1 but not m2
               -> SimpleWhenMissing k b c      -- What to do with keys in m2 but not m1
               -> SimpleWhenMatched k a b c	   -- What to do with keys in both m1 and m2
               -> Map k a
               -> Map k b
               -> Map k c

In fact (besides a small change to the underlying algorithm) this embodies several changes, which I’ll summarise in turn:

Encapsulation

As can be seen, we now have specially named types for the merge tactics. In the case of SimpleWhenMatched, this mostly just has the effect of giving types names that explain what they do. The definition of SimpleWhenMatched is almost (but, for reasons to be explained shortly, not quite) as follows:

newtype SimpleWhenMatched k a b c = SimpleWhenMatched
  { matchedKey :: k -> a -> b -> Maybe c }

and that is just a newtype for the previous version’s combine argument.

Specialisation of WhenMissing

The SimpleWhenMissing tactics, on the other hand, are genuinely different: the definition is almost (but, again, not quite) as follows:

data SimpleWhenMissing k a b = SimpleWhenMissing
  { missingSubtree :: Map k a -> Map k b
  , missingKey :: k -> a -> Maybe b }

In other words, we lend the algorithm a hand by telling it what to do not just on missing subtrees, but on individual missing keys (in case they occur not as a subtree, but as the root of a tree).

This is duplicate information, provided only for efficiency. Indeed, in a compliant implementation each method should be fully determined by the other:

missingSubtree = M.mapWithKey missingKey
missingKey k v = M.lookup k . missingSubtree $ M.singleton k v

In order to assist the user, a number of default tactics with sensible names are provided, such as preserveMissing and dropMissing.

Genericity

You may have been wondering what’s so simple about SimpleWhenMissing. The answer is that merging was made significantly more general, and in fact merge is just a special case of a more general function taking values in any applicative functor:

mergeA :: (Applicative f, Ord k) => WhenMissing f k a c      -- What to do with keys in m1 but not m2
                                 -> WhenMissing f k b c      -- What to do with keys in m2 but not m1
                                 -> WhenMatched f k a b c    -- What to do with keys in both m1 and m2
                                 -> Map k a
                                 -> Map k b
                                 -> f (Map k c)

If f is Identity we just get merge back, and SimpleWhenMissing is WhenMissing Identity, and SimpleWhenMatched is WhenMatched Identity, and it’s the presence of the Identity newtypes which explains my deliberate omissions above.

This function mergeA has a bewildering array of potential uses for map combining.

You want to take a union (preferring the second map’s value in case of matches), but keep count of the number of matching keys as you go? That’s no problem: you can do it with the applicative functor (,) (Sum Int).

You want to do the same thing, but actually keep track of the values that have been lost, thus taking a union and an intersection simultaneously, but still only performing only one traverse of the tree structures? Sure you can: you can do it with the applicative functor Complex (or, more verbosely, the applicative functor Identity :*: Identity).

You want to list all possible maps obtained by taking a union with any possible convention for preferring one value over the other in case of a match? No problem, that’s just the list functor []:

λ> import qualified Data.Map as M
λ> import qualified Data.Map.Merge.Strict as M
λ> compromises = M.mergeA M.preserveMissing M.preserveMissing (M.zipWithAMatched (\_ r s -> if r == s then [r] else [r,s]))
λ> m1 = M.fromList [("breakfast","cereal"), ("lunch","soup"), ("dinner","stew"), ("drink","beer")]
λ> m2 = M.fromList [("lunch","sandwich"), ("dinner","stew"), ("drink","wine")]
λ> compromises m1 m2
[fromList [("breakfast","cereal"),("dinner","stew"),("drink","beer"),("lunch","soup")],
 fromList [("breakfast","cereal"),("dinner","stew"),("drink","beer"),("lunch","sandwich")],
 fromList [("breakfast","cereal"),("dinner","stew"),("drink","wine"),("lunch","soup")],
 fromList [("breakfast","cereal"),("dinner","stew"),("drink","wine"),("lunch","sandwich")]]

Safety

The new implementation does provide some increased safety for the novice user, since only the standard tactics are imported by default and these are guaranteed safe to use, while the power to create new tactics is locked away in an Internal library. This security-through-indirection has become commonplace in Haskell and it’s no bad thing.

However, while the default tactics provided are theoretically fully general, the generality comes from using functions which work key-by-key. There are natural tactics which can’t be implemented efficiently using the defaults: an example is the WhenMissing tactics, for the example above of calculating a union and intersection simultaneously, which are built using a preserveMissing on the first coordinate and a dropMissing on the second.

Specificity

The situation for genericity actually got slightly worse with these changes. There are many other data structures which store arbitrary key-value pairs:

The issue is that all of these can be provided with good tactic-based merge functions, all of which are (to varying degrees) faster than the naive ones, but implementing each requires basically the entire tactics code to be copied. This has been done for IntMap. If only it could be done once and for all in a way that works nicely for anything resembling a map! (But read on…)

Map merging in the future?

To summarise the above, the present merge framework has some downsides:

Is it possible to provide a framework for map merging which is fully generic (that is, will work for any fairly similar key-value data structure), fully safe (that is, without any possibility of illegal tactics), fully succinct (that is, requiring no duplicate labour), and equally efficient in practice? I believe so, and here’s how.

The main ingredient is the witherable package, by Fumiaki Kinoshita, which builds on top of the indexed-traversable package, to supply some useful typeclasses for maps.

Abstracting merge

To warm up, I’ll explain how to abstract the merge operation, even though the real goal is of course mergeA. An instance m of the Filterable typeclass (such as Map) has a filtering operation:

mapMaybe : (a -> Maybe b) -> m a -> m b

There is an “indexed” variant, FilterableWithIndex, which has an indexed filtering operation:

imapMaybe : (i -> a -> Maybe b) -> m a -> m b

This time, Map i is an instance of FilterableWithIndex i: the method imapMaybe is M.mapMaybeWithKey. An imapMaybe is theoretically precisely the sort of thing you might do to make a compliant SimpleWhenMissing tactic, but is inefficient in practice.

However, a reasonable way of defining a SimpleWhenMissing tactic to be something that operates on all FilterableWithIndex instances:

data Filtering i a b = Filtering {
  runFilter :: forall m. FilterableWithIndex i m => m x -> m y }

What examples are there? Well, we can use any functions that are available on any FilterableWithIndex, which includes any of the superclasses of FilterableWithIndex (including Filterable, Foldable, and Functor, and their indexed versions).

So we can

Filtering (imapMaybe f)

to do a very general merge tactic (equivalent to mapMaybeMissing) in the current library, and

Filtering (imap f)

(using imap from FunctorWithIndex) to get an equivalent of mapMissing, which doesn’t delete any elements and hence runs faster since it does not rebalance the tree, and

Filtering id

to do the equivalent of preserveMissing in the standard library and runs very fast indeed since it just copies the tree. The neat thing is that none of these require pre-prepared tactics: they are just functions that work on a FilterableWithIndex such as Map.

This tells us what to do on a map, but what about the equivalent of missingKey? The solution here is to build an instance of FilterableWithIndex that can only store one key:

newtype IMaybe i x = IMaybe {
  iMaybe :: Maybe (i,x) }

instance FilterableWithIndex i (IMaybe i) where
  imapMaybe f (IMaybe a) = IMaybe (itraverse f =<< a)

We can recover missingKey by inspecting the action of a Filtering on an IMaybe (in the reasonable hope that the optimising compiler strips out the abstraction).

Hence one correct answer might be: one proper type for a SimpleWhenMissing tactic is, in fact, a Filtering. But we won’t do that, because abstracting mergeA is the goal.

Abstracting mergeA

The problem of abstracting mergeA is similar, but requires working in a different typeclass. Recall that an instance m of the Traversable typeclass has a method

traverse :: Applicative f => (a -> f b) -> m a -> f (m b)

The Witherable typeclass can be thought of as a common generalisation of Filterable and Traversable; it has a method

wither :: Applicative f => (a -> f (Maybe b)) -> m a -> f (m b)

which generalises filter if we take f to be Identity, and generalises traverse in the sense that

traverse k = wither (fmap Just . k)

The WitherableWithKey typeclass is an indexed variant of that:

wither :: Applicative f => (i -> a -> f (Maybe b)) -> m a -> f (m b)

Using this typeclass, we can imitate the previous section (and also build the applicative functor action in):

newtype Withering f i x y = Withering {
  runWither :: forall n. WitherableWithIndex i n => n x -> f (n y) }

This is the proper type for a WhenMissing tactic, and allows you to readily define tactics which operate with the right level of efficiency. The types Map and IMaybe as defined above also have a WitherableWithIndex implementation, so they can be acted upon by a Withering.

Dropping subtrees

There’s one tactic I have refrained from mentioning amidst the optimism of the last two sections: the difficulty of efficiently dropping a subtree (dropMissing in the current library). This can be accommodated, but I feel that to do so nicely will involve a small amount of library redesign.

The issue is that mapMaybe (\_ -> Nothing) is a constant for most instances of Filterable (and perhaps it should be required to be a constant in any well-behaved instance). This is desirable, since a good, efficient dropMissing should quickly provide an empty map for the structure.

This wants the creation of a new class, something like

class HasEmpty f where
  empty :: f v
  null :: f v -> Bool

Strictly speaking, the provision of null is not relevant. In practice, however, testing whether a data structure is empty is possible pretty much exactly when it’s possible to generate an empty data structure, and it’s a useful thing to have around (more fundamental than equality testing, since, for example we only have (Eq k, Eq v) => Eq (Map k v), but we can test if a map is empty without any conditions on the keys or values).

This class HasEmpty should be a superclass of Filterable: if you can filter elements out, you can filter everything out and produce an empty structure. Having done so, we can easily define a dropping tactic:

dropping :: (Applicative f) => Withering f i x y
dropping = Withering (\_ -> pure empty)

I think there are probably some intrinsic advantages to putting such basic functionality in a typeclass anyway.