November 5, 2009: Treaps Compile!
I recently wrote a small implementation of treaps in Fortress. If you look in the Fortress repository you'll see a couple of versions of the code. source:/trunk/Library/Treap.fss is a standalone treap library for use with the interpreter. source:/trunk/ProjectFortress/compiler_tests/TreapAndTest.fss is the same library with some testing code. This latter works in the compiler as of this week, and marks the first large data structure to compile and run with the compiler.
Why Treaps are Interesting
A treap is a form of randomly-balanced binary search tree. Like any binary search tree, each node is labeled with a key k. Since we're implementing a key/value map, each node also contains the unique value v corresponding to that key. The key of a given node is strictly greater than the keys of nodes in its left subtree, and strictly less than the keys of nodes in its right subtree. We find the node containing a particular key as follows:
Note that sometimes a particular key is not contained in the tree, and we return the Empty node instead.
In addition to a key and a value, a treap node contains a weight w. A treap is heap ordered with respect to the weight: in our implementation, the weight of a node is larger than the weight of any node in either of its subtrees. Each time we insert a new key/value pair into the tree, we choose a weight for its node uniformly at random:
leaf1(key: ZZ32, val: String): TreapNode =
Node(Empty, randomZZ32(2147483647), key, val, Empty)
There's lots more to know about treaps, and you can learn more from the Wikipedia entry, and find links to code and publications on Cecilia Aragon's treap page.
Of particular interest to parallel programmers, though, is the work that Guy Blelloch and Margaret Reid-Miller did on fast parallel set operations using treaps. They provide a particularly simple and clear formulation of set operations in terms of two primitive operations on a treap:
split(key)
splits a treap at key, and returns a triple of a treap whose keys are less than key, the node containing key (or Empty), and a treap whose keys are greater than key.
join(r)
joins this treap to the treap r, all of whose keys are strictly larger than the keys in this treap.
The split operation is no more complicated than the nodeWithKey operation above:
This code is simple because treaps do not need to be dynamically balanced! Most familiar binary search tree data structures have a balance invariant that must be maintained. For example, AVL and red-black trees require that left and right subtrees have approximately the same depth (though each achieves this goal in a rather different way); the tree data structures found in the Set and Map libraries in Fortress are size-balanced. Regardless of the balance metric used, however, in general we need to perform rotations to bring the tree into balance when we add or remove nodes.
Tree rotations are relatively easy to do when we're inserting or removing a single node at a time. Things get more complicated when we want to perform set operations on our trees. Why do we care? Parallelism! We want to structure our computations to produce individual pieces of data in parallel, then combine them together. Alternatively, we can perform parallel bulk operations on large sets and maps that already exist.
With treaps, we don't need to worry about rebalancing: we just put the node with the largest weight on top. In the split code above, that means we don't need to reorganize the tree at all: the node we're looking at always has the largest weight in its subtree, so it forms the root of whichever result it belongs in.
Joining two treaps works just the same way: we put the node with the larger key at the top. The code is a bit more complicated when we write it using straight OO style, because we need to detect and handle empty subtrees while avoiding type errors:
Fundamentally, when we're joining two Nodes, we put the one with the larger weight on top, and push the other one down the appropriate subtree. Note that we can assign Empty nodes the smallest possible weight by convention and simplify the above code somewhat---however, in doing so, we'll end up copying the right spine unnecessarily in some cases.
Why treaps are balanced
Treaps are of course only probabilistically balanced. Do they give us the asymptotic behavior we would expect in the common case? They do. The best way to understand the argument is by analogy to quicksort. Quicksort is expected O(n lg n) when we choose the pivot uniformly at random from the n inputs. Now, imagine we label each of the n data points with a weight w, and choose the data point with the largest weight as the pivot in each recursive sorting step. Note that all the remaining data has a weight chosen uniformly at random from the interval [0,w). We can think of our treap as representing the pattern of work in the quicksort; each key represents a choice of pivot value, the left subtree is the left subproblem of quicksort, and the right subtree is the right subproblem of quicksort.
Note that this analogy is imperfect: it doesn't tell us why split and join operations might be efficient, for example. That requires us to argue that the tree has expected depth O(lg n), and then look at the cost of pulling a tree apart along a path (split) or joining the right boundary of one tree with the left boundary of another (join). But once we know the expected tree heights, these arguments are actually reasonably straightforward---and any tree that results from any combination of operations turns out to look just like a tree that we might have constructed from first principles using our modified quicksort algorithm.
Caveats in the current implementation
There are a few caveats here: we're wrestling with a type checker bug that prevents us from separately compiling the Treap library and its test code. Right now the Treap library implements maps from ZZ32 keys to String values; ultimately we want the library to be generic in both the key and value types, and we are well on our way to achieving this goal. Once that happens, expect a flurry of activity on the libraries front as we move most of the existing Fortress code over into the compiled world.
Another caveat is that we can't return tuples from functions yet (though we can do immediate tuple bindings (a,b,c) = (x,y,z) ). As a result the split method has been split into three: splitL, nodeWithKey, and splitR.
More to come
In a future post, I'll discuss how combining operations work, and I'll also say a few words about the leaf node optimization you'll find in the repository code.

rss

Comments
No comments.