ArrayList versus CovariantCollection

(A case study in Fortress performance analysis)

We noticed that creating a large list using a list comprehension was very expensive---much more expensive than we would have expected (certainly more expensive than creating an equivalently-sized array). Why?

We needed to understand the performance of ArrayLists (the default List implementation in List.fss) and test a proposed remedy against it. We had a few hypotheses:

  • ArrayList allocates quite a bit of extra storage for each singleton leaf list. Perhaps this is proving overly expensive for large list sizes. We rejected this hypothesis early on, as using this extra storage actually speeds up large list creation quite a bit; it's really more of a concern if we create many small lists.
  • ArrayList construction using a comprehension might be scaling non-linearly. This is because we allocate an ArrayList for each list element, then successively append them treewise. Perhaps we miscalculated the copy overhead of doing this. This was our most plausible hypothesis.
  • ArrayLists use inefficient copy loops and a teensy bit of mutable storage to permit sharing of sublists and to avoid extra allocation. Possibly this was proving too expensive, and we should consider alternate implementations.

Step 0: Find a baseline for comparison

As a basis for comparison, we wrote CovariantCollection.fss, which is nearly the most naive possible implementation of a parallel comprehension: it consists of a singleton element, an empty element, and two covariant collections appended together. The only operation it supports other than append is flattening into an array (for this purpose we cache the collection size at each append node). A CovariantCollection occupies strictly more memory than an ArrayList when the data it contains is non-trivial in size. It's relatively easy to take the resulting collection, convert it to an ImmutableArray, and then convert that into a well-formed ArrayList so that we can check our work.

Any acceptable implementation of list comprehensions should be at least as fast as CovariantCollection.

Step 1: Write some evaluation code

In order to see how ArrayList and CovariantCollection scale with increasing list size, we need to write a bit of test code that times list creation at various sizes. This makes use of the timing library Timing.fss, which collects timing information in a form suitable for printing. We choose to create a geometric progression of list sizes, starting with size 1. The code, which can be found in CovCollTest.fss, looks roughly like this:

Loading...

There's a bit more printing code in the actual program, plus some code to warm the system up when the problem size is small (this turned out to eliminate some noise in the small measurements; we get pretty consistent results from run to run).

Note that the above code scales trials by powers of two. Initially we actually started with powers of ten, using the results to prototype our data analysis and understand roughly what's going on; we refined this to yield more data points as the code and the measurement infrastructure got more stable (we moved first to exp(i)before continuing on to powers of 2).

The definition of outtime in particular is worth noting. Observe that in addition to the observed time, we also produce scaled versions of the time divided by n and by n log n. These provide a useful first sanity check: if the code is not obeying these space bounds, the scaled numbers should increase steadily as problem size grows. Notice, also, that we construct c'n'p as we go (this is short for "covariant and parallel", for stupid reasons). This digests the results into a tab-separated file suitable for loading into a spreadsheet or graphing package.

Discover limits of program usage

Before increasing the number of samples, we did some experiments to figure out the outer limits of our data structures. With a 2 gigabyte heap (enormous!), CovariantCollection performance drops suddenly and dramatically between half a million and a million elements. This is due to heap exhaustion and GC thrashing. Note that this doesn't speak well to the per-cell overhead of our data structure: there are n-1 append nodes and n singleton nodes, and each takes about 1K of storage! Maybe with more careful accounting we can cut that by a couple of factors of two, but practically speaking it's way too much space. Beyond this point ArrayLists are faster than CovariantCollections.

Fix trivial performance problems in comparison code

We did a couple of trials of various different array creation protocols for CovariantCollection and discovered that there were dramatic differences in speed between them. The initial implementation was about twice as fast as ArrayList, but alternate implementations ran as much as five times as fast. We quickly discovered that the differences could be explained by the amount of run-time type inference that was occurring in the course of tree traversal (oddly, type inference during tree construction, which was also O(n), didn't appear to be creating trouble). The types involved were covariant, and the covariance constraints between different type variables needed to be enforced. All of this was easily solved by turning back on a cache from dynamic argument types to fully inferred and instantiated function closures (we use 2-3 different instantiations all told, we just use one of them repeatedly). This cache had been put in early on, then thrown out because it wasn't making a noticeable difference to performance. It now makes an enormous difference, both here and in PureList.fss as well (the latter is now approximately as fast as !List itself, so the two can be used interchangeably).

Figure out whether behaviors are O(n) or O(n lg n)

None of the executions was blatantly non-linear. But there was definitely a kink in both curves: initial performance was slow, then we reached a regime where we had steady-ish looking ratios. But steady enough? We decided to do some curve fitting to find out. After fiddling about with a spreadsheet for a bit, we ended up using gnuplot at Jon Rafkind's recommendation (I'd forgotten that gnuplot could fit curves). We started out first by plotting the data on a log/log plot, then when we were happy with that did the same with linear and logarithmic curves. We ended up with something like the following:

[Hmm, need to recover a version off backup.]

After quite a few more experiments (it's here that we started increasing the number of data points, along with finally deploying bug fixes from the previous section) it became quite clear that a linear fit worked well for both curves. However, we learned a few important things about doing good curve fitting. The most important is that error estimates matter! Note that because our graph is a log/log plot, the smallest and largest points differ by about 6 orders of magnitude. If we fit a curve using simple least squares, we effectively end up fitting the largest two points and ignoring the rest. After a bunch of thrashing around (eg trying to fit a horizontal line to the normalized points, throwing away constant factors, inverting coefficients, and much other nonsense) we found better documentation for gnuplot's fit command. It turns out the third argument is an error estimate. Here we should make the error estimate proportional to the run time. We tried various other estimates in an effort to understand the command, and this seems to yield good fits that actually look accurate.

There was one final issue: the results looked a bit piecewise-linear (so the points in the middle lifted substantially away from the fitted line, while the low and high ends were close). Was this due to the effects of parallelism kicking in? We'd been running on two processors. We tried a second set of runs on a uniprocessor, and fit and plotted both sets of points on a common graph:

set logscale
set title "Scaling of List.fss and CovariantCollection.fss"
set ylabel "Time(s)"
# set yrange [0.005:1000]
set key 25,500
set xlabel "Number of elements in list"
cum = .0001; cuc = .005
lum = .0004; luc = .0004
cnm = .0001; cnc = .005
lnm = .0004; lnc = .0004
fcn(n) = cnc + (n * cnm)
fln(n) = lnc + (n * lnm)
ucn(n) = cuc + (n * cum)
uln(n) = luc + (n * lum)
fit fcn(x) 'CovColl.dat' using 1:3:(($3)) via cnm,cnc
fit fln(x) 'CovColl.dat' using 1:2:(($2)) via lnm,lnc
fit ucn(x) 'CovColl-1T.dat' using 1:3:(($3)) via cum,cuc
fit uln(x) 'CovColl-1T.dat' using 1:2:(($2)) via lum,luc
plot 'CovColl.dat' using 1:2 title '2T List' with points, \
     'CovColl.dat' using 1:3 title '2T CC' with points, \
     'CovColl-1T.dat' using 1:2 title '1T List' with points, \
     'CovColl-1T.dat' using 1:3 title '1T CC' with points, \
     (fln(x)) title '2T Linear List fit', \
     (fcn(x)) title '2T Linear CC fit', \
     (uln(x)) title '1T Linear List fit', \
     (ucn(x)) title '1T Linear CC fit'
print "According to linear fit, CC is better by ", (lnm/cnm), " on 2 procs."
print "                                  and by ", (lum/cum), " on 1 proc."
print "Approximate 2P speedup for CC   is ", (lum/lnm)
print "Approximate 2P speedup for List is ", (cum/cnm)

The results are shown below. Reassuringly, we get a speedup, and constant factors are in the same ballpark; alas, while the fit is better it still isn't great for sizes from 8-128 or so. Note that we can use the fitted curve parameters to approximate speedups, both from one data structure to the other and from 1 processor to 2. Note also that the 1P -> 2P speedup is only about 60%; it's interesting to speculate why this might be, whether it's due to serial bottlenecks or gratuitous parallelism (we see slowdowns due to the latter with non-parallel codes running on a multiprocessor using our present interpreter).

According to linear fit, CC is better by 4.80964196257999 on 2 procs.
                                  and by 4.49570607298715 on 1 proc.
Approximate 2P speedup for CC   is 1.57307435902548
Approximate 2P speedup for List is 1.68292239852781

Still to do

This experiment suggests that there's room for improvement in our choice of List data structures. That said, it's worth seeing what effect compiled and flattened object layouts will have before we attempt too much further mucking about. Best list options are likely to be wide-fanout fixed-depth trees and their ken; PureList is actually the right sort of structure, but its fanout is narrow (at most 4) whereas fanouts that increase leaf width look like they would be beneficial (that constant factor at small list sizes is painful).

Note on converting the image

The right way to produce readable graphs from gnuplot and turn them into pngs appears to be to save the pdf image, then use ImageMagick's convert utility as follows:

convert CovColl.pdf -geometry 640x480 CovColl.png

Perhaps if I'd tweaked native graph sizes in gnuplot I could have made a smaller yet still-readable version, but I had no luck with various other conversion incantations for getting a png, nor with Save As png in Preview.

Attachments