compressed image deltas

back in the day...

Vista was a landmark release for Microsoft in almost every way, and the beloved MS Paint is no exception - now, users could undo a total of 10 consecutive operations instead of only 3! Such a half-assed attempt to portray Paint as a vaguely modern editor only serves to highlight how artificial these limitations are, especially when even 1980s-era hardware is capable of so much more. So why, then, did Microsoft do it?

As it turns out, Paint's algorithms for backing up the canvas are ridiculously inefficient. For instance, creating a long diagonal line will always back up every single pixel within its minimum bounding rectangle! Additionally, Paint makes no attempt to compress any data, leading to very inefficient representations of those few pixels that actually do matter. Considering all this, it's almost incredible that Paint could ever undo anything at all, especially on those old 386 systems with 4 MB of RAM that it was originally designed for.

Since this is obviously unacceptable to anyone with even a single braincell, I'd like to do better. After all, every megabyte saved is a megabyte that can be used later down the line - this is particularly important considering how frequently my editor uses comparatively expensive binary32 samples to represent colors.

representing spatial regions

First, let's try only backing up pixels that have actually been changed by the current operation. Since both spatial trees and hash tables can easily represent non-contiguous spatial regions, let's implement and compare the two in terms of both CPU and memory usage. The test here is simple - emulate the access patterns of flood fill by appending an NxN grid of elements!

While we could likely get a noticeable speedup by filling the tree in Z-order, that isn't always feasible in practice. As such, this test can be considered a worst-case scenario for such a data structure.

alternative 1: spatial trees

While all balanced spatial trees provide O(log N) insertion and retrieval, they differ heavily in terms of memory overhead and real-world performance. A tree based around quadkey values minimizes memory overhead by removing the need to store keys, as quadkeys are easily computable on-the-fly using only very inexpensive bitwise operations. Additionally, such a tree is balanced by default, providing a performance advantage over those spatial trees which require active rebalancing.

An additional performance increase can be obtained by using a pointer-bump allocator rather than repeatedly calling malloc or new. Such an allocator would ideally also allow returning 32-bit addresses, as this can save a significant amount of space on 64-bit systems.
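
To make this concrete, here's a minimal sketch of what such an allocator might look like - one easy way to get 32-bit "addresses" on a 64-bit system is to hand out offsets into a single pre-reserved slab. The class and member names here are purely illustrative rather than Pixie's actual implementation, and real code would want growth handling beyond this.

  // Minimal bump allocator sketch: allocations are 32-bit offsets into one
  // contiguous slab, resolved back into pointers on demand. Illustrative only.
  #include <cstdint>
  #include <cstdlib>
  #include <new>

  class BumpArena {
  public:
      explicit BumpArena(std::uint32_t capacity)
          : base_(static_cast<std::uint8_t*>(std::malloc(capacity))),
            capacity_(capacity), used_(0) {
          if (!base_) throw std::bad_alloc();
      }
      ~BumpArena() { std::free(base_); }

      // Bump the cursor and return a 32-bit offset; UINT32_MAX signals "full".
      std::uint32_t alloc(std::uint32_t size, std::uint32_t align = 8) {
          std::uint32_t p = (used_ + align - 1) & ~(align - 1); // round up to alignment
          if (p + size > capacity_) return UINT32_MAX;
          used_ = p + size;
          return p;
      }

      // Offsets are turned back into pointers against the slab base.
      void* resolve(std::uint32_t offset) const { return base_ + offset; }

  private:
      std::uint8_t* base_;
      std::uint32_t capacity_;
      std::uint32_t used_;
  };

Since the nodes of an undo entry can all be thrown away at once when the entry is discarded, an arena like this never needs to free individual nodes.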

Insertion and retrieval with such a quadtree is rather simple. For a 2^N-ary tree with even N, each branch recursively divides the space into a 2^(N/2) x 2^(N/2) grid of descendants indexed by the subsequent N bits of the quadkey. Since the value of N here dictates the tree fanout (and therefore height), it has a significant impact on performance - taller trees require more instructions to traverse, and increase the length of the inherent load-dependency chain.
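
As a rough illustration (not Pixie's actual code), here's how a quadkey might be computed by interleaving the coordinate bits, and how a 16-ary (2^4) tree would then consume 4 key bits per level during lookup. The node layout, the use of 32-bit arena indices, and the levels parameter are all assumptions made for this sketch.

  // Sketch: build a quadkey by interleaving the bits of (x, y), then walk a
  // 16-ary (2^4) tree that consumes 4 key bits per level. Node layout and
  // the use of 32-bit arena indices are assumptions for illustration.
  #include <cstdint>

  // Interleave the low 16 bits of x and y into a 32-bit Morton-style quadkey.
  static std::uint32_t quadkey(std::uint32_t x, std::uint32_t y) {
      auto spread = [](std::uint32_t v) {
          v &= 0xFFFF;
          v = (v | (v << 8)) & 0x00FF00FF;
          v = (v | (v << 4)) & 0x0F0F0F0F;
          v = (v | (v << 2)) & 0x33333333;
          v = (v | (v << 1)) & 0x55555555;
          return v;
      };
      return spread(x) | (spread(y) << 1);
  }

  struct Node16 {
      std::uint32_t child[16]; // 32-bit node indices from the bump arena; 0 = empty
  };

  // Descend one level per 4 key bits, most significant digit first.
  // For a canvas of side 2^K, a 16-ary tree needs K/2 levels.
  std::uint32_t find(const Node16* nodes, std::uint32_t root,
                     std::uint32_t key, int levels) {
      std::uint32_t cur = root;
      for (int i = levels - 1; i >= 0 && cur != 0; --i) {
          std::uint32_t digit = (key >> (i * 4)) & 0xF;
          cur = nodes[cur].child[digit];
      }
      return cur; // 0 if this pixel was never backed up
  }

For a 4096x4096 canvas the quadkey has 24 significant bits, so the 16-ary tree is 6 levels deep where the 4-ary variant needs 12 - exactly the fanout/height tradeoff described above.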

Although 64-ary and larger trees have a bit too much overhead for my taste, let's try both 4-ary and 16-ary trees!

Grid fill simulation test, 4-ary tree
CPU                              1024x1024   2048x2048   4096x4096
Zen 3 - Ryzen 5700x                  29 ms      119 ms      590 ms
Apple M3 - M3 Pro                    40 ms      112 ms      562 ms
Pentium 3 (Tualatin 1.13GHz)        439 ms     1806 ms     7686 ms

Grid fill simulation test, 16-ary tree
CPU                              1024x1024   2048x2048   4096x4096
Zen 3 - Ryzen 5700x                  18 ms       65 ms      276 ms
Apple M3 - M3 Pro                    27 ms       66 ms      277 ms
Pentium 3 (Tualatin 1.13GHz)        294 ms     1420 ms     5188 ms

alternative 2: hash tables

Hash tables are funny. While the O(N) cost of rehashing is often cited as a significant performance concern, it's rare enough (and amortized) in real-world usage that we can focus primarily on the O(1) "hot path". Let's start by choosing our algorithms for theoretically ideal cache performance. This means using open addressing to reduce the total number of accesses and indirections compared to separate chaining, and using linear probing to reduce the number of jumps between cache lines.
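
For reference, here's a bare-bones sketch of that design - open addressing over a flat, power-of-two-sized array with linear probing, keyed on packed (x, y) coordinates. The hash mix, the bucket layout, and the lack of growth/rehashing are simplifications of my own, and alignment will usually pad this bucket to 16 bytes unless it's packed.

  // Bare-bones sketch: open addressing over a flat power-of-two array with
  // linear probing, keyed on packed (x, y). Illustrative only; no rehashing.
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  struct Bucket {
      std::uint64_t key;   // packed coordinates; 0 means "empty slot"
      std::uint32_t pixel; // backed-up color value
  };

  class PixelMap {
  public:
      explicit PixelMap(std::size_t capacity_pow2)
          : slots_(capacity_pow2), mask_(capacity_pow2 - 1) {}

      void insert(std::uint32_t x, std::uint32_t y, std::uint32_t pixel) {
          // Tag bit keeps every real key nonzero so 0 can mean "empty".
          std::uint64_t key = (1ull << 63) | (std::uint64_t(x) << 32) | y;
          std::size_t i = hash(key) & mask_;
          while (slots_[i].key != 0 && slots_[i].key != key)
              i = (i + 1) & mask_;   // linear probe: stay on the same cache line as long as possible
          slots_[i] = {key, pixel};
      }

  private:
      static std::size_t hash(std::uint64_t k) {
          // 64-bit finalizer-style mix; any decent avalanching mix works here.
          k ^= k >> 33; k *= 0xff51afd7ed558ccdULL; k ^= k >> 33;
          return static_cast<std::size_t>(k);
      }

      std::vector<Bucket> slots_;
      std::size_t mask_;
  };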

Grid fill simulation test, hash table
CPU                              1024x1024   2048x2048   4096x4096
Zen 3 - Ryzen 5700x                  21 ms      115 ms      645 ms
Apple M3 - M3 Pro                    27 ms      154 ms      938 ms
Pentium 3 (Tualatin 1.13GHz)        784 ms     3331 ms     out of memory

Unlike with the quadtree, testing the 4096x4096 canvas on the Pentium 3 crashes due to running out of memory! Although 256MB of RAM should be more than enough to store the 2^24 12-byte buckets required, open addressing requires allocating significantly more slots than are actually needed to prevent excessively long probing sequences. While switching to separate chaining would allow the table to fit in memory, this would come at the cost of further decreased performance.

Even when designing for cache efficiency, the inherently cache-unfriendly access patterns of hash tables prove far more costly than the load-dependency chain of the spatial tree. This is unfortunately unavoidable, since a well-written hash function produces a uniformly random distribution - something that directly defeats the spatial and temporal locality that caches rely on. At larger canvas sizes, the core quickly fills up with as many pending loads as it can track in parallel, becoming unable to accept any new instructions until it finally receives a response from a lower-level cache (or, god forbid, main memory).

comparison

Although the quadtree wins in both memory and performance compared to the hash table, we've yet to see how either performs against existing editors. Since the goal of these data structures is to optimize representation of sparse or non-contiguous operations, let's instead pick something unfavorable to us - let's test how much memory various editors need to be able to undo fill-bucketing a 10x10 region!

memory delta test, 10x10 fill bucket
Editor                            1024x1024   2048x2048   4096x4096
MS Paint (Windows 10)               3100 KB    12300 KB    48100 KB
GIMP (RGBA binary32 canvas) [1]    15700 KB    67000 KB   255600 KB
Pixie                              16 KB [2]       16 KB       16 KB

I wish I had something witty or condescending to say, but I'm just too horrified - GIMP and Paint simply give up and instead back up the ENTIRE FUCKING CANVAS to fill just a 10x10 region! [3] This isn't even an uncommon operation, either! Consider that 16MP is tiny for modern photography, and many cameras will use 16 or 32-bit color for their raw formats to preserve as much information as possible!

I can already hear someone saying "why didn't you test photoshop?" which I hope is out of morbid curiosity and not some delusion that it would ever do anything right. Regardless, here you are!

[image: an error dialog from Adobe Photoshop, with the message "Could not use the paint bucket because it does not work with 32 bit per channel images (convert image to 8 or 16 bit per channel to edit)"]
imagine being the one who wrote this dialog message. imagine the shame. imagine the countless nights spent desperate for sleep, desperate for an escape from the never-ending doubt and regret and sense of worthlessness.

THEY KNEW! They knew I was coming for them, and they tried to stop me!! But they never will! [4]

chunking

While the memory utilization here is fantastic, the performance still leaves quite a bit to be desired. Taking a hint from the impact of the dependency chain we saw earlier, we can reduce tree height by storing pixels in NxN "chunks" rather than individually - this will significantly increase performance with little to no memory overhead. 16x16 chunks are ideal for this, as they provide a good performance/memory tradeoff while still being small enough to allow the usage of 8-bit integers for size tracking.

However, we can't just back up every 16x16 region with any changes and be done with it. In the worst case, 255 out of those 256 pixels are unused, wasting almost 4KB per chunk! Instead, we can store a 256-bit marking set to track which pixels in the chunk have actually been changed. Additionally, manually tracking the most recently used chunk can bring significant performance improvements, since most operations exhibit a high degree of spatial locality.

Notably, treating the pixel data as a sparse set during construction is expensive [5], as it requires maintaining ordering between the pixel data and the marking set. To avoid this, pre-allocate the entire chunk and sparsify it only when finished.
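
A minimal sketch of that construction strategy, with illustrative names and types: back up pixels densely into a full 16x16 staging buffer alongside the 256-bit marking set, and only compact down to the marked pixels once the operation finishes.

  // Sketch of the construction strategy above: back up pixels densely into a
  // full 16x16 staging buffer plus a 256-bit marking set, then compact
  // ("sparsify") once the operation finishes. Names and types are illustrative.
  #include <cstdint>
  #include <vector>

  struct RGBA32F { float r, g, b, a; };   // one binary32 sample per channel

  struct ChunkBuilder {
      std::uint64_t marks[4] = {};         // 256-bit marking set, one bit per pixel
      RGBA32F staging[256];                // dense 16x16 backing store while drawing
      std::uint16_t count = 0;             // number of marked pixels

      // Record the pre-operation color of the pixel at local chunk coords (lx, ly).
      void backup(unsigned lx, unsigned ly, RGBA32F old_color) {
          unsigned bit = ly * 16 + lx;
          std::uint64_t mask = 1ull << (bit & 63);
          if (marks[bit >> 6] & mask) return;   // this pixel is already backed up
          marks[bit >> 6] |= mask;
          staging[bit] = old_color;
          ++count;
      }

      // Called once when the operation ends: keep only the marked pixels, in
      // marking-set order, so the dense staging buffer can be thrown away.
      std::vector<RGBA32F> sparsify() const {
          std::vector<RGBA32F> out;
          out.reserve(count);
          for (unsigned bit = 0; bit < 256; ++bit)
              if (marks[bit >> 6] & (1ull << (bit & 63)))
                  out.push_back(staging[bit]);
          return out;
      }
  };

Backing up a pixel is then just a bit test, a bit set, and a store; the ordering between the pixel data and the marking set only has to be established once, when the chunk is sparsified.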

To put into perspective just how significant these optimizations are, let's repeat the test from the previous section but with full binary32 pixel data instead of integers! This will significantly decrease the cache performance compared to the previous tests, but can be considered a "worst-case" scenario as the vast majority of operations will be simpler than this in practice.

quadtree RGBA grid fill test with chunking, Apple M3 Pro
                          1024x1024   2048x2048   4096x4096
no chunking                   55 ms      121 ms      586 ms
chunking, no MRU cache         9 ms       31 ms      129 ms
chunking + MRU cache           4 ms       16 ms       64 ms

Absolutely incredible. By spending only an insignificant fraction of a second, these editors could have avoided wasting hundreds upon hundreds of MB of RAM for no discernible reason whatsoever.

compression

Empirically, most digital art is constructed piecewise from a series of mostly monochromatic blocks, with artists relying instead on layers to achieve complex shading and lighting. This tends to create patterns that are highly compressible, which is incredibly fortunate considering how expensive binary32 image data is! Photographic data, on the other hand, is much harder to compress [6] and entirely outside the scope of this project (for now!).

With such relatively small datasets, run-length encoding and color indexing are more than enough for our purposes [7]. In certain circumstances, these compression algorithms even allow us to fit the entire chunk's data within the object itself (cf. small-string optimization)! This is especially useful when the canvas itself is paletted, as there's less data to back up.

Given these three flags of "RLE", "in-object", and "indexed", 6 of the 8 possible combinations are useful:
  1. uncompressed array
  2. indexed array
  3. indexed array, in-object
  4. RLE compression (indexed)
  5. "bucket fill" mode (single-run non-indexed RLE, in-object) [8]
  6. RLE compression (indexed), in-object

Of the two remaining combinations, RLE without indexing is impractical [9], and an uncompressed in-object array could only ever hold a single color - something the bucket-fill and in-object RLE modes already cover. As another bonus, in-object storage has a small performance benefit as well due to avoiding a level of pointer indirection, although this is generally negligible for most realistic canvas sizes.
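
To make the selection concrete, here's one way the mode tag and decision logic might look. Only the six modes and the 16-byte in-object budget come from the text above; the ordering of the checks and the size comparisons are my own guesses.

  // Sketch of a storage-mode tag and a simplified chooser for a finished chunk.
  // The thresholds and check ordering are illustrative assumptions.
  #include <cstddef>
  #include <cstdint>

  enum class ChunkMode : std::uint8_t {
      UncompressedArray,   // plain out-of-object color array
      IndexedArray,        // palette + per-pixel indices, out-of-object
      IndexedInObject,     // palette + indices small enough for the 16-byte payload
      RleIndexed,          // indexed runs, out-of-object
      RleIndexedInObject,  // indexed runs small enough for the 16-byte payload
      BucketFill,          // single color, stored directly in the payload
  };

  // pixel_count: number of marked pixels; palette_size: distinct colors;
  // rle_bytes / indexed_bytes: encoded sizes reported by the compressor.
  ChunkMode choose_mode(std::size_t pixel_count, std::size_t palette_size,
                        std::size_t rle_bytes, std::size_t indexed_bytes) {
      constexpr std::size_t kInObject = 16;            // payload budget from the chunk layout
      const std::size_t raw_bytes = pixel_count * 16;  // uncompressed RGBA binary32 data
      if (palette_size == 1)          return ChunkMode::BucketFill;
      if (rle_bytes <= kInObject)     return ChunkMode::RleIndexedInObject;
      if (indexed_bytes <= kInObject) return ChunkMode::IndexedInObject;
      if (rle_bytes < indexed_bytes && rle_bytes < raw_bytes)
                                      return ChunkMode::RleIndexed;
      if (indexed_bytes < raw_bytes)  return ChunkMode::IndexedArray;
      return ChunkMode::UncompressedArray;
  }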

Since the payload never exceeds 16 bytes (two 64-bit pointers, or a single RGBA binary32 color), our chunk structure is then as follows:
  • 32-byte marking set
  • 16-byte payload
  • 1-byte mode field
  • N-byte padding

Padding is 3 bytes on 32-bit platforms and 7 bytes on 64-bit platforms, leading to a chunk size of 52 or 56 bytes respectively. It might be worthwhile to round up to 64 bytes to allow more in-object storage, although I don't have enough evidence for this one way or the other.
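
In C++, one possible layout matching that description would be the following; the field names are mine, and the union arms are just the interpretations implied by the six storage modes.

  // One possible layout for the chunk described above. Illustrative only.
  #include <cstdint>

  struct Chunk {
      std::uint64_t marks[4];     // 256-bit marking set (32 bytes)
      union Payload {             // 16-byte payload, interpreted per 'mode'
          struct { std::uint64_t data, palette; } ptrs;  // out-of-object storage
          float color[4];                                // bucket fill: one RGBA binary32 color
          std::uint8_t inline_bytes[16];                 // in-object compressed data
      } payload;
      std::uint8_t mode;          // which of the six storage modes is in use
      // trailing padding: 3 or 7 bytes depending on the platform's uint64_t alignment
  };

  static_assert(sizeof(Chunk) <= 64, "a chunk should fit in a single cache line");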

With that, let's get to compressing! For demonstration, I've chosen some patterns that are very common, but still very favorable for compression:

compression ratios
dataset                            uncompressed   compressed   size reduction
8192x8192 bucket fill                   1056 MB        32 MB     1/33 (96.9%)
8192x8192 checkerboard dithering        1056 MB       104 MB    ~1/10 (90.1%)

Simple, but incredibly effective! Notably, these compression ratios stay mostly the same no matter how large of a canvas the patterns are repeated over.

There are, however, some performance concerns here. First, while these algorithms are simple and cheap, they're far from free - running them on the main thread can significantly increase the perceived latency of operations, especially on older systems. Second, since compression requires per-chunk dynamic allocation, we're once again in need of a pointer bump allocator if we want any realistic performance. Since meeting both of these requirements is far more difficult than it seems (more on that later), let's stick to a single-threaded implementation for now.

further work

And with that, we've done it! We've managed to make a highly efficient edit history that excels at representing sparse and dense patterns alike. Despite this success, however, there's still a lot more we can improve - I'm personally interested in implementing filtering algorithms to effectively compress photographic deltas, as well as finding effective ways to merge equivalent chunks and palettes to save even more memory.

Most important, however, is developing a pointer-bump allocator that works across multiple threads during compression. The most feasible tool for this task is actually a moving GC, the design of which will receive its own series of articles. Who knows - maybe, using a GC heap, we can also beat these editors in perceived latency even on systems as slow as an i486!

footnotes

  1. Although GIMP defaults to RGB24 and would normally perform identically to Paint, I chose a binary32 canvas simply because Pixie defaults to it. The process of obtaining these results is somewhat haphazard - GIMP's allocation practices appear very inconsistent without a real memory profiler, requiring one to repeat an operation and average out the results to approach anything accurate.
  2. Technically, these results are for a much smaller canvas than is listed - Massif simply refused to report the allocations otherwise, claiming they were below the profiling threshold. Luckily, since this operation is not dependent on canvas size (unlike on SOME editors...), this makes no difference!
  3. This can easily be verified by simply multiplying the number of pixels by the bit depth - Paint uses 3 bytes per pixel, and GIMP in this mode uses 16.
  4. They can, at the very least, stop me from profiling with Task Manager - Photoshop allocates a large heap upon startup, making its internal allocations somewhat opaque. Fortunately, that screenshot tells us all we need to know!
  5. Even on my Zen 3 machine with single-cycle popcnt, the frequency of the operation (4 times per pixel) leads to noticeable delays. This slowdown quickly becomes unbearable with anything even slightly slower.
  6. This is especially true considering how fundamentally different the editing workflow for photographs is. Rather than writing over transparent layer backgrounds, most editing instead comes in the form of transformations that affect an entire region of pixels equally, such as filters or the burn tool. Any reasonably effective compression for photographic deltas would probably have to take an entirely different approach than the one proposed here, likely compressing the difference between the two states instead of the old state in isolation. However, I've never tested this, so I can't be certain.
  7. The only other thing we can do is apply mathematical transformations to make reduction and deduplication more effective. For instance, PNG scanline filters serve exactly this purpose. The primary downside of many such algorithms is that finding the truly optimal settings requires trying *every single* combination.
  8. While the high cost of color data normally means that in-object mode is only feasible when the canvas data itself is paletted, 16 bytes is enough to store a singular binary32 color. This is actually far more useful than it might seem, since artists spend a LOT of time drawing on, and therefore backing up, large transparent regions.
  9. The lack of a palette means we must repeat expensive color data on every run, and any data that would benefit from RLE would almost certainly benefit from indexing as well.