Sunday, December 09, 2012

Better Index Compression

It's late, so I'll keep it short and to the point.

An index is something that lets you get at data quickly. I consider a B-Tree as well as a sorted array an index. The time to lookup an element in an array is O(log2 N) whereas that in a B-Tree is O(logB N), where N & B mean their usual things.

  1. If we just consider static structures (sorted array for example), and we wish to compress the data but maintain the retrieval speed, one can think of compressing blocks of size B and then performing a binary search over blocks. There are O(N/B) such blocks, so the cost of performing a binary search is O(log2 N/B), which involves decompressing O(log2 (N/B)) blocks of size B each.

    Instead of B, if we chose O(log2 N) elements to fit in a block (intuitively, assume that O(log2 N) = kB, for some constant k > 0), we land up with O(N/log2 N) blocks, each of size O(log2 N), and the asymptotic cost to query remains the same, but we decompress O(log2 (N/log2 N)) blocks this time, which is asymptotically better than before.
  2. If OTOH, we sample out every O(log2 N)th element from the sorted array, and create an uncompressed array from these elements, and make each element point to a compressed block containing all the elements between this element and the next element, then each element points to a compressed block containing O(log2 N) elements.

    Queries can be answered by performing a normal binary search on the array of sampled elements, and then at most 1 decompression operation is sufficient to locate the element in the compressed block. This is better than the O(log2 (N/log2 N)) blocks we had to decompress in the earlier scheme. The asymptotic cost of a query remains the same but we gain a lot by performing exponentially fewer decompression operations!!

    Credits to Anirban for this explanation: Another way to look at it is to imagine a B-Tree built from these sampled elements, and each node in this B-Tree points to a compressed block containing O(log2 N) elements. Alternatively, this can also be seen as a B-Tree built in such a way that internal nodes are stored uncompressed, but the leaf nodes are stored compressed.