An index is something that lets you get at data quickly. I consider a B-Tree as well as a sorted array an index. The time to lookup an element in an array is O(log2 N) whereas that in a B-Tree is O(logB N), where N & B mean their usual things.
-
If we just consider static structures (sorted array for example), and we wish to compress the data but maintain the retrieval speed, one can think of compressing blocks of size B and then performing a binary search over blocks. There are O(N/B) such blocks, so the cost of performing a binary search is O(log2 N/B), which involves decompressing O(log2 (N/B)) blocks of size B each.
Instead of B, if we chose O(log2 N) elements to fit in a block (intuitively, assume that O(log2 N) = kB, for some constant k > 0), we land up with O(N/log2 N) blocks, each of size O(log2 N), and the asymptotic cost to query remains the same, but we decompress O(log2 (N/log2 N)) blocks this time, which is asymptotically better than before. -
If OTOH, we sample out every O(log2 N)th element from the sorted array, and create an uncompressed array from these elements, and make each element point to a compressed block containing all the elements between this element and the next element, then each element points to a compressed block containing O(log2 N) elements.
Queries can be answered by performing a normal binary search on the array of sampled elements, and then at most 1 decompression operation is sufficient to locate the element in the compressed block. This is better than the O(log2 (N/log2 N)) blocks we had to decompress in the earlier scheme. The asymptotic cost of a query remains the same but we gain a lot by performing exponentially fewer decompression operations!!
Credits to Anirban for this explanation: Another way to look at it is to imagine a B-Tree built from these sampled elements, and each node in this B-Tree points to a compressed block containing O(log2 N) elements. Alternatively, this can also be seen as a B-Tree built in such a way that internal nodes are stored uncompressed, but the leaf nodes are stored compressed.
No comments:
Post a Comment