Wednesday, February 18, 2015

Searching faster than Binary Search

We all know that Binary Search is asymptotically the fastest way to search for data within a sorted (ordered) array in the comparison (or simple decision tree) model assuming nothing in particular about the data or the data types.

However, there's an entire class of algorithms and data structures that focus on how efficiently they utilize the host system's cache while processing data. This class of algorithms and data structures is called cache-efficient algorithms and data structures [pdf].

There are two types of cache-efficient algorithms and data structures:
  1. One class tunes the algorithm for a particular size of cache or cache hierarchy. The algorithm is aware of the cache sizes, the number of caching levels, and the relative speeds of each of these caching levels.
  2. Another class of algorithms is oblivious to the underlying cache sizes and layouts, yet is provably optimal for any cache size and caching layout (sounds magical, doesn't it!). These are called Cache Oblivious Algorithms. See this link on Cache Oblivious Algorithms for more details, and this link for more details on the model, and the assumptions made in the Cache Oblivious model.

  • An example of a cache-efficient algorithm that is also cache-oblivious is Linear Search.
  • An example of a cache-inefficient algorithm that isn't cache-oblivious is Binary Search.
  • An example of a cache-efficient data structure that isn't cache-oblivious is the B-Tree (since B is the tuning parameter for the particular machine on which we are running).
Without getting into the details, the complexity of running Binary Search on an array in the Disk Access Model (DAM) (where we only count the number of disk blocks read, not the number of comparisons made) is O(log(N/B)), since we must load a new block from disk on almost every probe until we reach a small enough sub-array (of size B) such that no further jumps within it trigger another disk I/O operation to fetch another disk block. The optimal complexity for searching ordered data on disk is achieved by laying out the data recursively in a static search tree, which reduces the cost to O(log_B N).

However, implementing that structure is somewhat complicated, and we should ask ourselves if there is a way to get the best of both worlds, i.e.:
  • Runtime efficiency of the cache-oblivious recursive layout, and
  • Implementation simplicity of the standard Binary Search algorithm on an ordered array

Turns out, we can reach a compromise if we use the square-root trick. This is how we'll proceed:

  1. Promote every sqrt(N)'th element to a new summary array. We use this summary array as a first-level lookup structure.
  2. To lookup an element, we perform binary search within this summary array to find the possible extents within the original array where our element of interest could lie.
  3. We then use binary search on that interesting sub-array of our original array to find our element of interest.

[Figure: the two-level layout, with the summary array [10, 22, 33, 43] on top of the original sorted array]
For example:

  • Suppose we are searching for the element '7': we first perform binary search on the top array, i.e. [10, 22, 33, 43], and identify that 7 must lie in the original array at an index no greater than that of 10. We then restrict our next binary search to the sub-array [5, 7, 8, 10].
  • Suppose we are searching for the element '22', we first identify the sub-array [11,21,22,25,26,33] as potentially containing a solution and perform binary search on that sub-array.

Even though we asymptotically perform the same number of element comparisons overall, our cache locality has gone up, since the number of block transfers we'll perform for a single lookup is now 2 log(√N / B), which is an additive factor of log B less than the log(N/B) we had for plain binary search on the sorted array.


You can find another similar idea named Cache Conscious Binary Search described here.

All the code needed for reproducing the results shown here can be found below:


# Columns:
# log(N) N Time/Iter(usec)[std::lower_bound] N Time/Iter(usec)[l2search]
23 8388608 0.460176 8388608 0.269376
24 16777216 0.621821 16777216 0.435599
25 33554432 0.735565 33554432 0.615605
26 67108864 0.856156 67108864 0.598811
27 134217728 0.959821 134217728 0.636344
28 268435456 1.552199 268435456 0.771024
29 536870912 1.158228 536870912 0.818436
30 1073741824 1.936374 1073741824 0.905859
(data.txt)
#include <iostream>
#include <algorithm>
#include <vector>
#include <sys/time.h>
#include <math.h>
#include <memory>
#include <assert.h>
#include <string.h>
#include <stdio.h>

using namespace std;

template <typename T>
struct Search2LevelArray {
  T *start, *end;
  size_t N;
  size_t sqrtN;
  std::unique_ptr<T[]> lookup;

  Search2LevelArray(T *_s, T *_e)
      : start(_s), end(_e), N(_e - _s), sqrtN(ceil(sqrt(N))) {
    this->preprocess();
  }

  // Sort the input and promote every sqrtN'th element into the
  // first-level 'lookup' (summary) array.
  void preprocess() {
    std::sort(start, end);
    if (N > 16 * 1024) {
      lookup.reset(new T[sqrtN - 1]);
      for (size_t i = 0; i < sqrtN - 1; i++) {
        size_t idx = (i + 1) * sqrtN - 1;
        if (idx >= N) {
          idx = N - 1;
        }
        lookup[i] = start[idx];
      }
    }
  }

  T* lower_bound(T const &x) {
    // For small arrays, plain binary search is already cache friendly.
    if (N <= 16 * 1024) {
      return std::lower_bound(start, start + N, x);
    }
    // First level: find the extent of blocks that may contain x.
    auto it = std::equal_range(lookup.get(), lookup.get() + sqrtN - 1, x);
    size_t b = (it.first - lookup.get()) * sqrtN;
    size_t e = (it.second + 1 - lookup.get()) * sqrtN;
    if (b >= N) return start + N;
    if (e > N) { e = N; }
    // Second level: binary search within the candidate sub-array.
    return std::lower_bound(start + b, start + e, x);
  }
};

int main(int argc, char *argv[]) {
  int reps = atoi(argv[1]);
  int N = atoi(argv[2]);
  int nQueries = atoi(argv[3]);

  vector<int> v;
  v.reserve(N);
  for (int i = 0; i < N; ++i) {
    v.push_back(i + 1023);
  }

  vector<int> queries;
  queries.reserve(nQueries);
  int mod = N;
  // mod = sqrt(N);
  for (int i = 0; i < nQueries; ++i) {
    queries.push_back(rand() % mod);
  }

  // std::sort(v.begin(), v.end());
  Search2LevelArray<int> l2search(&(v.front()), &(v.back()) + 1);
  size_t z = 0;
  struct timeval start, end;
  if (argc > 4 && !strcmp(argv[4], "l2search")) {
    gettimeofday(&start, NULL);
    for (int i = 0; i < reps; ++i) {
      for (int j = 0; j < nQueries; ++j) {
        auto one = l2search.lower_bound(queries[j]) - &(v.front());
        z += one;
      }
    }
  } else {
    gettimeofday(&start, NULL);
    for (int i = 0; i < reps; ++i) {
      for (int j = 0; j < nQueries; ++j) {
        z += std::lower_bound(v.begin(), v.end(), queries[j]) - v.begin();
      }
    }
  }
  gettimeofday(&end, NULL);
  double startUsec = start.tv_sec * 1000000 + start.tv_usec;
  double endUsec = end.tv_sec * 1000000 + end.tv_usec;
  std::cerr << z << std::endl;
  std::cerr << "Time taken: " << (endUsec - startUsec) << " u-second" << endl;
  std::cerr << "Time per iteration: " << (endUsec - startUsec) / (nQueries * reps)
            << " usec" << endl;
  printf("%d\t%d\t%f\n", static_cast<int>(ceil(log(N) / log(2.0))), N,
         (endUsec - startUsec) / (nQueries * reps));
}
(lb_test.cpp)
# You can execute this script online along with the data in
# data.txt by visiting http://gnuplot.respawned.com/
# Scale font and line width (dpi) by changing the size! It will always display stretched.
set terminal svg size 400,300 enhanced fname 'arial' fsize 10 butt solid
set output 'out.svg'
# Key means label...
set key inside bottom right
set xlabel 'log(N)'
set ylabel 'Time/Iter (usec)'
set title 'Time to search for a key in a static sorted array of size N'
plot "data.txt" using 1:5 title 'l2Search' with linespoints, \
"data.txt" using 1:3 title 'std::lower\_bound' with linespoints
#!/bin/bash
# Run as:
# ./run_lb_test.sh            # to run using std::lower_bound
# ./run_lb_test.sh l2search   # to run using the 2-level search algorithm
set -e
set -u
REPS=10
NQUERIES=100000
N=$((8 * 1024 * 1024))
g++ lb_test.cpp -O3 -lm -o lb_test -std=c++0x
while [ $N -lt 2000000000 ]
do
  ./lb_test $REPS $N $NQUERIES "$@"
  N=$((N*2))
done
(run_lb_test.sh)
