Sunday, May 08, 2011

Faster Javascript String matching

Ever had to match a lot of strings (say article titles or names of people) to a user entered query? I'm sure you always:
  1. Looped around the values and
  2. Performed a substring search on these strings

Well, it turns out that both are suboptimal. Let us for a minute assume that the data you have is relatively static (doesn't change too much over time) and that queries really should be satisfied quickly. If that is the case, then you would be better off:
  1. Maintaining all the data to be queried in a single string (delimited by a $ or # sign - something that is not expected to be present in the data itself)
  2. Performing a regular-expression match on the aforementioned string

Assuming that the regular expression engine is optimized (and mostly they are), a regexp match would be much faster than a naive O(n2) search.

Let's look at some code and the resulting running time for either approach.

function random_string(len) {
    var s = '';
    for (var i = 0; i < len; ++i) {
        s += String.fromCharCode(Math.round(97 + Math.random()*26));
    }
    return s;
}

var rs = [ ];
var total_length = 0;
// Stores the start index of each string in rs.join('$');
var indexes = [ ];

for (var i = 0; i < 300000; ++i) {
    var s = random_string(40);
    rs.push(s);
    indexes.push(total_length);
    total_length += (s.length + 1);
}

function test_01(rs) {
    // Returns the number of elements in 'rs' that match the regexp /ret/
    var re = /ret/g;
    var cnt = 0;
    for (var i = 0; i < rs.length; ++i) {
        var m = rs[i].match(re);
        if (m) {
            cnt += 1;
        }
    }
    return cnt;
}

function test_02(rs) {
    // Returns the number of elements in 'rs' that match the regexp /ret/
    var re = /ret/g;
    var s = rs.join('$');
    // Indexes of strings in 'rs' that matched
    var idx = [ -1 ];
    var ii = 0;
    var m = re.exec(s);
    while (m) {
        while (ii < indexes.length && indexes[ii] <= m.index) {
            ++ii;
        }
        if (idx[idx.length-1] != ii) {
            idx.push(ii);
        }
        m = re.exec(s);
    }
    return idx.length - 1;
}

d1 = new Date();
console.log("test_01:", test_01(rs));
d2 = new Date()
console.log("test_01 time:", d2-d1);

d1 = new Date();
console.log("test_02:", test_02(rs));
d2 = new Date()
console.log("test_02 time:", d2-d1);

This is the output that running this code produces:

test_01: 665
test_01 time: 935
test_02: 665
test_02 time: 257

Which means that the longer code is actually 2.63 times faster than the more beautiful looking code.

When the difference between 200ms and 300ms is felt by the user (for UI centric applications), this can be a huge improvement to the perceived responsiveness of the application.

Computing the Cumulative sum of a range of elements in a binary tree

Let us assume the alternative definition of a BST (Binary Search Tree), namely: "The left subtree shall contain elements less than or equal to the element at the root and the right subtree shall contain elements greater than or equal to the element at the root".

This means that if we want to compute the cumulative sum of all numbers ≤ to a given number, we need to update some metadata every time a node is added to the tree or removed from the tree (and for a balanced BST also when rebalancing occurs). Fortunately, this implementation of the AVL Tree allows us to do just this by way of hook functions that you can pass to it via the constructor.

We define a function binary_tree_summer that computes the cumulative sum of a node and its child nodes and stores in the sum attribute of the node.

To query the cumulative sum, we define a function query_sum that basically employs the following strategy (given our definition of the BST above):
  • If the value at the node is ≤ than the value we are looking for, we take the cumulative sum at the node, subtract from it the sum in it's right subtree (since we don't know how many elements in the right subtree will match our query) and then add the result of querying the right subtree
  • If the value at the node is > the value we are looking for, we are assured that the complete range lies in the node's left subtree, so we query the left subtree

You can see that at every stage, we decide to go either left or right, which means that our runtime is bounded by the height of the tree, which fortunately in case of an AVL Tree is at most O(log n).

The code for such a use-case is shown below:

function cumulative_sum() {
    var items = [ 4, 9, 2, 5, 4, 2, 1, 2, 3, 2, 1, 7, 3, 2 ];

    function binary_tree_summer(node) {
        var _s = (node.left ? node.left.sum : 0) + 
         (node.right ? node.right.sum : 0) + node.value;
        node.sum = _s;
    }

    var tree = new algo.AVLTree(algo.cmp_lt, binary_tree_summer);
    for (var i = 0; i < items.length; ++i) {
        tree.insert(items[i]);
    }

    function query_sum(node, value) {
        // We do the query like we do for a Segment Tree
        if (!node) {
            return 0;
    }

    if (node.value <= value) {
        var sub = node.right ? node.right.sum : 0;
        return node.sum - sub + query_sum(node.right, value);
    }
    else {
        // node.value > value
        return query_sum(node.left, value);
    }
}

console.log("<= 2:", query_sum(tree.root, 2));
console.log("<= 5:", query_sum(tree.root, 5));

Let's look at summing all numbers ≤ 2. The diagram below shows us how the call flow looks like. Note:
  • The green lines are the paths followed down the root
  • The red lines are the paths that are rejected (not followed)
  • The numbers in blue above each node denote the value of the sum attribute for that node


The sum is obtained in the following manner for values ≤ 2: 47-39+4


The diagram below shows us the call flow for summing all numbers ≤ 5.


The sum is obtained in the following manner for values ≤ 5: 47-39+39-25+25-16

Binary Trees with Repeated elements

In school/college, we all learnt about Binary Trees and while we were taught how to implement them, the implicit assumption was that elements would not be repeated. However, in the real-world, we need to deal with repeated elements all the time. Most standard library implementations of popular languages (read Java) provide a TreeMap that is reduced to something very powerless because it doesn't support repeated elements!! The C++ std::multimap class however supports repetitions and is IMHO much more powerful than any other Balanced Binary Tree implementation that I have come across.

Today, I shall make an attempt to de-mistify the process of creating a Binary Tree with repeated elements. It's really simple and I think that the process can be summarized in one sentence: "Instead of the left subtree containing elements strictly less than the element at the root and the right subtree containing elements strictly greater than the element at the root, the left subtree shall contain elements less than or equal to the element at the root and the right subtree shall contain elements greater than or equal to the element at the root". Simple, isn't it!!

Now, let's see why it works.

The simple case of asking whether an element exists or not can be easily implemented the same way as before. If the element is present, we shall hit it while traversing down the tree.

Finding the first instance of an element (if it exists) is called locating the lower_bound of that element. This can be accomplished by running down the left of the node till you have an element at the node that is greater than or equal to the element to be found. We keep a track of the node every time we take a left turn. Othwerise (the element at the node is less than the element we are interested in), we go down the right. Since the lower_bound of an element is never less than the element itself, we don't keep track of nodes when we take a right turn.

Finding one-past the last instance of an elemnt (if it exists) is called locating the upper_bound of that element. This can be accomplished by running down the left of the node till you have an element at the node that is greater than the element to be found. We keep a track of the node every time we take a left turn. Othwerise (the element at the node is less than or equal to the element we are interested in), we go down the right. Since the upper_bound of an element is never less than or equal to the element itself, we don't keep track of nodes when we take a right turn.

On the following pages, can find very high quality implementations of a Red-Black Tree and an AVL Tree respectively that both support repeated elements as well as the lower_bound and upper_bound functions.

No special handling is required during re-balancing the tree during an insertion or a deletion.

Here is a sample instance of the AVL Tree after the following elements have been inserted into it in this same order (the numbers after the - sign are just to disambiguate nodes with the same key while plotting the tree):

4, 9, 2, 5, 4, 2, 1, 2, 3, 2, 1, 7, 3, 2