Sunday, May 08, 2011

Faster Javascript String matching

Ever had to match a lot of strings (say article titles or names of people) to a user entered query? I'm sure you always:
  1. Looped around the values and
  2. Performed a substring search on these strings

Well, it turns out that both are suboptimal. Let us for a minute assume that the data you have is relatively static (doesn't change too much over time) and that queries really should be satisfied quickly. If that is the case, then you would be better off:
  1. Maintaining all the data to be queried in a single string (delimited by a $ or # sign - something that is not expected to be present in the data itself)
  2. Performing a regular-expression match on the aforementioned string

Assuming that the regular expression engine is optimized (and mostly they are), a regexp match would be much faster than a naive O(n2) search.

Let's look at some code and the resulting running time for either approach.

function random_string(len) {
    var s = '';
    for (var i = 0; i < len; ++i) {
        s += String.fromCharCode(Math.round(97 + Math.random()*26));
    }
    return s;
}

var rs = [ ];
var total_length = 0;
// Stores the start index of each string in rs.join('$');
var indexes = [ ];

for (var i = 0; i < 300000; ++i) {
    var s = random_string(40);
    rs.push(s);
    indexes.push(total_length);
    total_length += (s.length + 1);
}

function test_01(rs) {
    // Returns the number of elements in 'rs' that match the regexp /ret/
    var re = /ret/g;
    var cnt = 0;
    for (var i = 0; i < rs.length; ++i) {
        var m = rs[i].match(re);
        if (m) {
            cnt += 1;
        }
    }
    return cnt;
}

function test_02(rs) {
    // Returns the number of elements in 'rs' that match the regexp /ret/
    var re = /ret/g;
    var s = rs.join('$');
    // Indexes of strings in 'rs' that matched
    var idx = [ -1 ];
    var ii = 0;
    var m = re.exec(s);
    while (m) {
        while (ii < indexes.length && indexes[ii] <= m.index) {
            ++ii;
        }
        if (idx[idx.length-1] != ii) {
            idx.push(ii);
        }
        m = re.exec(s);
    }
    return idx.length - 1;
}

d1 = new Date();
console.log("test_01:", test_01(rs));
d2 = new Date()
console.log("test_01 time:", d2-d1);

d1 = new Date();
console.log("test_02:", test_02(rs));
d2 = new Date()
console.log("test_02 time:", d2-d1);

This is the output that running this code produces:

test_01: 665
test_01 time: 935
test_02: 665
test_02 time: 257

Which means that the longer code is actually 2.63 times faster than the more beautiful looking code.

When the difference between 200ms and 300ms is felt by the user (for UI centric applications), this can be a huge improvement to the perceived responsiveness of the application.

No comments: