Some thoughts on your (and my) thoughts.: 2011

Monday, October 24, 2011

What business do you want to be in?

Dr. Bender once mentioned in class that memories are created when one experiences ~~joy~~ strong emotions, and in the context of algorithms, when one discovers a solution to some problem.

About a month down, I've noticed that experiences are what I'm going to die with. I vividly remember this episode in school when we were in the middle of the gymnastics competition and were one short of someone on the roman rings. I decided to give it a shot (even though I had never tried it before) and 2 minutes later I found myself stuck "up there". I don't remember all the things that I did right or who eventually won, but I surely do remember getting stuck!

If you are from my school and remember this episode, please leave a comment. I'll be looking forward to hearing from you!

I want to be in the business of creating memories.

Edit: Fixed the incorrect quote in the first line.

Tuesday, September 20, 2011

The Integer (0/1) Bounded Knapsack Problem

Quoting the problem statement from Petr's blog:

"The bounded knapsack problem is: you are given n types of items, you have u_i items of i^th type, and each item of i^th type weighs w_i and costs c_i. What is the maximal cost you can get by picking some items weighing at most W in total?"

Here, we assume that the knapsack can hold a weight at most W.

We shall describe 3 solutions to this problem, each of which reduces the complexity of the solution by a certain amount.

O(W*n*M): The first solution is the standard knapsack solution solution, which just involves copying each item instance of every item type to get a total of u_i items of the i^th type, each of which represents a different item having the same w_i & c_i as the item of the i^th type.

It is easy to see that if we have a mean of M number of items of each type, the complexity of this solution is O(W * n * M)
O(W*n*log(M)): The second solution involves rewriting items such that we have a fewer number of items of each type to work with. We shall then solve the transformed problem using the same technique as the standard knapsack problem (as in the solution above).

Suppose we have 21 instance of the i^th item type, weighing w_i and costing c_i. We observe that 21 can be rewritten (using powers of 2) as:
21 = 1 + 2 + 4 + 8 + 6

We basically add powers of 2 till we exceed the sum we want to create (21 in this case). When we exceed the sum, we just add the difference (6 in this case) between the number (21 in this case) and the current sum we have (15 in this case = 1+2+4+8).

Now, the new items are created by combining 1, 2, 4, 8, and 6 items of the original instances. These newly created items will have the following weights and costs:

Number of items of the original type combined Weight Cost

1 1*w_i 1*c_i

2 2*w_i 2*c_i

4 4*w_i 4*c_i

8 8*w_i 8*c_i

6 6*w_i 6*c_i

Hence, we have effectively reduced the number of items from 21 to 5. In general, we replace the M in the solution above with log(M) to get a final complexity of O(W * n * log(M)).
O(W*n): The third and most efficient solution to this problem is the one that took me almost 3 days to figure out and understand. The short post on petr's blog actually contains all the information needed to understand it, but I'm not that comfortable reading terse explanations; especially when it's a new idea, so here is something more verbose.

There is also a paper and a section in a book dedicate to this problem, but I find them quite hard as well.

Let's assume that this row (in the DP table of Items (rows) v/s Weight (columns)) represents the best configuration (maximum cost) for the knapsack at every weight 0 .. W that includes all item types from 1 .. (k-1)

Weight 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Item (k-1) 0 5 5 7 15 15 19 20 21 23 30 33 42 42 49 51

So, the basic observation is that DP_{(k-1), w} represents the best solution including all items from 1 .. (k-1) (both inclusive) with a knapsack of weight (w).

To compute DP_{k, w}, we consider the possibility of including either 0, 1, 2, 3, 4, ..., u_i items of the i^th type to the previous best solution including all items from 1 .. (k-1). This gives us the following recurrence:

DP_{k, w} = max (
    DP_{(k-1), w},
    DP_{(k-1), w-1*w_k} + 1*c_k,
    DP_{(k-1), w-2*w_k} + 2*c_k,
    DP_{(k-1), w-3*w_k} + 3*c_k,
    ...,
    DP_{(k-1), w-u_i*w_k} + u_i*c_k)

An interesting observation here would be that we only deal with every w_k^th element from the previous array to decide on any element in the current array.

Let us for a minute ignore the fact that we have only a fixed number of items of each type and instead assume that we have an infinite number of items of each type. Let us assume that item i weighs 3 and costs 9. If we make that assumption, we see that the best solution at DP_{k, 12} is represented by taking the maximum of the cells colored in green below.

Weight 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Item (k-1) 0 5 5 7 15 15 19 20 21 23 30 33 42 42 49 51

36 34 37 32 42

What we have essentially done is added:
0*9 to DP_{(k-1), 12}
1*9 to DP_{(k-1), 9}
2*9 to DP_{(k-1), 6}
3*9 to DP_{(k-1), 3} and
4*9 to DP_{(k-1), 0}

The problem has reduced to finding the maximum value among all the values of the array that is formed by picking every w_k^th element from the previous best solution and adding some multiple of c_k to it.

We also notice that it doesn't matter what values we add to DP_{(k-1), w} (for the k^th item) as long as they are c_k apart. Hence, if the weight of the knapsack is W and the weight of the k^th item is w_k, we can add the following to every w_k^th element in the previous array.

Column to which value is added Value Added

0 (W/w_k)*c_k

1*w_k (W/w_k - 1)*c_k

2*w_k (W/w_k - 2)*c_k

3*w_k (W/w_k - 3)*c_k

If we now reintroduce the requirement that each item type will have only a bounded or fixed number of instances (i.e. u_k is bounded), then the problem reduces to finding the maximum value in every window of size u_k in the array we just computed. This can easily be done by pre-processing the array to maintain the largest element in every window of size u_k.

However, the actual value of the cost at any cell needs to be computed by adding back any differential that we may have subtracted to ensure that every entry is c_k apart and that the first entry is the one which has the maximum multiple of c_k subtracted from it. For the j^th array element in the current sub-array we are processing, this value will be (W/w_k - j)*c_k (since we have subtracted that much extra from those elements, we must now add it back).

Hence, we can safely say that we shall have w_k different sub-problems that we shall try to solve assuming that we can use a single computation for every w_k^th element.

This allows us to find the value at every cell of the (k x W) matrix is amortized time O(1) (we have amortized the cost of pre-computing the w_k arrays across all the element accesses). This makes the total running time of this algorithm O(W * n).

Number of items of the original type combined	Weight	Cost
1	1*w_i	1*c_i
2	2*w_i	2*c_i
4	4*w_i	4*c_i
8	8*w_i	8*c_i
6	6*w_i	6*c_i

Weight	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
Item (k-1)	0	5	5	7	15	15	19	20	21	23	30	33	42	42	49	51

Weight	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
Item (k-1)	0	5	5	7	15	15	19	20	21	23	30	33	42	42	49	51
	36			34			37			32			42

Column to which value is added	Value Added
0	(W/w_k)*c_k
1*w_k	(W/w_k - 1)*c_k
2*w_k	(W/w_k - 2)*c_k
3*w_k	(W/w_k - 3)*c_k

Thursday, July 14, 2011

Vegetable Ratatouille

Here we go!!

So, we'll be following this recipe for the most part. There are some diversions that I shall mention here. These are the only things I changed from the recipe above.

Tomatoes: We shall make a Ratatouille that has a red (tomato) gravy and the vegetables will be added to that gravy. For this, you'll need more tomatoes (at least 4 more).

Basil: I used dry basil instead of fresh basil, since I couldn't find any where I stay. I used lesser oil at every stage (use about half the amount of oil everywhere).

Olive Oil: Do NOT use Extra-Virgin Olive Oil for cooking. Do NOT use Olive-Pomace Oil since it doesn't have a good taste. Use ONLY Pure-Olive-Oil.

Salt: ALWAYS use Salt to-taste instead of measuring it since the saltiness of salt varies from region to region.

Ginger: I also use grated ginger (or ginger paste) along with the garlic pieces.

Bay Leaf: I also used bay-leaves (3 nos.) while sautéing the onions (along with the garlic and ginger).

After the onions are half done, add pieces of (at least) 4 tomatos and cook till the tomatoes are soft. Mash them to form a paste. Follow the recipe mentioned here. Once the other vegetables are cooked, add them to this tomato gravy and cook for about 3 mins. (or till steaming). Cover and let the vegetables soak in the smell for about 15 mins. Serve with rice (you can use brown rice since it tastes really good and is a healthier option when compared with white rice).

Wednesday, July 06, 2011

You know you are working with the right people when ...

... you get this in the post :) :)

Thanks @yegg :-)

Friday, June 10, 2011

A plea to all software developers: Use Units

Have you ever come across configuration files that read:

MAX_BUFFER_SIZE = 4096
MAX_TIMEOUT     = 20

WTF!!

4096 what?? bits, bytes, kilobytes, megabytes?

20 what?? mill-second, second?

Please, please make units part of the config. option and say:

MAX_BUFFER_SIZE_BYTES = 4096
MAX_TIMEOUT_MILLISEC  = 20

Sunday, May 15, 2011

Why DuckDuckGo is good for mother

My mother wastes a lot of time online visiting MFA sites because she can't differentiate between the legitimate ones and the scammy ones. I've set up DuckDuckGo for her as her default search engine and she's been complaining lesser about "Why does this site not have what I want?".

I also feel that DDG can have a bang page organized by target audience (such as Mother/Father/Student, etc...). For Mother, these would need to go in:

!allrecipes
!epicurious
!w
!yt
!lyrics
!gimages
!gmaps
!bloomberg
!reuters
!songmeanings
!news

It would surely make it easier for mother to navigate the shortcuts!!

Sunday, May 08, 2011

Faster Javascript String matching

Ever had to match a lot of strings (say article titles or names of people) to a user entered query? I'm sure you always:

Looped around the values and
Performed a substring search on these strings

Well, it turns out that both are suboptimal. Let us for a minute assume that the data you have is relatively static (doesn't change too much over time) and that queries really should be satisfied quickly. If that is the case, then you would be better off:

Maintaining all the data to be queried in a single string (delimited by a $ or # sign - something that is not expected to be present in the data itself)
Performing a regular-expression match on the aforementioned string

Assuming that the regular expression engine is optimized (and mostly they are), a regexp match would be much faster than a naive O(n²) search.

Let's look at some code and the resulting running time for either approach.

function random_string(len) {
    var s = '';
    for (var i = 0; i < len; ++i) {
        s += String.fromCharCode(Math.round(97 + Math.random()*26));
    }
    return s;
}

var rs = [ ];
var total_length = 0;
// Stores the start index of each string in rs.join('$');
var indexes = [ ];

for (var i = 0; i < 300000; ++i) {
    var s = random_string(40);
    rs.push(s);
    indexes.push(total_length);
    total_length += (s.length + 1);
}

function test_01(rs) {
    // Returns the number of elements in 'rs' that match the regexp /ret/
    var re = /ret/g;
    var cnt = 0;
    for (var i = 0; i < rs.length; ++i) {
        var m = rs[i].match(re);
        if (m) {
            cnt += 1;
        }
    }
    return cnt;
}

function test_02(rs) {
    // Returns the number of elements in 'rs' that match the regexp /ret/
    var re = /ret/g;
    var s = rs.join('$');
    // Indexes of strings in 'rs' that matched
    var idx = [ -1 ];
    var ii = 0;
    var m = re.exec(s);
    while (m) {
        while (ii < indexes.length && indexes[ii] <= m.index) {
            ++ii;
        }
        if (idx[idx.length-1] != ii) {
            idx.push(ii);
        }
        m = re.exec(s);
    }
    return idx.length - 1;
}

d1 = new Date();
console.log("test_01:", test_01(rs));
d2 = new Date()
console.log("test_01 time:", d2-d1);

d1 = new Date();
console.log("test_02:", test_02(rs));
d2 = new Date()
console.log("test_02 time:", d2-d1);

This is the output that running this code produces:

test_01: 665
test_01 time: 935
test_02: 665
test_02 time: 257

Which means that the longer code is actually 2.63 times faster than the more beautiful looking code.

When the difference between 200ms and 300ms is felt by the user (for UI centric applications), this can be a huge improvement to the perceived responsiveness of the application.

Computing the Cumulative sum of a range of elements in a binary tree

Let us assume the alternative definition of a BST (Binary Search Tree), namely: "The left subtree shall contain elements less than or equal to the element at the root and the right subtree shall contain elements greater than or equal to the element at the root".

This means that if we want to compute the cumulative sum of all numbers ≤ to a given number, we need to update some metadata every time a node is added to the tree or removed from the tree (and for a balanced BST also when rebalancing occurs). Fortunately, this implementation of the AVL Tree allows us to do just this by way of hook functions that you can pass to it via the constructor.

We define a function binary_tree_summer that computes the cumulative sum of a node and its child nodes and stores in the sum attribute of the node.

To query the cumulative sum, we define a function query_sum that basically employs the following strategy (given our definition of the BST above):

If the value at the node is ≤ than the value we are looking for, we take the cumulative sum at the node, subtract from it the sum in it's right subtree (since we don't know how many elements in the right subtree will match our query) and then add the result of querying the right subtree
If the value at the node is > the value we are looking for, we are assured that the complete range lies in the node's left subtree, so we query the left subtree

You can see that at every stage, we decide to go either left or right, which means that our runtime is bounded by the height of the tree, which fortunately in case of an AVL Tree is at most O(log n).

The code for such a use-case is shown below:

function cumulative_sum() {
    var items = [ 4, 9, 2, 5, 4, 2, 1, 2, 3, 2, 1, 7, 3, 2 ];

    function binary_tree_summer(node) {
        var _s = (node.left ? node.left.sum : 0) + 
         (node.right ? node.right.sum : 0) + node.value;
        node.sum = _s;
    }

    var tree = new algo.AVLTree(algo.cmp_lt, binary_tree_summer);
    for (var i = 0; i < items.length; ++i) {
        tree.insert(items[i]);
    }

    function query_sum(node, value) {
        // We do the query like we do for a Segment Tree
        if (!node) {
            return 0;
    }

    if (node.value <= value) {
        var sub = node.right ? node.right.sum : 0;
        return node.sum - sub + query_sum(node.right, value);
    }
    else {
        // node.value > value
        return query_sum(node.left, value);
    }
}

console.log("<= 2:", query_sum(tree.root, 2));
console.log("<= 5:", query_sum(tree.root, 5));

Let's look at summing all numbers ≤ 2. The diagram below shows us how the call flow looks like. Note:

The green lines are the paths followed down the root
The red lines are the paths that are rejected (not followed)
The numbers in blue above each node denote the value of the sum attribute for that node

The sum is obtained in the following manner for values ≤ 2: 47-39+4

The diagram below shows us the call flow for summing all numbers ≤ 5.

The sum is obtained in the following manner for values ≤ 5: 47-39+39-25+25-16

Binary Trees with Repeated elements

In school/college, we all learnt about Binary Trees and while we were taught how to implement them, the implicit assumption was that elements would not be repeated. However, in the real-world, we need to deal with repeated elements all the time. Most standard library implementations of popular languages (read Java) provide a TreeMap that is reduced to something very powerless because it doesn't support repeated elements!! The C++ std::multimap class however supports repetitions and is IMHO much more powerful than any other Balanced Binary Tree implementation that I have come across.

Today, I shall make an attempt to de-mistify the process of creating a Binary Tree with repeated elements. It's really simple and I think that the process can be summarized in one sentence: "Instead of the left subtree containing elements strictly less than the element at the root and the right subtree containing elements strictly greater than the element at the root, the left subtree shall contain elements less than or equal to the element at the root and the right subtree shall contain elements greater than or equal to the element at the root". Simple, isn't it!!

Now, let's see why it works.

The simple case of asking whether an element exists or not can be easily implemented the same way as before. If the element is present, we shall hit it while traversing down the tree.

Finding the first instance of an element (if it exists) is called locating the lower_bound of that element. This can be accomplished by running down the left of the node till you have an element at the node that is greater than or equal to the element to be found. We keep a track of the node every time we take a left turn. Othwerise (the element at the node is less than the element we are interested in), we go down the right. Since the lower_bound of an element is never less than the element itself, we don't keep track of nodes when we take a right turn.

Finding one-past the last instance of an elemnt (if it exists) is called locating the upper_bound of that element. This can be accomplished by running down the left of the node till you have an element at the node that is greater than the element to be found. We keep a track of the node every time we take a left turn. Othwerise (the element at the node is less than or equal to the element we are interested in), we go down the right. Since the upper_bound of an element is never less than or equal to the element itself, we don't keep track of nodes when we take a right turn.

On the following pages, can find very high quality implementations of a Red-Black Tree and an AVL Tree respectively that both support repeated elements as well as the lower_bound and upper_bound functions.

No special handling is required during re-balancing the tree during an insertion or a deletion.

Here is a sample instance of the AVL Tree after the following elements have been inserted into it in this same order (the numbers after the - sign are just to disambiguate nodes with the same key while plotting the tree):

4, 9, 2, 5, 4, 2, 1, 2, 3, 2, 1, 7, 3, 2

Saturday, May 07, 2011

Computing the Running Median of an Array of numbers

The problem of computing the running median of an array of numbers involves choosing a window size for which you want to compute the median for. For example, suppose you have the following list of 10 integers:

4, 6, 99, 10, 90, 12, 17, 1, 21, 32

and you want to compute the running median for a window of size 7, you will get the following list of 4 medians:

Median for 4, 6, 99, 10, 90, 12, 17  : 12
Median for 6, 99, 10, 90, 12, 17, 1  : 12
Median for 99, 10, 90, 12, 17, 1, 21 : 17
Median for 10, 90, 12, 17, 1, 21, 32 : 17

The naive way would be to always sort the array and pick the middle element. This costs O(n.log n) per operation (when an integer leaves and enters the window). A slightly smarter solution will try to keep the array sorted and remove and insert the integers in O(n) time. Even though this is better, we hope to do better.

You can find many resources online that try to solve the problem, but surprisingly, very few mention an O(log n) solution (per insertion and removal of an integer), which I believe is quite easy to implement.

I shall first describe a general (data structure independent) procedure for computing the running median of a set of integers and provide other specific tricks that use a balanced binary tree to achieve the same.

General Technique:
The general technique involves keeping the data (we assume that all integers are distinct for the sake of simplicity) partitioned into 3 sets, namely:

All elements less than the median
The median
All elements greater than the median

Set-1 should support the following operations with the below mentioned runtime complexity:

Insertion of an element: O(log n)
Removal of an arbitrary element (given its handle - the handle is returned by the data structure when the element is inserted): O(log n)
Query the maximum valued element: O(log n)

Any operation on Set-2 is pretty trivial to implement, so I'll skip to the next set.

Set-3 should support the following operations with the below mentioned runtime complexity:

Insertion of an element: O(log n)
Removal of an arbitrary element (given its handle - the handle is returned by the data structure when the element is inserted): O(log n)
Query the minimum valued element: O(log n)

What we shall do is initially sort the first 'k' (window size) elements in the stream and create the 3 sets as described above. We also maintain a queue of handles as returned by each of the data structures managing Sets 1 & 3. This will help us remove the oldest element from these sets.

When a new integer is to be inserted, we follow these steps:

Get the handle to the oldest element and remove it from the set in which it currently is
Insert the new element in Set-1 (or Set-3) and push the returned handle into the queue
If the median element goes missing, we pick up the largest element from Set-1 (or the smallest element from Set-3), remove that from the set and set that as the median element
If we notice that the sizes of Sets 1 & 3 are not within 1 of each other, we shift the elements via the median from one set to the other
We always shift the largest element from Set-1 or the smallest element from Set-3
We ensure that the handles get updated whenever the elements move from one set to the other. This requires the handle to be smart and contain a reference to its parent set

If we follow the steps mentioned above, we are ensured that the median is always present in Set-2 (Set-2 always has just 1 element).

A Max-Heap can be used to implement Set-1 and a Min-Heap can be used to implement Set-3. Alternatively, a Min-Max-Heap can be used to implement both. We can also use a balanced binary tree (such as an AVL Tree, Red-Black Tree, AA Tree) to implement Sets 1 & 3.

Balanced Binary Tree Technique:
We can use a Balanced Binary Tree such as an AVL Tree or a Red-Black Tree with order-statistic querying to implement the running median algorithm. Insertions and deletions can be performed in O(log n) time (as before, we maintain a queue of the elements - we do not need handles in this case since a balanced binary tree can delete an element in O(log n) given the element's value). Furthermore, we can query the tree for the WINDOW_SIZE/2 ranked element (which is the median) at every stage. This is by far the simplest and cleanest technique. You can find an implementation of an AVL Tree that supports querying the k-th smallest element (for any k) here.

Example code to use the AVLTree linked above:

var algo = require('algorithm');
var items = [4, 6, 99, 10, 90, 12, 17, 1, 21, 32];

function print_median(tree) {
  console.log("Median:", tree.find_by_rank(Math.ceil(tree.length / 2)));
}

function running_median(items, window_size) {
  var tree = new algo.AVLTree();
  for (var i = 0; i < items.length; ++i) {
    tree.insert(items[i]);
    if (tree.length == window_size) {
      print_median(tree);
      tree.remove(items[i - window_size + 1]);
    }
  }
}

running_median(items, 7);

Monday, April 18, 2011

Bajra Roti/Parantha

Ingredients:

Bajra atta (Pearl millet flour):	2 cups
Water:	Enough to bind the flour (keep 2 cups handy)

Procedure:

Add little water and try to bind the flour. Keep adding a water as and when required
Once the flour is hydrated, add a little more water to make it smooth
Take a plastic sheet and place it on the chakla (flat circular rolling board)
Ensure that you heat the tava (flat skillet) before you begin. Turn the flame to low when it is heated up
Take about 100-150g of dough and place it on the plastic sheet
Sprinkle some millet flour on top of it so that it doesn't stick to your hand
Using your hand, gently press down the dough ball into a flat circular disc - rotating the dough all the time. It should be easy to rotate the dough since the plastic sheet will slide very easily on the rolling board
Keeping the flame on low, place the flattened dough (top side down) on the tava
Once you see small bubbles popping up on the surface, turn the roti over
Heat till you see brown spots on the under-surface
Turn the flame to high and momentary place the roti (turned over again) on the flame directly. This should make the roti balloon. The ballooning of the roti means that it has been cooked properly
Turn over to other side to cook consistently (as desired - but don't burn it please!!)
Serve with dahi (curd), butter, jaggery and vegetables. Enjoy!!

News these days

Being the beneficiary of a headache yesterday, I had the privilege of being unproductive and just fixing myself in front the of television set and watching the daily news with the rest of the family. The news reader almost mechanically read out these 2 news items back to back (in the same segment):

Police firing in The Ratnagiri district of Maharashtra kill 1 and injures 5, and
Indian Ministers oppose the wastage of food at weddings and ask the organisers to provision accurately

It just seems so weird that these 2 news items come together - any way you look at it!! And I'm not even going into whether the people opposing the wastage are themselves practicing what they preach.

Sunday, April 17, 2011

Segment Trees and Bitmaps

Bitmaps are used all the time to keep track of free lists since they are a very compact way of keeping track of true/false information. They are used in the ext2/ext3 file system to keep track of free inode and data blocks, in the bitmapped_allocator to keep track of free memory blocks, and in many many other places.

However, in almost all places, a linear search technique is used to get the first set (or unset) bit. This works very well for small sets of data, but say you have 40 million items that you want to keep track of, you'll land up linear searching 40x10⁶/8/1024/1024 = 4.7 MB of data, which is more than the cache size of most CPUs these days. What this effectively means is that you will land up filling up the CPU cache with the bitmaps, leaving very ittle space for your application's real data

.

This is obviously bad news and we should do something to fix it.

Welcome Trees!!

We take some inspiration from Segment Trees and use a hierarchial structure to store range data that will help us divide-and-conquor the free list in a zippy fashion.

Let's assume that each byte on our fictitious machine can hold just 2 bits. A bit 0 means that the object that that bit represents is free, whereas a bit 1 means that the object that that bit represents is taken (or allocated).

The root node gives us information about the complete range (all the 8 bits in the leaves). The nodes at the 1^st level give us information about 4 bits each, and each of the leaf nodes give us information about 2 bits each (this is a base case since each byte on our machine can hold just 2 bits).

Now, finding a free node is just a simple tree traversal down a path that has 0 in one of the bit positions. While coming up, we set the 0 bits in each level (that we followed down) to a 1 bit ONLY if all the children of that 0 bit are all 1. We unconditionally set the 0 bit in the leaf node to 1 and go up the tree.

Hence, you can see that we have converted a O(n) operation to a O(lg n) operation by increasing the space requirements to 2x the original. Even though the space requirement is 2x the original, the tree based algorithm is much better not just from a time complexity perspective, but is also much more cache friendly. You should expect to have only the first 2 levels entirely in the cache. The remaining levels will come and go as necessary. This is the exact reason why B-Trees are better as an on-disk indexing and retrieval structure.

On a 32-bit machine, we need just a little over 5 levels of the tree to store 40 million bits (34 million bits can be stored completely in just 5 levels).

Update: You can achieve the same runtime complexity of O(lg n) and NO extra space overhead (compared to the linear scan solution) by using a Fenwick Tree instead of a Segment Tree.

Saturday, April 16, 2011

Lazy Logging on Javascript (2)

A few weeks ago, I wrote about how to do lazy logging in Javascript by passing functions to the logging functions. These functions would actually do the CPU intensive evaluation (if needed) and return something that would subsequently be printed on the screen. The key thing to notice is that the item(s) to be printed would be computed ONLY if they were actually going to be printed (using the current logging level and the 1^st parameter to the logging function).

I noticed that a lot of code had started looking quite ugly.

log_it("DEBUG", function() {
  return [ some_var1.toString(), some_var2.toString() ];
});

Note: Any expensive computation that should be performed MUST go inside the toString() method of some_var1 and some_var2.

I have instead decided to solve it by making a function do what the caller was earlier doing:

// toString() Promise
function tsp() {
  var args = arguments;
  return {
    toString: function() {
      return Array.prototype.slice.call(args).map(function(v) {
        return v.toString();
      }).join(' ');
    }
  };
}

As a result of this modification, all code that does logging now looks like this:

log_it("DEBUG", tsp(some_var1, some_var2));

The reason why this works is because log_it() is going to call toString() (if required) on each of its arguments before printing them.

Much neater eh?

Thursday, April 07, 2011

A comment on an email signature

I've added this signature to my outgoing emails at work

n="63686180107683"; n.split('').splice(0,7).map(function() {
   t=n%100; n=n/100; return parseInt(t);
}).reverse().map(function(v) {
   return String.fromCharCode(v+6*6);
}).join('');

and was pleasantly surprised to receive a fairly long and detailed analysis of what Joel Rosario had to say about it.

I wonder how many people have run and tried to understand your signature. I just did. It is devious, and evil.

* What you really want is a loop. Why splice the large string from 0 to 7? It unnecessarily misleads the reader into thinking you're doing a meaningful string operation.

* Why use map with an anonymous function that takes no parameter? This trips up the unwary reader into reading the function, and wondering what it is being applied to.

* The use of n % 100 and n=n/100 seems to show off javascript's perlish ability to treat strings as numbers.

* The above might also confuse the hapless reader. What happens when you cross a string with a number, hmmm?

* By this time, having nearly given up trying to understand what's going on, the reader is hit with reverse(), surprised by a map(), and presented with a function that expects a parameter that is clearly an integer. The arithmetic operations of +6*6 applied to an already mystical parameter v will be too much for any sane person to bear. By now your reader has already succumbed to his fear and fled for a coffee break.

Why lay such traps for the unwary? Why risk the equanimity, the morale, indeed the sanity of your esteemed colleagues? Why couldn't you just have signed your mail with console.log('chat.pw')?

- Joel Rosario.

Needless to say, some of my emails will randomly have console.log("chat.pw") in them ;)

Emails like this make my day; nay week!!

Saturday, March 26, 2011

Lazy Logging on Javascript

One of the things that I like about C/C++ is compile time macros. What they mean is that if I have debug print statements, not only will they be compiled out, but also the expressions passed to the printing statement will not be evaluated.

I have code like this in javascript:

var logging = false;
function log() {
  if (logging)
    console.log.apply(console, arguments);
}

var xml = an_xml_stanza();
log("XML Stanza:", xml.toString());

Even if logging is false, the expression xml.toString() will be evaluated, which can be quite costly in a production setup (I'm talking about node.js and not on a browser).

The way I've solved this is by making sure that my log() function can accept functions as well. Hence, code such as this becomes possible:

var xml = an_xml_stanza();
log(function() {
  return ["XML Stanza:", xml.toString()];
});

The log() function needs to be patched to look like this:

log = function() {
  if (!logging)
    return;

  if (arguments.length == 1 && typeof arguments[0] == "function")
    arguments = arguments[0]();

  console.log.apply(console, arguments);
}

This essentially means that you've gotten rid of the runtime cost of calling xml.toString() when logging is disabled.

Monday, March 21, 2011

Node and the Proxy/Decorator Pattern

The proxy pattern.
The decorator pattern.

Proxies & Decorators are patterns that abstract away or enhance certain functionality in an entiry.

Since they are both almost the same, it isn't very helpful to know the finer differences between the two, but here they are just for completeness.

Example of a Proxy:
When you proxy some functionality, you generally delegate the responsibility to someone else. The most practical example is an HTTP Proxy that forwards requests on your behalf to the real web-server.

Example of a Decorator:
When you decorate some functionality, you generally add more functionality to it. For this example, I shall discuss the code that is presented below. It is basically an EventEmitter that transparently decompresses the string that is passed to it and re-raises the events that the original EventEmitter raised. You can use it to transparently read a compressed/secure stream given that you have a decorated EventEmitter for it.

File: gzip.js

var compress = require('compress');

function _inflater(stream) {
 this._stream = stream;
 var self = this;
 var gunzip = new compress.Gunzip;
 gunzip.init();

 this._stream.on('data', function(d) {
  var _d = gunzip.inflate(d.toString('binary'), 'binary');
  self.emit('data', _d);
 })
 .on('end', function() {
  self.emit('end');
 })
 .on('error', function() {
  var args = Array.prototype.splice.call(arguments, 0);
  args.unshift('error');
  self.emit.apply(self, args);
 });
}

_inflater.prototype = new process.EventEmitter();


function _deflater(stream) {
 this._stream = stream;
 var self = this;
 var gzip = new compress.Gzip;
 gzip.init();

 this._stream.on('data', function(d) {
  var _d = gzip.deflate(d.toString('binary'), 'binary');
  self.emit('data', _d);
 })
 .on('end', function() {
  self.emit('end');
 })
 .on('error', function() {
  var args = Array.prototype.splice.call(arguments, 0);
  args.unshift('error');
  self.emit.apply(self, args);
 });
}

_deflater.prototype = new process.EventEmitter();



exports.inflater = function(stream) {
 return new _inflater(stream);
};

exports.deflater = function(stream) {
 return new _deflater(stream);
};

File: test.js

var gz   = require("./gzip.js");
var http = require("http");

var req = http.request({
 host: "duckduckgo.com", 
 port: 80, 
 path: "/"
}, function(response) {
 console.log("response headers:", response.headers);

 //
 // We check if we need to create a proxy event emitter that
 // inflates data on the go.
 //
 if (response.headers['content-encoding'].search('gzip') != -1) {
  //
  // The response object is now the proxy object. A similar 
  // technique is used to support TLS encrypted sockets too.
  // We just wrap up the original EventEmitter instance in 
  // a proxy instance that does its magic.
  //
  response = gz.inflater(response);
 }

 response.on('data', function(d) {
  console.log("Got:", d.toString());
 });
});

req.setHeader("Accept-Encoding", "gzip");
req.end();

Sunday, March 20, 2011

Issue list in code

A few weeks ago, there was a debate/argument/discussion about what issue tracking tool should be used for our project. The 2 main contenders were JIRA and Basecamp. I'll try to capture the essence of the discussion.

Things going for JIRA:

Very customizable and configurable
Offers many workflows and views
Lots of plugins
Easy for manager to generate reports (good reporting features)

Things going against JIRA:

Not the most convenient to use for developers
Too much (unnecessary) detail to be filled in for each task at times

Things going for Basecamp:

Easy to use. Clean interface. More like a todo list
Developers like it

Things going against Basecamp:

No easy custom report generation facility
Not reporting/manager friendly

We've decided to try both out for a few months and see which one "works out" for us. We were using JIRA before, but the enthusiasm seems to have died out after a few months and the issue tracker is not in sync. with the actual issues/reports/bugs/features being developed.

I have an entirely different point of view though. For me what has worked best is comments in code that say something like:

// TODO: Such and such is broken. It needs to be fixed in so and so way.

The reason is that it's too much tedium for me to update things in 2 places (code and issue tracker) when I fix an issue. If I forget, then it's even harder to related the issue to the actual change made. This method offers a way of alleviating that problem.

I'm also okay with having comments that are verbose just so that they can explain in sufficient detail the reason (or purpose) of the code below.

However, this has been mostly for projects where I have been the sole developer. I don't know how well it would scale for projects with 2-3 developers, 5-6 developers and > 6 developers.

I would guess that it's fine for projects with up to 6 developers, but beyond that it would get cumbersome. Also, this way of tracking issues lacks a good reporting method which allows you to track changes across versions. Of course, if you want to adopt this as a workflow, I'm sure some tooling around this general idea can be developed which works by generating diffs between revisions to track which bugs this revision (or set of revisions) fixes.

Saturday, March 19, 2011

node-xmpp-bosh: A BOSH (connection manager) session manager for the XMPP protocol

You can find out all about what BOSH is from XEP-0124 and XEP-0206.

Most respectable XMPP servers out there such as Openfire, Ejabberd, and Tigase support BOSH as part of the server.

~~This is where you can download the node-xmpp-bosh BOSH server from.~~
node-xmpp-bosh has a new home at github!

I'll just discuss some of the reasons that it's good to have a BOSH session manager running outside of the XMPP server.

Easier to scale the BOSH server and the XMPP server independently (independently being the key here). See my previous blog post for a detailed explanation
You can support users from multiple domains such as gmail.com, chat.facebook.com, jabber.org, pandion.im, etc... using the single BOSH server
Any customizations to the BOSH server are easier made if it is running out of process (restart it independently of your XMPP server - since getting an XMPP server warmed may be more expensive than getting a BOSH server warmed up)
Use protocols other than XMPP at the back, but still retain the same interface as far as clients are concerned. This will still require you to use XMPP for the actual communication (Note: This can be accomplished via transports [for yahoo, msn, aim, etc...] available for any XMPP server [that supports XEP-0114] as a Jabber Component). But, if you please, you can also write a small drop-in replacement server that speaks basic XMPP, but does something entirely bizarre at the back!!

Scale out with services - Scale the services separately

Whatever I am writing about here has been taught to me by this brilliant write up that I quote from the DBSlayer project page.

"The typical LAMP strategy for scaling up data-driven applications is to replicate slave databases to every web server, but this approach can hit scaling limitations for high-volume websites, where processes can overwhelm their given backend DB's connection limits. Quite frankly, we wanted to scale the front-end webservers and backend database servers separately without having to coordinate them. We also needed a way to flexibly reconfigure where our backend databases were located and which applications used them without resorting to tricks of DNS or other such "load-balancing" hacks."

What this means is that it is better (from a scaling perspective) to split your application into services rather than have them run as one monolothic application on some very powerful machine.

Why is it better to do so? That's because each "service" may have different performance requirements and characteristics which can be better addressed in isolation if the service is running as a separate entity.

For example, consider a typical web-stack which consists of:

Load balancer (nginx)
Web server/script execution engine (apache/php)
Data Store (mysql/pgsql)
Caching service (redis/memcached)
Asynchronous processing infrastructure (RabbitMQ)

For a medium to high load web site (a few million hits/day => about 20-30 req/sec), you could make do with just a single instance of nginx, 2 machines running apache/php, 2 machines running MySQL and one machine running RabbitMQ. Even for this deployment, you can see that each of the components have different requirements as far as the machine (and hardware usage) characteristics are concerned. For example,

nginx is network I/O heavy. You could deploy it on a machine with a modest (1GB) amount of RAM, no hard disk space, and not a very fast CPU
The Apache/php servers on the other hand would need more RAM as well as more CPU, but no disk space
The MySQL nodes would need a lot of RAM, CPU as well as fast disks
The node running RabbitMQ (a message queue) could again comfortably run on a machine with a configuration similar to nginx (assuming that data is stored on MySQL)

Thus, we have sliced our stack into components we can club together not just based on function, but also based on the charasteristics of the hardware that they would best be able to run on. Aside: This reminds me of Principal Component Analysis.

Node.js as "the next BIG thing".

I generally hate to talk about "the next BIG thing" because there seem to be (relatively) fewer things (as compared to all the things that we encounter) that influence our lives in significant ways these days (simply because we encounter so many things these days).

However, I feel that node.js is going to be quite influential in the years to come.

So, what is "node.js"? Node (put briefly) is a javascript execution engine. It incorporates asynchronous I/O as part of its design and hence can be considered asynchronous by design. This is where it starts getting fascinating for me.

Do you remember all those $.ajax() requests you made in the browser? Well, what if you could so the same on the server side? What if stitching together a page on the server was as easy as making AJAX calls and filling in empty <div> tags? You needn't stretch your imagination too much because that's exactly what Node lets you do!!

Even more exciting is that fact that this asynchronous API isn't limited to just HTTP/AJAX calls, but extends to protocols such as SMTP, POP, XMPP, and Database handlers such as PGSQL, MySQL, and Sqlite. This is because the Socket & file system I/O on Node is asynchronous by design. (almost??) All the APIs which do I/O are asynchronous in nature. For some, blocking counterparts are provided solely for convenience.

Toy task: Make 10 HTTP web calls and present the output as the result of each of these 10 calls separated by a new line. Assume that each HTTP call takes 1 sec to produce the result.

Sample code (in your favourite programming language; using blocking I/O):

out = "";
for i = 1 to 10:
 # We assume that http_call blocks till the fetch completes.
 response = http_call(url);
 out += (response + "\n");
write_output(out);

You notice that the code takes 10 sec to run - 1 sec for each HTTP web call. We ignore the time taken to write out the results.

Sample code (using non-blocking I/O):

out = "";
res [10] = [ ];
for i = 1 to 10:
 # We assume that http_call does NOT block and retiurns immediately, 
 # processing the task in the background (another thread or using
 # non-blocking I/O)
 res[i] = http_call(url);

for i = 1 to 10:
 # We assume that the when statement blocks execution till the 
 # condition is met
 when res[i].complete:
  out += (res[i].data + "\n");

write_output(out);

You notice that the code takes 1 sec to run - 1 sec for each HTTP web call; all of which fire at the same time. We ignore the time taken to write out the results.

That's a phenominal increase of 10x from the blocking version. Now, why wouldn't anyone use Node for such a use case??

Tuesday, March 08, 2011

jQuery stack == evolution

So, twitter was kinda cramped, so I'm writing this here. A bunch of context around this. @addyosmani posted a tweet/link:

"Tools For jQuery Application Architecture – The Printable Chart http://bit.ly/fryIKW"

which I re-tweeted and to which @rakesh314 replied with this tweet/link:

"@addyosmani http://t.co/xWddsTK /cc: @dhruvbird"

So, there's one thing I strongly believe, which is that the jQuery stack seems to be a product of evolution, and evolution is generally not wrong - but that's a separate debate. Lots of people have tried different things and jQuery has evolved to permit all of them to co-exist in (relative) harmony.

The LAMP stack was not THE LAMP STACK before people realized these set of tools that were always used together. It was only after the fact that they decided to bundle them. Different Linux distros. are healthy in the sense that they package what they think their users want.

Similarly, I feel that the jQuery stack is progressing in that general direction and I won't be surprised to see vendors releasing their "version" or "distro" of the jQuery stack.

I'm not as much "against" dodo as I am "for" jQuery.

Sunday, March 06, 2011

HTTPS for all browsing

I've been thinking about using HTTPS for everything (even sites that don't support HTTPS) by routing all my traffic through a proxy that makes connections to the actual site (possibly on HTTP). This at least secures my traffic from packet sniffing on the local LAN.

What this means is that if you are at school or office, no colleague can run Wireshark or TCPDump on the the local LAN and capture/sniff your traffic. Also, you can now safely browse the web over http on insecure/potentially sniffed networks such as stray wireless networks without having to worry about your data being compromised! Welcome starbucks internet :-p

Traditionally, if the browser connects directly to a public proxy, then HTTP traffic still goes unencrypted (to the best of my understanding). Hence, this is what I've thought of doing.

Set up a local proxy on the same machine, which connects to a remote proxy over HTTPS.
Ensure that the remote proxy is running on a safe/trusted network (it could be your home PC if you want to use insecure wireless networks securely)
This remote proxy can now make HTTP connections and the issue of local packet sniffing is resolved.
However, it doesn't prevent remote packet sniffing (on the network where the remote proxy resides), which is why it is important to have the remote proxy sitting on a secure network.

If you are seriously planning to use this proxy, and you aren't yet using HTTPS Everywhere, I would strongly suggest that you start using it since it will reduce the load on the proxy and is more secure (since the encryption is end-to-end and not proxy-to-end).

Mamma says that there shall be a day when browsers pop up a warning when you view an http based page (as opposed to an https based one).

Update: You can grab the code for this proxy here

Sunday, February 06, 2011

Attitude - and a bit of tea!!

Yesterday, I had gone to have Chai (tea) and Brun Pav at Yazdani Bakery. When I was paying the owner (Rashid Zend in the photograph on the previous link) the amount due, he asked one of the waiter to bring a half eaten pav (bread) that a customer seemed to have left behind. He picked it up from the plate, examined it carefully and asked the waiter (in Hindi) "Why did he leave this uneaten?" I was quite taken by surprise by what he had just asked, but was also very very excited about having eaten at a place where the owner himself took such a keen interest in produce and tried to rectify even a single small error that may have occurred on his (bakery's) part.

Now, if we (as programmers) take every customer feedback this seriously and act upon it, I am fairly sure that the quality of the final product will go up by leaps and bounds!!

Monday, January 31, 2011

Fixing javascript variable leaks

Today I happended to have a look at my application's window object and was shocked to see that 'i' was defined on it! I doggedly started searching my code to see where the leak had perpetrated itself from. After an hour of debugging, which included using Google's online closure compiler (Thanks Sandy!!) and jslint, I came up with nothing.

It struck me that it could be some of the js-libraries that I was using that were causing the leak, and not my code. I tried doing a binary-search on the js-includes, but that didn't work since the code wouldn't even run without most of those includes. At this point I really didn't know how to even go about trying to fix it!!

A few hacky thoughts later, I decided to introduce this code in my application:

var _iv = setInterval(function() {
  if (typeof window.i != "undefined") {
    clearInterval(_iv);
  }
}, 10);

Of course, this relied on the fact that I had inserted a verbiage of console.log() statements in my code at various points (It actually helped to have them!!)

This bit of hackery at least helped me determine that the error happened after a certain set of statements (functions) had executed. From here, it was manual work, checking every function and callback (yes) in my code as well as any library that I was using.

After about 30 minutes, I was able to trace the leak to embed.ly's jQuery library. I also discovered that the fb-complete jQuery library I was using had this leak, but it wasn't the one responsible for the specific instance of the variable 'i' that I was seeing then.

Sunday, January 30, 2011

When to use events & when to use callbacks?

With the recent spate of javascript programming that I've been subject to, I've found myself asking this question quite often to myself "Should I use callbacks here or should I use events (jQuery trigger & bind)?"

I've found this initial approximation quite useful till now: "If the operation that will follow depends upon the exact instance of this operation, then wrap the state in a closure and use a callback. Otherwise, use events." The advantage of using events is that:
1. You need not think of all the actions that could result in this event handler being called.
2. You need not tell anyone that this particular event handler needs to be invoked. Everyone just publishes some data (like a tweet) and all those interested (followers) just gobble that data up.
3. You can dynamically attach event handlers. So, if some component in the system suddenly decided to react to certain events, it can just plonk itself into to the listeners queue for that event.

jQuery goes one step further and allows you to trigger (and bind) events on specific DOM nodes only!! And then it goes a step further and lets you define an event namespace to try and prevent event name conflicts.

Tuesday, January 18, 2011

SQL Query optimization - Part-8 (Indexes Schmindexes)

The "Why are my queries slow despite indexes on all columns" Edition.

Consider the following table:

CREATE TABLE florist_orders(
 order_id INT NOT NULL PRIMARY KEY AUTO_INCREMENT, 
 basket_color INT (1) NOT NULL, -- This can be either 0, 1, 2 or 3
 num_roses INT NOT NULL, 
 num_sunflowers INT NOT NULL, 
 num_daisies INT NOT NULL, 
 processed INT(1) NOT NULL DEFAULT 0, 
 created TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Let's assume that the packaging for every basket of a different color is done by a diffrent person. Additionally, baskets packaging also differs based on the maximum number of flowers of each type. If there is more than 1 of any type, the packaging differs from when there is at most 1 of each type of flower. Each packager wants to check the orders that she needs to process.

The following query works fine:

SELECT * FROM florist_orders
WHERE
 basked_color = 2 AND
 processed = 0 AND
 (num_roses > 1 OR num_sunflowers > 1 OR num_daisies > 1)
ORDER BY created ASC -- Process older orders first to avoid starvation
;

However, as the size of the florist_orders tables gorws, you realize that each query is taking too long. Why is this? You have an index on every column and yet queries are slow?

You decide to create composite indexes. As a first attempt, you create an index on

(basket_color, processed, num_roses, num_sunflowers, num_daisies)

but notice that it hasn't really affected the query's execution time. Additionally, EXPLAIN says that no indexes are being used!! What to do now?

To solve this issue, you would need to modify your table and make it friendly for this specific query. These are the steps you would need to take:

Add a column named num_flowers which is set to MAX(num_roses, num_sunflowers, num_daisies) on every INSERT and UPDATE via TRIGGERS.
Then, create an index on:
```
(basket_color, processed, num_flowers)
```

Modify your query to read:

SELECT * FROM florist_orders
WHERE
 basked_color = 2 AND
 processed = 0 AND
 num_flowers > 1
ORDER BY created ASC -- Process older orders first to avoid starvation
;

This should pretty much fix any speed issues with the aforementioned query and ensure that all queries remain responsive. This trick will work ONLY if all the flowers are being checked to be greater than the SAME number.

Tuesday, January 04, 2011

Firefox, I love you!!

The day was saved by Firefox's page cache. I typed in a reply on a forum thread, clicked "post reply" and then closed the tab, only to realize that I wasn't logged in and that a login dialog had popped up just as I had closed the tab. I right-clicked the tab-list and clicked "Undo Closed Tab" and poof!! the complete tab with all its state was there, awaiting my credentials!!

Awesome stuff Firefox... I love you!!

Saturday, January 01, 2011

Peer-2-Peer systems at their best!!

Recently, the skype network went down. You can read all about it here.

However, the exciting take away from this is (quoting from the link above) "In order to restore Skype functionality, the Skype engineering and operations team introduced hundreds of instances of the Skype software into the P2P network to act as dedicated supernodes, which we nick-named “mega-supernodes,” to provide enough temporary supernode capacity to accelerate the recovery of the peer-to-peer cloud."

Which is exactly my idea of how a self-healing peer-2-peer system should behave. Introducing more nodes into the mix should ideally help solve scaling and availability issues, which is exactly what the Skype team did. I would like to congratulate them for that!!