Some thoughts on your (and my) thoughts.

Saturday, October 26, 2013

Javascript as a language is really darn good

If you look at computer languages as tools to help you convert thought to code (ignoring other real-world practical considerations such as efficiency, developer availability, platform pervasiveness, library availability, etc...) then my opinion is that Javascript (in the form of node.js as we know it today) really stands out. Let me explain why I feel so.

Javascript is a small and minimal language with few (redundant) constructs and can be taught and remembered easily. This is useful for someone who is trying to use javascript to write code as well as someone who is trying to read javascript code. When you are expressing your ideas, you want to know what tools you have and want to be able to fit them in your head. At the same time, when you (or someone else) (re)visit(s) your code, you want the reverse transfer of information (i.e. code to idea) to be quick and unambiguous and the reader shouldn't get confused trying to figure out the language syntax and semantics. To explain,
- The fact that functions are first-class objects and that the lambda syntax isn't different (or otherwise special) compared to the standard function definition syntax helps.
- As does the fact that objects/maps (hashes), arrays, etc... all have the same get/set syntax.
There's very little fluff in the language. Most keywords don't feel like they're useless. The only ones I find excessive are 'new', 'prototype', and 'function'. Coffeescript solves this to some extent.

What this means is that it's very easy to get started with your first program once you have an idea of what it is that you want to express in code. Javascript lends itself well to elegant code.

Since there is no main() function in a node.js script, you don't need to encode all your executable code in a main() function. You won't believe how incredibly useful I find this even though I have programmed in C and am used to that style of programming. This also means that each file can have its own test() function that is run by just invoking:
$ node file_name.js.

Declaring arrays and objects (maps) in javascript is really easy, and anything that can be reasonable done in 1 line is done in one line. For example, declaring an array a that has 4 elements, 44, 31, 21, and 23 is:
var a = [ 44, 31, 21, 23 ];

Declaring an object o that maps key 'name' to 'blogger', key 'url' to 'http://blogger.com/', and key 'age' to 5 is as simple as:

var o = {
  'name': 'blogger',
  'url': 'http://blogger.com/',
  'age': 5
};

Notice how values in maps can be of different types. This is also true of elements in an array.

Converting a string (or any other type) to a string isn't rocket science either.
var sint = String(89);

To test my belief, I've started writing a toy regular expression parser and evaluation engine in javascript that will have some intentional holes (un-impelemented features) that you can fill up later so that you can get a feel for both how regular expressions work, as well as how javascript as a language handles.

Sunday, April 21, 2013

Inside the guts of Kadane's algorithm OR Understanding the Maximum Sum Subarray Problem

Kadane's algorithm is used to solve the 1 dimensional Maximum Sum Sub-array Problem in computer science.

Let's try to understand how it really works. If we are given the problem of finding the maximum sum sub-array of a given array, the first native approach we can try is the O(n²) algorithm of starting at every array index and computing the sum from that index to every index after it. This works, and gives us the correct answer, but we should ask ourselves if we can exploit certain properties that the problem might have to try and speed up the solution.

Let's try and dig deeper into an example.

Consider the following array & its corresponding cumulative sum subarray:

Element	10	-5	-2	7	1	-5	-3	2	4	-3	6	-21	5	-2	1
Cumulative Sum	10	5	3	10	11	6	3	5	9	6	12	-9	-4	-6	-5

Some observations:

A maximum sum sub-array can never have a negative number as one of the elements at the end-points, except of course if every element in the array is a negative number. If a maximum sum sub-array had a negative number on one of its end-points, we could remove that element and increase the value of the sum, thus getting a sub-array with a larger sum. Conversely, a maximum sum sub-array always starts and ends with a non-negative number, unless of course all the numbers in the array are negative.

We can always clump all negative and non-negative numbers together since once we encounter a negative number, the sum always drops and once we encounter a non-negative number, the sum never drops.

If the running sum of a sub-array array ever falls below zero, no solution will ever include the negative number that caused this sum to fall below zero since it over-powers the positive sum accumulated before it. Note: We only speak of 1 negative number because of the clumping point above.

An extension of the point above implies that the new running sum that begins once a cumulative sum falls below zero always starts from the immediately following non-negative number.

Applying the rules above, we get the following recurrence:

MaxSum[i] is the Maximum Sum ending at index i.

MaxSum[i] = Array[i]                if i = 0
MaxSum[i] = MaxSum[i-1] + Array[i]  if MaxSum[i-1] + Array[i] >= 0
MaxSum[i] = 0                       if MaxSum[i-1] + Array[i] < 0

The re-written array (which now consists strictly of alternating negative and non-negative numbers) and the new cumulative sum sub-array are (the row "Cumulative Sum" below represents the MaxSum variable above):

Index	1	2	3	4	5	6	7	8	9	10	11
Element	10	-7	8	-8	6	-3	6	-21	5	-2	1
Cumulative Sum	10	3	11	3	9	6	12	0	5	3	4

Sunday, March 17, 2013

How to give a talk/presentation - Wisdom from Dr. Bender

(The content of this post is mostly ~~stolen~~ inspired by this post by my friend Deepak.)

This Spring, I took a course called CSE638 (Advanced Algorithms) under Professor Michael Bender. This was a mostly research oriented course, with a focus on doctoral students. An important part of doing Academic Research is talking/presenting your material in conferences and seminars. Professor Bender spent a couple of classes discussing this, and I thought I'll list down the stuff he mentioned - more of a note to myself.

Make the talk prefix-competitive
If somebody dozes off, the person should have got the best part of your presentation/content before he/she dozed off. Bring the best part of your talk to the front. Make it prefix-competitive.
Stand in front of the screen, Smile :)
Do not stand away from the screen, in some corner behind the podium. Stand in front of the screen. Let the Projector's display glow on your face. Stand confidently in front of the screen - okay now, don't block the screen, but stand to the edge of the screen, and Smile while you speak. Convey enthusiasm. "People like seeing faces". Facebook has become popular for a reason ;)
Diagrams and Figures!
Have plenty of diagrams and figures on each slide.
Refer to every slide & everything on each slide
Refer to each and every item on your slide. If there is a diagram, explain everything. If you're not going to explain something, refer to it and say that you're not going to explain it. Either way, please do refer to everything on your slides!
Touch the screen, point to it
Don't be afraid of the screen. Touch it. While explaining the diagrams and figures, stand over the screen and explain stuff by physically touching the screen with your fingers/hands. That very act conveys confidence.
Results Up Front
If your talk is about the results, let the results be as far upfront as possible. Build the context as soon as possible, and announce what you did.
Explain the Title
Spend time explaining the title. And the background. This might seem to conflict with point-4, but it's not. That's the trick.
Give credit wherever possible, and do this in the beginning
When giving credit to others, please try to do this when you start rather than when you end. People like hearing about other people. Did I mention something about facebook earlier?
Explain why the problem is important
This I feel is one of the biggest take-aways from the talk. Even if the audience doesn't understand the solution, they should understand why we need a solution in the first place.
Make use of plots effectively
Explain the axes, and know what to plot. Refer to this post by Gaurav to know more about what this means.
Know your audience
For example, presenting the workings of a toaster to a homemaker is different from presenting it to an electrical engineer. You'll need to motivate the problem differently in both cases. Same goes with presenting it to someone who has had a toast before v/s someone who hasn't. I find that Dr. Dan Gusfield does a brilliant job of motivating problems before presenting the solutions (related to point-7).
There are a couple more about Jokes, Color Schemes, etc. which I can't recall.

There now, nobody can stop you from giving an awesome talk!

PS - If you're from Stony Brook, and find that I've missed something, feel free to write in the comments. Thanks!

Update: Found this article online.

Thursday, March 14, 2013

PMA: The Packed Memory Array

Thanks to Deepak & Gaurav for proof-reading this post and providing useful feedback!

The Packed Memory Array (pdf) is an interesting structure to say the least, since it keeps its elements almost contiguously (and in sorted order) in memory with no additional indirection via pointers like a usual array and still allows one to perform fast - O(log²N) - updates to the array.

We know that inserting elements in sorted order into an array without keeping gaps between consecutive elements costs O(n) per insert, whereas searching for an element can be done using binary search with a cost of O(log n). These are tight upper bounds, but the story is a little different for randomly arranged data. If one is inserting random data into a sorted array with gaps being maintained between consecutive elements, the expected time to insert a single element magically falls to O(log n)! Now, what just happened here? To find out, read more in Bender, Colton, and Mosteiro's paper titled Insertion Sort is O(n log n) (pdf). On the other hand, if we don't permit gaps between elements, even for random data being inserted, the amortized cost for inserting n elements into an array in sorted order is O(n²) - why? (hint: Refer to the expected case analysis of quick-sort).

The simple idea is to not pack all elements together, but to maintain some gap between consecutive elements. We shall see that if we follow this simple idea, then the cost for insertion falls to O(log²n) amortized worst-case. This is the packed-memory-array (PMA). We however need to formalize the idea a bit and set some rules of the game before we get ahead of ourselves.

We'll start off by assuming that we already have a PMA that holds n valid elements. One of the invariants we have for the PMA is that it should be more than 0.25x full (this is called the fullness threshold). i.e. If the PMA has space for 4n elements, then there should be at least n actual elements in the PMA. Any less and we should re-size the PMA to have space for 2n (not n) elements (this is also part of the fullness threshold). The reason we maintain extra space in the PMA is so that we can re-balance and that re-balances involving a lot of elements won't happen too frequently.

Let's just focus on insertions for now. The PMA is organized as a contiguous array of slots which might be used or free. Conceptually, we break this array of size N into N/log N blocks, with each block holding log N elements. We'll see why this is helpful. If we look at a PMA as being made up of such blocks of size log N each, then we can view the PMA as binary tree (conceptually) with each level having different fullness thresholds.

The algorithm for inserting elements relies heavily on the upper density threshold whereas the algorithm for deleting elements relies heavily on the lower density thresholds. For the sake of brevity, I shall only discuss insertion (not deletion).

Algorithm: When we insert elements into the PMA, we follow these steps:

Locate the position to insert the element into. We either know this before-hand or we perform a binary search which costs O(log²N).
If the cell we want to insert into is free, we just add the element, and mark the cell as used. We are done!
If however, the cell is used, we compute the upper density threshold for the smallest block (of size log N) that the desired cell falls within, and check if the upper density threshold would be violated. If we notice that there is no violation, we just re-balance all elements including the new one into that block. We are done. If we violate the upper density threshold, we consider a block twice as large (which includes the cell we will be inserting into) and check if the density threshold is violated. We repeatedly move up till we find a chunk for which the upper density threshold is not violated.
If we fail to find such a chunk, we just allocate an array twice as large and neatly copy all the existing elements into the new array with constant sized gaps between elements!

Analysis: We analyze the cost to insert an element into the PMA.

Pre-requisite: Weight balance for fun and profit

The upper (and lower) density thresholds are arranged so that they grow arithmetically from the top (root) level to the bottom (leaf) level.
The difference in density thresholds is 0.5, and we have log N levels, so if we want to maintain a constant arithmetic difference between levels, there must be a difference of 0.5/log N between each level. This is a difference of Δ = O(1/log N) between each level.
Q. What is number of elements that should be inserted at a certain level to bring it out of balance?
A. Clearly, if a level has room for N elements, and it is out of balance then that could have happened only if it went from being in balance to now out of balance, which means that O(ΔN) elements were inserted into this level.
Q. What is the number of element moves we need to bring a level back into balance?
A. If a level is out of balance, we typically go up till a level within density thresholds is located and re-balance it. Ideally, going up one level should do the trick, so to re-balance a level containing N elements, Θ(N) operations are sufficient.
Therefore, the amortized cost to re-balance a level is O(N / ΔN) = O(log N).
However, we must not forget that an element insertion affects the thresholds of O(log N) levels, which means that the actual cost for insertion is O(log²N).

You can also read about the analysis in section-4 of another paper.

Q1. What if we use space proportional to Θ(N^c) (assume c=2) to store N elements? What happens to the running time for insert?

A1. Well, it just goes over the roof since you're now going to be moving elements across a lot of unused cells while you re-balance the array. Additionally, you'll also need to to adjust your level density thresholds to not be arithmetically increasing, but geometrically increasing. Instead, if you use tagging and maintain elements as tags and pointers to actual values, you can get better running times if the tag space is polynomial (N^c) in the number of elements in the structure.

Q2. Is the PMA a cache-oblivious data structure?

A2. The PMA is Cache Oblivious, and is used as a building block in other more complex external memory data structures such as the Cache-Oblivious B-Tree.

Implementation: You can find a sample implementation of the PMA here.

Friday, March 08, 2013

Weight Balance for fun and profit

According to Dr. Bender, Weight Balance (different from the wikipedia article on weight-balanced tree) is one of the most important topics in Computer Science, and we were lucky enough to learn it from him! Others might not be so lucky, so let me try and do a job that's hopefully a constant factor (or fraction) as good - since it won't matter because all we care about are the asymptotics.

Pre-requisite: This note on Amortized Analysis.

We'll use weight balance to implement a dictionary structure and examine how the guts of one such structure, the weight-balanced tree work.

A dictionary data structure is one that supports the following dictionary data structure operations:

Insert
Delete
Find
Predecessor
Successor

Points 4 & 5 are what distinguish a dictionary data structure from a keyed data structure such as a Hash Table.

Now, you might have heard of the following data structures that (efficiently) support the operations mentioned above:

It would surprise (or maybe not) you to know that both these structures work on the principle (guess) of weight-balance!!

So what exactly do we mean when we talk about the weight of a sub-tree in a BST? Well, as it turns out, the weight of the sub-tree in a BST is just the count of the number of nodes in the sub-tree rooted at that node (including the node itself).

For example, the following tree (image courtesy wikipedia) has a weight of 9

and the sub-tree rooted at node 10 has a weight of 3.

A weight-balanced tree rooted at node u is one in which (either):

The weights of the left and right children of a sub-tree are within constant factors of each other:
weight(Left-Child(u)) + 1 = Θ(weight(Right-Child(u) + 1)
Note that the +1 is important for pedantic reasons as far as the order-notation is concerned.
OR
The weights of the left and right children of a sub-tree are within constant factors of the weight of the complete sub-tree
weight(Left-Child(u)) + 1 = Θ(weight(u) + 1) AND
weight(Right-Child(u)) + 1 = Θ(weight(u) + 1)

It turns out that both these definitions are equivalent.

More realistically, if we stick to the second definition, we have:
weight(Child(u)) + 1 ≥ 0.25 * (weight(u) + 1) AND
weight(Child(u)) + 1 ≤ 0.75 * (weight(u) + 1)
where, Child(u) denotes both the left & right child of u.

For example, if we consider the following example tree, which is clearly out of weight-balance (don't ask me how we got there because this example is made-up), we re-balance it to be perfectly in balance (if we have an odd number of nodes or almost perfectly balanced otherwise).

We should be careful about how we re-balance these N nodes, because if the cost is any worse than Θ(N), then we won't get the update costs that we desire. The easiest way to perform the re-balance with a cost of Θ(N) is to perform an in-order traversal of the subtree rooted at node u, and write out the sorted nodes to an array. We can then re-create a perfectly balanced BST from that array either using recursion or the Day–Stout–Warren algorithm.

This is where the fun starts!!

Q 1. How many nodes need to be inserted under a sub-tree rooted at node v to bring the sub-tree out of balance (assuming it is perfectly balanced to start off with)? Let's assume that the sub-tree originally contains N nodes.
A. You need to insert some constant fraction of the weight of that sub-tree! which is Ω(N).

Q 2. What is the cost to rebalance the sub-tree rooted at node v if we know that that sub-tree has a weight of N?
A. Well, we already answered this above. The answer is Θ(N).

Q 3. How many sub-trees potentially go out of balance when you insert a node?
A. We know that a node is inserted at the leaf level, so potentially all the sub-trees that are rooted at the nodes on the leaf-to-root path with the newly inserted node as the leaf node can potentially go out of balance. This happens to be Θ(log N) nodes.

∴ the amortized cost to insert a new node into the balanced tree is:
Ω(N)/Θ(N) * Θ(log N) = Θ(log N).

Now, that's a fairly straight-forward algorithm to get the same (amortized) costs as the worst-case costs for updates with a more complicated beast such as an RB-Tree or an AVL-Tree. Though, I feel that Treaps are much simpler to implement.

Tuesday, March 05, 2013

Amortized Analysis or: How I learned to stop worrying and love averages

Amortized Analysis is usually seen as a pretty scary term, and I've seen a lot of people confuse it with Average Case Analysis, but let's try and de-mistyfy it, one step at a time.

We've performed amortized analysis at some point or another in our lives without actually knowing it. A few lame examples follow:

For the stock broker: Purchasing shares at a lower price to average out the cost price of all the holdings of a given stock
For the fitness enthusiast: Working out thrice as much at the gym today because [s]he missed 2 days before today
For the reader: Reading a few more pages of a book so that you can take a break tomorrow and still complete it on time

There are situations where amortized analysis doesn't work too. For example, suppose you're in charge of cooking for your family, you can't say that you'll cook thrice as much on the 3rd day from today and not cook at all between now and then. Sorry, but sometimes it just doesn't work!! These are cases where you must use worst-case bounds.

Let's try some exercise and learn by example!

Ex-1: You're jogging 16 miles every day for 8 days, and your friend jogs 8 miles and 24 miles on every odd and even numbered day respectively (starting from day #1). Who jogs more over a period of 8 days? Here is a graphical representation of how much you and your friend ran over a period of 8 days:

Ex-2:You're jogging 16 miles every day for 7 days, and your friend jogs in the following manner:

Day Miles Jogged

1 2

2 2

3 4

4 8

5 16

6 32

7 64

Who jogs more over a period of 7 days?

Ex-3: You're playing a game where you have a graph and you start at the node with the symbol S and finish at the node with the symbol F. The constraints on your moves are that you must take EXACTLY ONE blue coloured edge in every move, but you can take as many (or zero) red coloured edges in a move. A move contains a combination of red and blue edges.
An example graph is shown here: One trace of a successful completion of a game that visited 11 blue edges and 4 red edges is shown here: Can you identify the maximum number of edges of any colour that will ever be visited if I tell you that exactly 107 blue edges were taken in a particular successful completion of the game?

Day	Miles Jogged
1	2
2	2
3	4
4	8
5	16
6	32
7	64

If you got this far, and were able to solve all the exercises - congratulations! - you've understood what amortized analysis is all about! And as an added benefit, Ex-2 is how one would go about analyzing the insertion cost for Dynamic Arrays, and Ex-3 is actually how one would analyze the running time for the KMP string matching algorithm!

PV=nRT or: How I learned to stop worrying and love cooking under pressure

I'll try to be as brief as possible here. Just going to talk about the physics behind Pressure Coking and why it's so cool (and hipster). Hey, it saves fuel and hence damages the environment to a much lesser much lesser extent than does normal cooking.

From How Does A Pressure Cooker Work?: "Simply put, water boils at 212°F (100°C). At this point, no matter how long you continue to boil, it always stays the same temperature. As the water evaporates and becomes steam it is also the same temperature, 212°F.

The only way to make the steam hotter (and/or to boil the water at a higher temperature) is to put the system under pressure. This is what a pressure cooker does. If we fit an absolutely tight cover to the pan so no steam can escape while we continue to add heat, both the pressure and temperature inside the vessel will rise. The steam and water will both increase in temperature and pressure, and each fluid will be at the same temperature and pressure as the other. "

To explain the last paragraph above, let's turn to physics and the Ideal Gas Law, which states that PV=nRT where,

P	=	Pressure
T	=	Temperature

and we don't really care about what the other symbols mean.

This means that Pressure and Temperature vary directly with each other, and if you raise the pressure of a fluid, then the temperature at which it changes state will also increase!!

Saturday, February 23, 2013

Parvorder Platyrrhini or: How I learned to stop worrying and love monkeys

As usual, this is going to be short.

I was coming back by train from Charni Road to Churchgate after a swim at Mafatlal Bath when an eunuch with a baby monkey (maybe less than 2 years old) walked in and started asking for money.

At first, I just ignored them since I had something going on in my head, but eventually they got my attention when the monkey started acting acrobatic in the train and started swinging from pole to pole and climbing the inner walls of the railway carriage (bogie in India).

I walked towards them and quietly handed the owner a ₹5 coin (that's about $0.1). What followed was pure ecstasy! The monkey climbed on me starting from the feet upwards and used my shorts and tee-shirt as support. I found it lodged in my arms like I would hold a human baby. It then walked all over my shoulders and lodged itself on my head (I had just cut my hair super tiny, so I guess that helped). It sat there patiently; probably observing the people around. I'm sure that the others felt as if I was the owner of the monkey! Never ever have I had something like this happen to me, so I was absolutely overjoyed with the proceedings. I wasn't even worried about getting all dirty right after my swim because this was just as fantastic a feeling as I could (not) think of!

Eventually, it was time to step off the train and the owner of the monkey started tugging at its leash, but the monkey would not budge! ha! After a few sharp tugs though, the baby relented and it was time to part ways.

Reminds me of:

Thursday, February 21, 2013

A great day for freedom

Today, I took Abbas, my dear dear dear friend (and a Muslim) to a pool where for the longest time, only Hindus were permitted to congregate and bathe. Only for the last 2 years has this ban on other religions been lifted (which I felt was unhealthy to start off with). Sense has been knocked into certain quarters!

It feels quite liberating to be able to have the option of not worrying about religion, etc... before associating with people. It's a person for crying out loud!! I remember my days in school where we didn't know anything about the concept of religion and never ever let that feature in the equation of friendship or acquaintanceship.

Why did we let religion govern our lives and let it stop us from doing harmless things we otherwise would have?

हिन्दू, मुसलिम, सिख, इसाई हम सब भाई भाई हैं ।

Sunday, December 09, 2012

Better Index Compression

It's late, so I'll keep it short and to the point.

An index is something that lets you get at data quickly. I consider a B-Tree as well as a sorted array an index. The time to lookup an element in an array is O(log₂ N) whereas that in a B-Tree is O(log_B N), where N & B mean their usual things.

If we just consider static structures (sorted array for example), and we wish to compress the data but maintain the retrieval speed, one can think of compressing blocks of size B and then performing a binary search over blocks. There are O(N/B) such blocks, so the cost of performing a binary search is O(log₂ N/B), which involves decompressing O(log₂ (N/B)) blocks of size B each.

Instead of B, if we chose O(log₂ N) elements to fit in a block (intuitively, assume that O(log₂ N) = kB, for some constant k > 0), we land up with O(N/log₂ N) blocks, each of size O(log₂ N), and the asymptotic cost to query remains the same, but we decompress O(log₂ (N/log₂ N)) blocks this time, which is asymptotically better than before.
If OTOH, we sample out every O(log₂ N)^th element from the sorted array, and create an uncompressed array from these elements, and make each element point to a compressed block containing all the elements between this element and the next element, then each element points to a compressed block containing O(log₂ N) elements.

Queries can be answered by performing a normal binary search on the array of sampled elements, and then at most 1 decompression operation is sufficient to locate the element in the compressed block. This is better than the O(log₂ (N/log₂ N)) blocks we had to decompress in the earlier scheme. The asymptotic cost of a query remains the same but we gain a lot by performing exponentially fewer decompression operations!!

Credits to Anirban for this explanation: Another way to look at it is to imagine a B-Tree built from these sampled elements, and each node in this B-Tree points to a compressed block containing O(log₂ N) elements. Alternatively, this can also be seen as a B-Tree built in such a way that internal nodes are stored uncompressed, but the leaf nodes are stored compressed.

Monday, November 05, 2012

Twitter API: Y U No respond fast & why

There is a very interesting post about an application developer running into issues with the Twitter API timeout of 5 sec. I would encourage everyone to read it if you haven't already.

Based on what I read (and some assumptions), I am going to try and comment on what the problem could be and potential solutions. I am greatly interested in criticisms of the analysis and possibly different ways of looking at the problem and possible solutions to it.

[A] Assumption: Twitter stores tweets for its users in an Inbox for every user. So, every time someone I am following makes a tweet, it is copied into my Inbox. This means that if 1M people follow me, my tweet is copied to the Inbox of 1M people. Let's call this value the fanout of the tweet.

Every time twitter needs to copy a tweet I make to the Inboxes of my followers, it needs to fetch a list of people that are following me. It does this by querying the database and fetching the list of people that have me as one of the people they are following. MySql (and other databases) do not allow you to perform efficient offset based queries, and you need to either:

Use cursors (which tie up server resources for the duration of the complete request) or
Perform a complete scan of all records till the requested record offset

every time an offset based query is made.

This is fairly wasteful of resources, whichever way you look at it, when in fact better solutions exist as mentioned in the post above.

[B] Assumption: When you perform an API request to Twitter to fetch the list of followers for a particular user, I am guessing that Twitter looks up a cache to locate the list and if it is unable to find it in the cache, it goes and looks up the DB for the list of these followers.

The reason that the API returns only the list of follower IDs (instead of complete user information) is so that they can avoid Point Queries which are notoriously inefficient. Of course, network bandwidth could be another reason.

[C] Now that we have established the parameters and behaviour of the system with which we are dealing, we can try and think of a solution.

Ensure that the list of followers we want to query are already in the cache so that we don't query the DB for the list. This can be done by making a web-request to the API without your credentials and then making the same request again with your credentials so that you aren't charged for the request that loads data into the cache. This will not always work, and Twitter might ban your IP.

Binary search over the size of the result set that twitter can reliably return for a user. Start with 1 follower, and ramp up to 2, 4, 8, 16, etc... (double) on every request till twitter times out. Subsequently, perform a binary search between the size that failed (say 128) and the size the succeeded (say 256). You are guaranteed to have fetched 36 follower IDs/request before you hit the request that timed out. Suppose the upper limit was 128, you will not get anything in the subsequent 8 requests, which means you still got an average of 17 follower IDs/request. Store this number, and use this number to start the next time you perform requests for this user.

Randomized solution: If you get 2 successive failures [timeouts] for a user, give up and try another user. With high probability, you will be able to query the followers of O(log n) users if you try to query O(c log n) distinct random users. This is because the probability of failure shrinks rapidly as you get consecutive failures for distinct users.

Do let me know if you spot something wrong or have other solutions.

Saturday, August 11, 2012

The words are not enough

In an earlier post I hinted at creating a book which teaches algorithms & data structures primarily using pictures. Here is a start. Caveat: Start from the bottom of the pin board to see the older posts first and work your way to the top.

The way I see it being used is that it should provide enough intuition that people should be able to look at the pictures and get the gist of the algorithm or data structure in under 3 minutes. You can always find more information (details) about the algorithm or data structure on other places on the internet. This book hopes to act as a place where one can:

Develop intuition about algorithms & data structures
View the guts of an algorithm or data structure graphically for easier recall
Quickly glance at the picture of an algorithm or data structure to recollect the underlying idea behind it (revision for interviews, exams, etc...)

I'll be extremely happy to receive feedback of any kind on this (what I think is a unique) way to learn algorithms and data structures so that I may progressively improve the quality of the content.

Thursday, August 02, 2012

SQL Query Execution Optimization opportunities missed: Part-1

There are some common queries that are used almost everywhere, but they seem to be not executed very cleverly by most (all??) SQL execution engines.

An example of one such query is the quintessential pagination query. Assume you have a schema for comments that looks like this:

CREATE TABLE COMMENTS(post_id INT, 
  comment_id INT, 
  comment_text TEXT,
  poster_id INT,
  posted_at TIMESTAMP,
  PRIMARY KEY (comment_id),
  UNIQUE KEY (post_id, posted_at, comment_id)
);

Now, you display the first page (say the most recent 10 comments) for a post (with id=33) using a query such as:

SELECT * FROM COMMENTS 
  WHERE post_id=33 
  ORDER BY posted_at DESC
  LIMIT 0,10;

The LIMIT 0,10 syntax fetches at most 10 comments from offset 0 in the result set.

The most natural way to fetch the 2nd page would be to say:

SELECT * FROM COMMENTS 
  WHERE post_id=33 
  ORDER BY posted_at DESC
  LIMIT 10,10;

Similarly, you can get the 3rd page by saying:

SELECT * FROM COMMENTS 
  WHERE post_id=33 
  ORDER BY posted_at DESC
  LIMIT 20,10;

and so on...

You must have noticed that in case LIMIT 1000,10 is specified, a naive execution of the query plan involves fetching all the 1000 rows before the 10 interesting rows are returned by the database. In fact, most databases, will do just that. There is in fact a better way to execute such queries.

(Let's leave aside the discussion of whether pagination queries should be implemented this way. The observant reader will notice that we have an index on (post_id, posted_at, comment_id), and that can be used to fetch the next 10 results given the post_id, posted_at, and comment_id of the immediately preceding comment).

We know that most (of not all) databases use B-Trees to store indexed information. It is easy to augment the B-Tree to hold information such as "how many elements under this node (including the node itself)?" This information alone will let us very quickly (in O(log n) I/O look-ups) locate the node of interest. Suppose we want the entry from the COMMENTS table at offset 1000 with post_id=33, we will first perform a look-up for the first entry with post_id=33. Once we find this key, we can quickly (in O(1) time) determine how many entries are less than the located entry (since we already have this information cached at each node). Let the found node be at offset OFF1. Subsequently, we can query the B-Tree to find the entry that has an offset of OFF1 + 1000 (since we have the cumulative counts cached for every value in the B-Tree node!).

For example, consider the Binary Tree shown below:

The alphabets in the top half of every node are the keys and the number in the bottom half of the node is the count of the number of nodes in the sub-tree rooted at that node.

We can answer the following query on this tree in O(log n) time:

SELECT KEY FROM BINARY_TREE
  WHERE KEY > D
  ORDER BY KEY
  LIMIT 3,2;

i.e. We want to fetch the 5 smallest keys greater than 'D', but we want only the last 2 from this set. i.e. The 2 greatest keys that are a subset of the set containing the 5 smallest keys greater than 'D' (read that a few times to get a hang of it).

We proceed as follows:

Find D
Find Rank(D) = 4
Find the KEY such that Rank(K) = Rank(D)+3 = 7. This happens to be the KEY 'G'
Perform an in-order traversal to find the in-order successor of the node that we are on till we either run out of nodes or we have walked 2 nodes (whichever comes first)

Note: To compute the rank of a node we use the following formula:

Rank(Node) = SubtreeSize(Node->Left) + 1 [if Node->Left exists]
Rank(Node) = 1                           [otherwise]

Hence, we can see that we are able to satisfy this query in time O(log n + Result_Set_Size), where n is the number of elements in the Binary Tree.

Monday, July 09, 2012

Why re-writing software is sometimes better than fixing the old stuff

This blog post is mostly about unstructured, un-thought-out, and mostly naive and fleeting thoughts about the software "industry" at large.

We always tend to build on what others have done, and hence stand on the proverbial shoulders of giants. This ensures that some steady progress is always made in the forward direction. Imagine every generation trying to start with the invention of the wheel. The car would probably never have been.

Simulated annealing on the other tends to suggest that while finding a solution to a problem, one might get stuck at a local-maximum, which in practice translates to architectural bottlenecks that a software might have, which prevent it's furtherance in the direction of positive progress. These are times when one needs to take a step back and look at the bigger picture, and probably try to re-architect the solution to the problem, or maybe even revisit the problem statement and see if it still is valid and holds in the current scenario.

(just realized my sentences are getting too long)

For example, if you try to scale a multi-threaded web-server, you've essentially run into a bottleneck with the number of threads/processes or the amount of memory you have (due to the per-thread/process memory overhead). However, if you take a step back and question the multi-threaded-ness of the web-server, you can think of select(2) or epoll(2) based solutions that use asynchronous I/O and do I/O multiplexing to handle thousands of simultaneous connections in a single thread.

It takes a lot of experience, good judgement, guts, and a sound-technical skull to know when to build on existing stuff and when to say "NO" and re-write the damn thing.

An algorithms book which mostly works using pictures

My algorithms professor, Dr. Michael Bender mentioned that explaining algorithms & data structures using pictures/diagrams/visuals, with text complementing them is is much more effective than teaching them using textual matter with diagrams complementing the text.

Hopefully sooner than later, I plan to create a pin board on pintrest with such pictures that will make your wildest data structures dreams come true! ~~(stay tuned for the link)~~ I've put it up here: Algorithms & data structures in pictures.

Dr. Bender actually manager to teach a whole class worth of students Van Emde Boas Trees in about an hour using this diagrammatic approach. In fact, a lot of the tricks were suggested by the students themselves because the way the structure was pictorially shown on the board.

Monday, July 02, 2012

[What If] An ad-supported MySql

Let's suppose that mysql was ad-supported. What would typical interactions with it look like?

mysql> CREATE TABLE table01(id INT, 
    name VARCHAR(400), 
    created TIMESTAMP default CURRENT_TIMESTAMP);
Query OK, 0 rows affected, 1 warning (0.06 sec)

mysql> show warnings;
+-------+------------+---------------------------------------+
| Level | Advertizer | Message                               |
+-------+------------+---------------------------------------+
| Ad    | Foo Bar    | Visit our store at http://foobar.com/ |
+-------+------------+---------------------------------------+
1 row in set (0.00 sec)

mysql> INSERT INTO table01 (id, name) VALUES(100, 'Billy Joel');
Query OK, 1 rows affected (0.02 sec)

...
...
...

mysql> SELECT * FROM table01 LIMIT 3
+-----+------------------------------------------------+---------------------+
| id  | name                                           | created             |
+-----+------------------------------------------------+---------------------+
| 0   | Sponsored Row - Visit us at http://bazbaz.com/ | 2006-04-06 17:17:49 |
| 100 | Billy Joel                                     | 2012-01-02 15:27:33 |
| 101 | Pink Floyd                                     | 2012-01-02 15:27:40 |
+-----+------------------------------------------------+---------------------+
3 rows in set (0.02 sec)

mysql> SELECT * FROM table01 LIMIT 30
+-----+------------------------------------------------------------+---------------------+
| id  | name                                                       | created             |
+-----+------------------------------------------------------------+---------------------+
| 0   | Sponsored Row - Visit us at http://bazbaz.com/             | 2006-04-06 17:17:49 |
| 100 | Billy Joel                                                 | 2012-01-02 15:27:33 |
| 101 | Pink Floyd                                                 | 2012-01-02 15:27:40 |
| 102 | Dire Straits                                               | 2012-01-02 15:27:45 |
| 0   | To see more than 4 rows at a time, visit http://mysql.com/ | 2012-01-02 20:00:07 |
+-----+------------------------------------------------------------+---------------------+
5 rows in set (0.00 sec)

Well, thank FSF for Free & open Source Software!!

Monday, May 07, 2012

Maharashtrian Curd Rice (Dahi Bhat)

Summer is here, and so is a tasty and easy to make dish - Maharashtrian Curd Rice. I remember having this during my stay at Pune and I absolutely loved it, so I decided to make it at home, and it turned out pretty good. Here is the recipe in text and video and a link to the youtube video for those of you who want to try it too!

Enjoy!

Wednesday, February 15, 2012

Coffee (flavoured) Yogurt

It's really simple to make, delicious to eat, and a healthy food!

Procedure:

Take about ¹/₂ tsp. instant coffee powder (Nescafe, etc...) and make a fine powder with your fingers. Put the powder in a bowl.
Take about 2¹/₂ tsp. sugar and add to the same bowl.
Take about 1 cup (240 ml) of cold yogurt and mix thoroughly with the coffee powder and sugar in the bowl using a spoon.
Once it has formed a light brown homogenous paste, eat to your heart's content!
Enjoy!

Saturday, February 04, 2012

Broccoli Delights!!

Broccoli goes really well with Ginger, Garlic, Butter, Olive Oil, Parsley, Basil and Oregano. Any dish that includes these ingredients must necessarily be heavenly!

Here are some recipes that sound delicious!

Tuesday, January 24, 2012

SQL Query Optimization - Part-9 (Covering Indexes)

To speed up query performance, use a Covering Index that covers (indexes) not just columns that will be used in the WHERE clause, but also columns that will be used in the projection list of the SELECT query.

See this video for a more detailed explanation.