Monday, November 05, 2012

Twitter API: Y U No respond fast & why

There is a very interesting post about an application developer running into issues with the Twitter API timeout of 5 sec. I would encourage everyone to read it if you haven't already.

Based on what I read (and some assumptions), I am going to try and comment on what the problem could be and on potential solutions. I am greatly interested in criticisms of the analysis, different ways of looking at the problem, and other possible solutions to it.

[A] Assumption: Twitter stores tweets for its users in an Inbox for every user. So, every time someone I am following makes a tweet, it is copied into my Inbox. This means that if 1M people follow me, my tweet is copied to the Inbox of 1M people. Let's call this value the fanout of the tweet.
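To make this assumption concrete, here is a toy sketch of such a fan-out-on-write scheme. The data structures and function names below are made up purely for illustration and are not how Twitter actually implements it:

from collections import defaultdict

followers = defaultdict(set)   # user -> set of users who follow them
inbox = defaultdict(list)      # user -> tweets delivered to their Inbox

def follow(follower, followee):
    followers[followee].add(follower)

def post_tweet(author, tweet):
    # The tweet is copied into the Inbox of every follower, so a tweet by
    # someone with 1M followers triggers 1M Inbox writes.
    for user in followers[author]:
        inbox[user].append(tweet)
    return len(followers[author])   # the fanout of this tweet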

Every time Twitter needs to copy a tweet I make to the Inboxes of my followers, it needs to fetch the list of people that are following me. It does this by querying the database for the list of people that have me as one of the people they are following. MySQL (and other databases) do not let you perform efficient offset-based queries, so every time an offset-based query is made you need to either:
  1. Use cursors (which tie up server resources for the duration of the complete request), or
  2. Perform a complete scan of all records up to the requested record offset.

This is fairly wasteful of resources, whichever way you look at it, when in fact better solutions exist as mentioned in the post above.

[B] Assumption: When you perform an API request to Twitter to fetch the list of followers for a particular user, I am guessing that Twitter looks up a cache to locate the list and if it is unable to find it in the cache, it goes and looks up the DB for the list of these followers.

The reason that the API returns only the list of follower IDs (instead of complete user information) is so that they can avoid Point Queries, which are notoriously inefficient. Of course, network bandwidth could be another reason.

[C] Now that we have established the parameters and behaviour of the system with which we are dealing, we can try and think of a solution.

  1. Ensure that the list of followers we want to query is already in the cache, so that we don't hit the DB for the list. This can be done by making a web request to the API without your credentials, and then making the same request again with your credentials, so that you aren't charged for the request that loads the data into the cache. This will not always work, and Twitter might ban your IP.

  2. Binary search over the size of the result set that Twitter can reliably return for a user. Start with 1 follower and keep doubling (2, 4, 8, 16, ...) on every request till Twitter times out. Subsequently, perform a binary search between the size that succeeded (say 128) and the size that failed (say 256). Before hitting the request that timed out, you have fetched 1 + 2 + ... + 128 = 255 follower IDs over 8 requests, i.e. roughly 32 follower IDs/request. If the upper limit really is 128, the binary search costs you about 7 more requests that return nothing new, which still leaves an average of about 17 follower IDs/request. Store this number, and use it as the starting size the next time you perform requests for this user. (A rough sketch of this strategy appears after this list.)

  3. Randomized solution: If you get 2 successive failures (timeouts) for a user, give up and try another user. With high probability, you will be able to fetch the followers of O(log n) users if you try O(c log n) distinct random users, because the probability of repeatedly hitting consecutive failures across distinct users shrinks rapidly.
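Here is a minimal sketch (in Python) of the doubling + binary-search strategy from point 2. Everything in it is hypothetical: request_succeeds() stands in for whatever API wrapper you use and is assumed to report whether Twitter answered a request for `count` follower IDs within its 5 second limit. In practice you would keep the IDs returned by the successful probes instead of throwing them away.

def request_succeeds(user, count):
    # Hypothetical helper: ask the API for `count` follower IDs of `user` and
    # report whether the request completed without timing out.
    try:
        fetch_follower_ids(user, count)   # assumed API wrapper; raises TimeoutError
        return True
    except TimeoutError:
        return False

def find_reliable_page_size(user, start=1):
    # Phase 1: keep doubling the requested page size until a request times out.
    size, last_ok = start, 0
    while request_succeeds(user, size):
        last_ok, size = size, size * 2
    # Phase 2: binary search between the size that succeeded and the size that failed.
    lo, hi = last_ok, size
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if request_succeeds(user, mid):
            lo = mid
        else:
            hi = mid
    return lo   # largest page size that reliably succeeds; remember it per user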

Do let me know if you spot something wrong or have other solutions.

Saturday, August 11, 2012

The words are not enough

In an earlier post I hinted at creating a book which teaches algorithms & data structures primarily using pictures. Here is a start. Caveat: Start from the bottom of the pin board to see the older posts first and work your way to the top.

The way I see it being used is that it should provide enough intuition that people can look at the pictures and get the gist of an algorithm or data structure in under 3 minutes. You can always find more information (details) about the algorithm or data structure elsewhere on the internet. This book hopes to act as a place where one can:
  • Develop intuition about algorithms & data structures
  • View the guts of an algorithm or data structure graphically for easier recall
  • Quickly glance at the picture of an algorithm or data structure to recollect the underlying idea behind it (revision for interviews, exams, etc...)
I'll be extremely happy to receive feedback of any kind on this (what I think is a unique) way to learn algorithms and data structures so that I may progressively improve the quality of the content.

Thursday, August 02, 2012

SQL Query Execution Optimization opportunities missed: Part-1

There are some common queries that are used almost everywhere, but they do not seem to be executed very cleverly by most (all?) SQL execution engines.

An example of one such query is the quintessential pagination query. Assume you have a schema for comments that looks like this:


CREATE TABLE COMMENTS(post_id INT, 
  comment_id INT, 
  comment_text TEXT,
  poster_id INT,
  posted_at TIMESTAMP,
  PRIMARY KEY (comment_id),
  UNIQUE KEY (post_id, posted_at, comment_id)
);


Now, you display the first page (say the most recent 10 comments) for a post (with id=33) using a query such as:

SELECT * FROM COMMENTS 
  WHERE post_id=33 
  ORDER BY posted_at DESC
  LIMIT 0,10;


The LIMIT 0,10 syntax fetches at most 10 comments from offset 0 in the result set.

The most natural way to fetch the 2nd page would be to say:

SELECT * FROM COMMENTS 
  WHERE post_id=33 
  ORDER BY posted_at DESC
  LIMIT 10,10;


Similarly, you can get the 3rd page by saying:

SELECT * FROM COMMENTS 
  WHERE post_id=33 
  ORDER BY posted_at DESC
  LIMIT 20,10;


and so on...

You must have noticed that when LIMIT 1000,10 is specified, a naive execution of the query plan involves fetching (and discarding) the first 1000 rows before the 10 interesting rows are returned by the database. Most databases will, in fact, do just that. There is, however, a better way to execute such queries.

(Let's leave aside the discussion of whether pagination queries should be implemented this way. The observant reader will notice that we have an index on (post_id, posted_at, comment_id), and that can be used to fetch the next 10 results given the post_id, posted_at, and comment_id of the immediately preceding comment).
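For completeness, here is a rough sketch of that index-friendly ("keyset") style of pagination. It uses Python's built-in sqlite3 module purely for illustration; the table matches the COMMENTS schema above, and the WHERE clause seeks past the last row of the previous page instead of using an OFFSET:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE COMMENTS(post_id INT, comment_id INT PRIMARY KEY,
                comment_text TEXT, poster_id INT, posted_at TIMESTAMP)""")
conn.execute("CREATE UNIQUE INDEX post_time ON COMMENTS(post_id, posted_at, comment_id)")

def next_page(post_id, last_posted_at, last_comment_id, page_size=10):
    # Fetch the next page of comments strictly older than the last comment of the
    # previous page, letting the (post_id, posted_at, comment_id) index do the
    # seeking instead of scanning and discarding OFFSET rows.
    return conn.execute(
        """SELECT * FROM COMMENTS
           WHERE post_id = ?
             AND (posted_at < ? OR (posted_at = ? AND comment_id < ?))
           ORDER BY posted_at DESC, comment_id DESC
           LIMIT ?""",
        (post_id, last_posted_at, last_posted_at, last_comment_id, page_size)
    ).fetchall()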

We know that most (if not all) databases use B-Trees to store indexed information. It is easy to augment the B-Tree so that every node also stores "how many elements are under this node (including the node itself)?" This information alone lets us locate the node at any given offset very quickly (in O(log n) I/O look-ups). Suppose we want the entry from the COMMENTS table at offset 1000 with post_id=33. We first perform a look-up for the first entry with post_id=33. Once we find this key, we can quickly (in O(1) time) determine how many entries are less than the located entry (since we already have this information cached at each node). Let the found entry be at offset OFF1. Subsequently, we can query the B-Tree for the entry at offset OFF1 + 1000 (since we have the cumulative counts cached for every value in the B-Tree node!).

For example, consider the Binary Tree shown below:



The letters in the top half of every node are the keys, and the number in the bottom half of the node is the count of nodes in the sub-tree rooted at that node.

We can answer the following query on this tree in O(log n) time:

SELECT KEY FROM BINARY_TREE
  WHERE KEY > D
  ORDER BY KEY
  LIMIT 3,2;


i.e. we want to fetch the 5 smallest keys greater than 'D', but keep only the last 2 from this set: the 2 greatest keys among the 5 smallest keys greater than 'D' (read that a few times to get the hang of it).

We proceed as follows:
  1. Find D
  2. Find Rank(D) = 4
  3. Find the KEY such that Rank(K) = Rank(D)+3 = 7. This happens to be the KEY 'G'
  4. Perform an in-order traversal to find the in-order successor of the node that we are on till we either run out of nodes or we have walked 2 nodes (whichever comes first)
Note: To compute the rank of a node we use the following formula:

Rank(Node) = SubtreeSize(Node->Left) + 1 [if Node->Left exists]
Rank(Node) = 1                           [otherwise]


Hence, we can see that we are able to satisfy this query in time O(log n + Result_Set_Size), where n is the number of elements in the Binary Tree.
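As a rough illustration of the idea (and not of how any particular database engine implements it), here is a tiny unbalanced BST in Python where every node caches its subtree size. With that, rank() and select() both run in time proportional to the height of the tree:

class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.size = key, None, None, 1

def size(node):
    return node.size if node else 0

def insert(node, key):
    # Plain (unbalanced) BST insert that keeps the cached subtree sizes correct.
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    return node

def rank(node, key):
    # Number of keys <= key, i.e. the 1-based rank of key if it is present.
    if node is None:
        return 0
    if key < node.key:
        return rank(node.left, key)
    return size(node.left) + 1 + rank(node.right, key)

def select(node, k):
    # Return the key with 1-based rank k, i.e. the key at offset k-1.
    if node is None or k < 1 or k > node.size:
        return None
    left = size(node.left)
    if k == left + 1:
        return node.key
    if k <= left:
        return select(node.left, k)
    return select(node.right, k - left - 1)

For the LIMIT 3,2 query above you would compute r = rank(root, 'D') and then return select(root, r + 4) and select(root, r + 5), which is exactly the "find 'G', then walk 2 in-order successors" procedure described in the steps above.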

Monday, July 09, 2012

Why re-writing software is sometimes better than fixing the old stuff

This blog post is mostly about unstructured, un-thought-out, and mostly naive and fleeting thoughts about the software "industry" at large.

We always tend to build on what others have done, and hence stand on the proverbial shoulders of giants. This ensures that some steady progress is always made in the forward direction. Imagine every generation trying to start with the invention of the wheel. The car would probably never have been.

Simulated annealing, on the other hand, tends to suggest that while finding a solution to a problem, one might get stuck at a local maximum. In practice this translates to architectural bottlenecks in a piece of software that prevent its furtherance in the direction of positive progress. These are the times when one needs to take a step back and look at the bigger picture, probably try to re-architect the solution to the problem, or maybe even revisit the problem statement and see if it still is valid and holds in the current scenario.

(just realized my sentences are getting too long)

For example, if you try to scale a multi-threaded web-server, you eventually run into a bottleneck with the number of threads/processes or the amount of memory you have (due to the per-thread/process memory overhead). However, if you take a step back and question the multi-threaded-ness of the web-server, you can think of select(2) or epoll(2) based solutions that use asynchronous I/O and I/O multiplexing to handle thousands of simultaneous connections in a single thread.
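As a toy example of that single-threaded, multiplexed style, here is a minimal echo server using Python's standard selectors module (which sits on top of select/epoll/kqueue). One thread services every connection:

import selectors, socket

sel = selectors.DefaultSelector()          # picks epoll/kqueue/select as available

def accept(server):
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, echo)

def echo(conn):
    data = conn.recv(4096)
    if data:
        conn.sendall(data)                 # echo the bytes back to the client
    else:
        sel.unregister(conn)               # client closed the connection
        conn.close()

server = socket.socket()
server.bind(("0.0.0.0", 8080))
server.listen(128)
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:                                # a single thread multiplexes all connections
    for key, _ in sel.select():
        key.data(key.fileobj)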

It takes a lot of experience, good judgement, guts, and a sound-technical skull to know when to build on existing stuff and when to say "NO" and re-write the damn thing.

An algorithms book which mostly works using pictures

My algorithms professor, Dr. Michael Bender, mentioned that explaining algorithms & data structures using pictures/diagrams/visuals, with text complementing them, is much more effective than teaching them using textual matter with diagrams complementing the text.

Hopefully sooner rather than later, I plan to create a pin board on Pinterest with such pictures that will make your wildest data structures dreams come true! (stay tuned for the link) I've put it up here: Algorithms & data structures in pictures.

Dr. Bender actually managed to teach a whole class worth of students Van Emde Boas Trees in about an hour using this diagrammatic approach. In fact, a lot of the tricks were suggested by the students themselves because of the way the structure was pictorially shown on the board.

Monday, July 02, 2012

[What If] An ad-supported MySQL

Let's suppose that MySQL was ad-supported. What would typical interactions with it look like?

mysql> CREATE TABLE table01(id INT, 
    name VARCHAR(400), 
    created TIMESTAMP default CURRENT_TIMESTAMP);
Query OK, 0 rows affected, 1 warning (0.06 sec)

mysql> show warnings;
+-------+------------+---------------------------------------+
| Level | Advertizer | Message                               |
+-------+------------+---------------------------------------+
| Ad    | Foo Bar    | Visit our store at http://foobar.com/ |
+-------+------------+---------------------------------------+
1 row in set (0.00 sec)

mysql> INSERT INTO table01 (id, name) VALUES(100, 'Billy Joel');
Query OK, 1 row affected (0.02 sec)

...
...
...

mysql> SELECT * FROM table01 LIMIT 3
+-----+------------------------------------------------+---------------------+
| id  | name                                           | created             |
+-----+------------------------------------------------+---------------------+
| 0   | Sponsored Row - Visit us at http://bazbaz.com/ | 2006-04-06 17:17:49 |
| 100 | Billy Joel                                     | 2012-01-02 15:27:33 |
| 101 | Pink Floyd                                     | 2012-01-02 15:27:40 |
+-----+------------------------------------------------+---------------------+
3 rows in set (0.02 sec)

mysql> SELECT * FROM table01 LIMIT 30
+-----+------------------------------------------------------------+---------------------+
| id  | name                                                       | created             |
+-----+------------------------------------------------------------+---------------------+
| 0   | Sponsored Row - Visit us at http://bazbaz.com/             | 2006-04-06 17:17:49 |
| 100 | Billy Joel                                                 | 2012-01-02 15:27:33 |
| 101 | Pink Floyd                                                 | 2012-01-02 15:27:40 |
| 102 | Dire Straits                                               | 2012-01-02 15:27:45 |
| 0   | To see more than 4 rows at a time, visit http://mysql.com/ | 2012-01-02 20:00:07 |
+-----+------------------------------------------------------------+---------------------+
5 rows in set (0.00 sec)


Well, thank the FSF for Free & Open Source Software!!

Monday, May 07, 2012

Maharashtrian Curd Rice (Dahi Bhat)

Summer is here, and so is a tasty and easy-to-make dish - Maharashtrian Curd Rice. I remember having this during my stay in Pune and I absolutely loved it, so I decided to make it at home, and it turned out pretty good. Here is the recipe in text and video, and a link to the YouTube video for those of you who want to try it too!

Enjoy!

Wednesday, February 15, 2012

Coffee (flavoured) Yogurt

It's really simple to make, delicious to eat, and a healthy food!

Procedure:
  • Take about 1/2 tsp. instant coffee powder (Nescafe, etc...) and make a fine powder with your fingers. Put the powder in a bowl.
  • Take about 2 1/2 tsp. sugar and add it to the same bowl.
  • Take about 1 cup (240 ml) of cold yogurt and mix thoroughly with the coffee powder and sugar in the bowl using a spoon.
  • Once it has formed a light brown homogeneous paste, eat to your heart's content!
  • Enjoy!

Saturday, February 04, 2012

Broccoli Delights!!

Broccoli goes really well with Ginger, Garlic, Butter, Olive Oil, Parsley, Basil and Oregano. Any dish that includes these ingredients must necessarily be heavenly!

Here are some recipes that sound delicious!

Tuesday, January 24, 2012

SQL Query Optimization - Part-9 (Covering Indexes)

To speed up query performance, use a Covering Index that covers (indexes) not just columns that will be used in the WHERE clause, but also columns that will be used in the projection list of the SELECT query.

See this video for a more detailed explanation.
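As a tiny illustration (using Python's built-in sqlite3 module; the table and column names are made up for the example), the second index below covers both the column in the WHERE clause and the projected column, so the query can be answered from the index alone without touching the table rows:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users(id INTEGER PRIMARY KEY, city TEXT, name TEXT, bio TEXT)")

# Index only on the filter column: the engine must still visit the table to fetch `name`.
conn.execute("CREATE INDEX idx_city ON users(city)")

# Covering index: both the filter column (city) and the projected column (name)
# live in the index, so the base table never needs to be read for this query.
conn.execute("CREATE INDEX idx_city_name ON users(city, name)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE city = ?", ("Pune",)
).fetchall()
print(plan)   # SQLite should report 'USING COVERING INDEX idx_city_name' here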

Sunday, January 08, 2012

Solve this recurrence

T(n) = T(n - √n) + T(√n) + n

Post your answers/solutions in the comments. We were asked this in the final exam in CSE-548 (Analysis Of Algorithms taught by Dr. Michael Bender).

Ayurvedic Way of Life

Follow the principles mentioned here and you should be good to go!!

Also, here is another very informative article about the ayurvedic view on food.

Broccoli should be had fresh!!

Broccoli should be had fresh and not frozen.
  1. Frozen Broccoli is not as firm as Fresh Broccoli
  2. Frozen Broccoli doesn't taste as good as Fresh Broccoli
  3. Frozen Broccoli doesn't contain the broccoli stalk, which is not only more nutritious but also more fibrous than the broccoli florets
  4. According to this page, "frozen broccoli has twice as much sodium as fresh (up to 68 mg per 10 oz. package), about half the calcium, and smaller amounts of iron, thiamin, riboflavin, and vitamin C."
Peas on the other hand should be had frozen since (from here), "Peas are high in sugar content, so they have to be blanched before freezing to stop enzyme activity from turning sugars into starch and ruining their fresh sweet flavor."

Focaccia reprise

Made Focaccia Bread again after about 2 years of absence from baking. Contains garlic, thyme, oregano, dried basil and olive oil. This is how it turned out:
Ingredients, motivation and housing courtesy Poorvi Matani, and oven & electricity courtesy Panna Parikh. A big thanks to my teachers at IHM and my mom who encouraged my culinary streak ;)

This version lacked rosemary since I didn't have ready access to any. Next batch should have that plus olives, sun-dried tomatoes and maybe some other veggies.

For the complete recipe:

  1. Prepare a mixture of all the herbs (seasonings, namely thyme, oregano, basil) and virgin olive oil. You can use fresh basil instead of dried basil
  2. Saute garlic (important step) in butter and add to the olive oil after it's somewhat cooled
  3. Let this sit in a closed container for anywhere between 4 hours to 1 week
  4. Make the bread dough as explained in this post (or any other bread dough recipe that you are comfortable with)
  5. After the first rise (step-7 in the post), knead in the olive oil and seasoning mixture into the dough, and let it rise a 2nd time in a flat pan (as shown in the picture above)
  6. After it has almost doubled in size, plonk it carefully into a pre-heated oven (410F) for ~15-20 minutes or till it's done
  7. Enjoy!


The last post about something similar was this recipe for Garlic & Herb Chapati.

Monday, October 24, 2011

What business do you want to be in?

Dr. Bender once mentioned in class that memories are created when one experiences strong emotions, and, in the context of algorithms, when one discovers a solution to some problem.

About a month down, I've noticed that experiences are what I'm going to die with. I vividly remember this episode in school when we were in the middle of the gymnastics competition and were one short of someone on the roman rings. I decided to give it a shot (even though I had never tried it before) and 2 minutes later I found myself stuck "up there". I don't remember all the things that I did right or who eventually won, but I surely do remember getting stuck!

If you are from my school and remember this episode, please leave a comment. I'll be looking forward to hearing from you!

I want to be in the business of creating memories.

Edit: Fixed the incorrect quote in the first line.

Tuesday, September 20, 2011

The Integer (0/1) Bounded Knapsack Problem

Quoting the problem statement from Petr's blog:

"The bounded knapsack problem is: you are given n types of items, you have ui items of ith type, and each item of ith type weighs wi and costs ci. What is the maximal cost you can get by picking some items weighing at most W in total?"

Here, we assume that the knapsack can hold a weight at most W.

We shall describe 3 solutions to this problem, each of which reduces the complexity of the solution by a certain amount.

  1. O(W*n*M): The first solution is the standard knapsack solution, which just involves treating each of the ui instances of the ith item type as a separate item, each having the same wi & ci as the original item type, and then running the standard 0/1 knapsack algorithm over all of these items.

    It is easy to see that if we have an average of M items of each type, the complexity of this solution is O(W * n * M)
  2. O(W*n*log(M)): The second solution involves rewriting items such that we have a fewer number of items of each type to work with. We shall then solve the transformed problem using the same technique as the standard knapsack problem (as in the solution above).

    Suppose we have 21 instances of the ith item type, each weighing wi and costing ci. We observe that 21 can be rewritten (using powers of 2) as:
    21 = 1 + 2 + 4 + 8 + 6

    We keep adding increasing powers of 2 (1, 2, 4, 8, ...) for as long as the running sum stays below the number we want to reach (21 in this case), and then add the remaining difference (6 in this case) between that number and the current sum (15 in this case = 1+2+4+8).

    Now, the new items are created by combining 1, 2, 4, 8, and 6 items of the original instances. These newly created items will have the following weights and costs:

    Number of original items combined    Weight    Cost
    1                                    1*wi      1*ci
    2                                    2*wi      2*ci
    4                                    4*wi      4*ci
    8                                    8*wi      8*ci
    6                                    6*wi      6*ci


    Hence, we have effectively reduced the number of items from 21 to 5. In general, we replace the M in the solution above with log(M) to get a final complexity of O(W * n * log(M)).
  3. O(W*n): The third and most efficient solution to this problem is the one that took me almost 3 days to figure out and understand. The short post on Petr's blog actually contains all the information needed to understand it, but I'm not that comfortable reading terse explanations, especially when it's a new idea, so here is something more verbose.

    There is also a paper and a section in a book dedicated to this problem, but I find them quite hard to follow as well.

    Let's assume that the row below (in the DP table of Items (rows) v/s Weight (columns)) represents the best configuration (maximum cost) of the knapsack at every weight 0 .. W, using all item types from 1 .. (k-1).

    Weight       0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
    Item (k-1)   0   5   5   7  15  15  19  20  21  23  30  33  42  42  49  51


    So, the basic observation is that DP[k-1][w] represents the best solution including all items from 1 .. (k-1) (both inclusive) with a knapsack of weight (w).

    To compute DP[k][w], we consider the possibility of adding either 0, 1, 2, 3, 4, ..., uk items of the kth type to the previous best solution including all items from 1 .. (k-1). This gives us the following recurrence:

    DP[k][w] = max(
        DP[k-1][w],
        DP[k-1][w - 1*wk] + 1*ck,
        DP[k-1][w - 2*wk] + 2*ck,
        DP[k-1][w - 3*wk] + 3*ck,
        ...,
        DP[k-1][w - uk*wk] + uk*ck)

    An interesting observation here would be that we only deal with every wkth element from the previous array to decide on any element in the current array.

    Let us for a minute ignore the fact that we have only a fixed number of items of each type and instead assume that we have an infinite number of items of each type. Let us assume that item k weighs 3 and costs 9. If we make that assumption, we see that the best solution at DP[k][12] is obtained by taking the maximum of the candidate cells shown in the third row below.

    Weight       0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
    Item (k-1)   0   5   5   7  15  15  19  20  21  23  30  33  42  42  49  51
    Candidates  36          34          37          32          42


    What we have essentially done is added:
    0*9 to DP[k-1][12]
    1*9 to DP[k-1][9]
    2*9 to DP[k-1][6]
    3*9 to DP[k-1][3] and
    4*9 to DP[k-1][0]

    The problem has reduced to finding the maximum value among all the values of the array that is formed by picking every wkth element from the previous best solution and adding some multiple of ck to it.

    We also notice that it doesn't matter what values we add to DP[k-1][w] (for the kth item) as long as they are ck apart. Hence, if the weight of the knapsack is W and the weight of the kth item is wk, we can add the following to every wkth element in the previous array.

    Column to which value is added    Value added
    0                                 (W/wk)*ck
    1*wk                              (W/wk - 1)*ck
    2*wk                              (W/wk - 2)*ck
    3*wk                              (W/wk - 3)*ck

    If we now reintroduce the requirement that each item type has only a bounded or fixed number of instances (i.e. uk is bounded), then the problem reduces to finding the maximum value in every window of uk+1 consecutive entries (corresponding to taking 0 through uk copies of the item) in the array we just computed. This can easily be done by pre-processing the array to maintain the largest element in every such window.

    However, the actual value of the cost at any cell needs to be computed by adding back any differential that we may have subtracted to ensure that every entry is ck apart and that the first entry is the one which has the maximum multiple of ck subtracted from it. For the jth array element in the current sub-array we are processing, this value will be (W/wk - j)*ck (since we have subtracted that much extra from those elements, we must now add it back).

    Hence, we can safely say that we shall have wk different sub-problems that we shall try to solve assuming that we can use a single computation for every wkth element.

    This allows us to find the value at every cell of the (k x W) matrix in amortized O(1) time (we have amortized the cost of pre-computing the wk arrays across all the element accesses). This makes the total running time of this algorithm O(W * n). A short runnable sketch of this algorithm follows.
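To make the O(W * n) idea concrete, here is a short sketch in Python. This is not Petr's code; it realizes the sliding-window maximum described above with a monotone deque, one deque per residue class modulo wk:

from collections import deque

def bounded_knapsack(W, items):
    # items is a list of (weight, cost, count) triples.
    # Returns the maximum total cost achievable with total weight <= W.
    dp = [0] * (W + 1)                         # dp[w] = best cost using the items seen so far
    for w, c, u in items:
        new_dp = dp[:]
        for r in range(min(w, W + 1)):         # one sliding window per residue class mod w
            dq = deque()                       # holds (t, dp[r + t*w] - t*c), values decreasing
            for t, j in enumerate(range(r, W + 1, w)):
                val = dp[j] - t * c            # normalize so entries ck apart become comparable
                while dq and dq[-1][1] <= val:
                    dq.pop()
                dq.append((t, val))
                if dq[0][0] < t - u:           # only sources within u copies of position t may be used
                    dq.popleft()
                new_dp[j] = dq[0][1] + t * c   # add the subtracted differential back
        dp = new_dp
    return dp[W]

# Example: with W = 10 and up to 2 copies of an item weighing 2 and costing 3,
# the best we can do is 2 copies for a total cost of 6.
print(bounded_knapsack(10, [(2, 3, 2)]))   # prints 6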

Thursday, July 14, 2011

Vegetable Ratatouille

Here we go!!

So, we'll be following this recipe for the most part. There are some deviations that I shall mention here. These are the only things I changed from the recipe above.

Tomatoes: We shall make a Ratatouille that has a red (tomato) gravy and the vegetables will be added to that gravy. For this, you'll need more tomatoes (at least 4 more).

Basil: I used dried basil instead of fresh basil, since I couldn't find any where I stay. I also used less oil at every stage (use about half the amount of oil everywhere).

Olive Oil: Do NOT use Extra-Virgin Olive Oil for cooking. Do NOT use Olive-Pomace Oil since it doesn't have a good taste. Use ONLY Pure-Olive-Oil.

Salt: ALWAYS use Salt to-taste instead of measuring it since the saltiness of salt varies from region to region.

Ginger: I also use grated ginger (or ginger paste) along with the garlic pieces.

Bay Leaf: I also used bay-leaves (3 nos.) while sautéing the onions (along with the garlic and ginger).

After the onions are half done, add pieces of (at least) 4 tomatoes and cook till the tomatoes are soft. Mash them to form a paste. Follow the recipe mentioned here. Once the other vegetables are cooked, add them to this tomato gravy and cook for about 3 mins. (or till steaming). Cover and let the vegetables soak in the aroma for about 15 mins. Serve with rice (you can use brown rice since it tastes really good and is a healthier option compared with white rice).

Wednesday, July 06, 2011

You know you are working with the right people when ...

... you get this in the post :) :)


Thanks @yegg :-)

Friday, June 10, 2011

A plea to all software developers: Use Units

Have you ever come across configuration files that read:

MAX_BUFFER_SIZE = 4096
MAX_TIMEOUT     = 20

WTF!!

4096 what?? bits, bytes, kilobytes, megabytes?

20 what?? milli-seconds, seconds?

Please, please make units part of the config option's name and say:

MAX_BUFFER_SIZE_BYTES = 4096
MAX_TIMEOUT_MILLISEC  = 20