Wednesday, December 29, 2010

Practical SVD for the curious reader

Let's see an example of how SVD can be used to perform seemingly daunting tasks.
Consider that we are implementing a rating system for a contest and we want to know which players may have similar skill. Additionally, we would like to know the difficulty of the tasks in the contest, since it is not known a priori.

We build a matrix that has the player number on the y-axis and the task number on the x-axis. Let us suppose that we roughly know the first 3 problems to be easy, the next 3 to be of medium difficulty and the last 3 to be hard. The resulting matrix looks like this:

# 0  1  2  3  4  5  6  7  8
[ 1, 1, 1, 0, 0, 0, 0, 0, 0 ], #0
[ 1, 1, 1, 1, 0, 0, 1, 0, 0 ], #1
[ 1, 0, 0, 0, 0, 1, 1, 1, 0 ], #2
[ 1, 1, 1, 1, 1, 1, 0, 0, 0 ], #3
[ 1, 0, 1, 1, 0, 0, 1, 1, 1 ], #4
[ 1, 0, 1, 0, 0, 0, 0, 0, 0 ], #5
[ 1, 0, 0, 1, 1, 1, 0, 0, 0 ], #6
[ 1, 0, 0, 1, 1, 1, 1, 1, 1 ], #7

Now, we compute the SVD of this matrix and keep only the top 4 features (i.e. the 4 largest singular values and their corresponding columns).
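
The matrices below look like NumPy output. A minimal sketch of this step with numpy.linalg.svd (the variable names are mine, and the signs of the columns of U and V can legitimately differ between SVD implementations) would be:

import numpy as np

# The 8 x 9 player-vs-task matrix from above.
ratings = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 1, 1, 1, 1],
], dtype=float)

k = 4                                        # number of features to keep
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
U4 = U[:, :k]                                # 8 x 4, one row per player
S4 = np.diag(s[:k])                          # 4 x 4, singular values on the diagonal
V4 = Vt[:k, :].T                             # 9 x 4, one row per task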

We get [U=]
[[-0.22828916 -0.50341088  0.06758431 -0.27577721]
 [-0.3796649  -0.38290566  0.21189598  0.16725752]
 [-0.29514798  0.33343393  0.18247298 -0.80693185]
 [-0.4283514  -0.28592541 -0.50767041  0.05970937]
 [-0.42379852  0.09624233  0.5690528   0.38393312]
 [-0.18451958 -0.30010945  0.12296957 -0.24683212]
 [-0.31511921  0.15603403 -0.56486439  0.03629706]
 [-0.46924257  0.53231036 -0.03863212  0.17781851]]

We get [S=]
[[ 4.86583872  0.          0.          0.        ]
 [ 0.          2.40125574  0.          0.        ]
 [ 0.          0.          2.02979097  0.        ]
 [ 0.          0.          0.          1.29857912]]

We get [V=]
[[-0.55984867 -0.14756061  0.02109021 -0.38852126]
 [-0.21297571 -0.48817872 -0.11242051 -0.03758748]
 [-0.33799385 -0.57307894  0.22851232  0.06799022]
 [-0.41435336  0.0482063  -0.16268579  0.63532176]
 [-0.24923004  0.16758689 -0.54742924  0.21086504]
 [-0.3098872   0.30644504 -0.45753181 -0.41053094]
 [-0.32221659  0.24115755  0.45560831 -0.06000612]
 [-0.24418998  0.40061814  0.35121531 -0.18880653]
 [-0.18353282  0.26176     0.26131788  0.43258946]]

Next, we need to compute U*S and V*S.
What is the significance of these matrices?
The matrix U*S will allow us to find users that are similar in skill to each other.
The matrix V*S will allow us to determine problems that have been misclassified or are actually similar in difficulty to each other.

After computing the U*S and V*S matrices, we take the row-wise sum of squared differences (SOSD) and look for the pairs of rows that have the smallest such difference. These are the rows that are of interest to us.
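
Continuing the sketch above, U*S, V*S and the pairwise SOSD values can be computed roughly like this (the sosd helper is just for illustration; SOSD is unaffected by the sign ambiguity mentioned earlier, though rounding may differ):

from itertools import combinations

US = U4.dot(S4)                    # players in the 4-dimensional feature space
VS = V4.dot(S4)                    # tasks in the same feature space

def sosd(M):
    """All row pairs of M with their sum of squared differences."""
    out = []
    for i, j in combinations(range(M.shape[0]), 2):
        d = M[i] - M[j]
        out.append((i, j, round(float(d.dot(d)), 4)))
    return out

print(sorted(sosd(US), key=lambda p: p[2])[:8])   # most similar player pairs
print(sorted(sosd(VS), key=lambda p: p[2])[:8])   # most similar task pairs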

We get [SOSD(U*S)=]
[(0, 1, 1.0430999999999999),
 (0, 2, 4.6740000000000004),
 (0, 3, 2.7736000000000001),
 (0, 4, 4.7484000000000002),
 (0, 5, 0.29770000000000002),
 (0, 6, 4.4981999999999998),
 (0, 7, 7.9534000000000002),
 (1, 2, 4.7319000000000004),
 (1, 3, 2.2631000000000001),
 (1, 4, 1.9745999999999999),
 (1, 5, 1.2628999999999999),
 (1, 6, 4.2881999999999998),
 (1, 7, 5.2785000000000002),
 (2, 3, 5.8609),
 (2, 4, 3.7233999999999998),
 (2, 5, 3.1476999999999999),
 (2, 6, 3.6909999999999998),
 (2, 7, 2.7824),
 (3, 4, 5.7964000000000002),
 (3, 5, 3.2058),
 (3, 6, 1.4441999999999999),
 (3, 7, 4.8299000000000003),
 (4, 5, 3.7522000000000002),
 (4, 6, 5.8014999999999999),
 (4, 7, 2.7383999999999999),
 (5, 6, 3.6880000000000002),
 (5, 7, 6.3265000000000002),
 (6, 7, 2.5535000000000001)]

Analysis of SOSD(U*S): Users #0 & #5 seem to be similarly skilled. The following pairs of users also seem to be of similar calibre: (0, 1), (0, 5), (1, 4), (1, 5), (3, 6), (4, 7), (6, 7)
Some of this analysis may seem questionable, especially considering that we designated the last 3 problems to be hard. How are users #1 & #4 similar if this is the case?
Let's look at the V*S matrix.

We get [SOSD(V*S)=]
[(0, 1, 3.7989000000000002),
 (0, 2, 2.7381000000000002),
 (0, 3, 2.629),
 (0, 4, 4.7946),
 (0, 5, 3.6124999999999998),
 (0, 6, 3.1680999999999999),
 (0, 7, 4.6081000000000003),
 (0, 8, 5.6936999999999998),
 (1, 2, 0.9093),
 (1, 3, 3.3931),
 (1, 4, 3.3944000000000001),
 (1, 5, 4.5884),
 (1, 6, 4.6798999999999999),
 (1, 7, 5.5022000000000002),
 (1, 8, 4.2117000000000004),
 (2, 3, 3.5369999999999999),
 (2, 4, 5.8647999999999998),
 (2, 5, 6.8044000000000002),
 (2, 6, 4.0688000000000004),
 (2, 7, 5.8483000000000001),
 (2, 8, 4.8121),
 (3, 4, 1.6414),
 (3, 5, 2.8456000000000001),
 (3, 6, 2.806),
 (3, 7, 3.6351),
 (3, 8, 2.3344),
 (4, 5, 0.88270000000000004),
 (4, 6, 4.4261999999999997),
 (4, 7, 3.9102999999999999),
 (4, 8, 2.931),
 (5, 6, 3.6707999999999998),
 (5, 7, 2.931),
 (5, 8, 3.7172000000000001),
 (6, 7, 0.36359999999999998),
 (6, 8, 1.0225),
 (7, 8, 0.88270000000000004)]

The following pairs of problems seem to be equally difficult/easy: (3, 4), (1, 2), (4, 5), (6, 8), (7, 8), (6, 7)
You can immediately see a clique of (6, 7, 8) which were all originally designated as hard problems!
Similarly, (3, 4, 5) seem to be related and so do (1, 2). The reason that problem #0 cannot be classified by the system could be that, since it has been solved by everyone, not much can be said about it conclusively.
You would have noticed that users #1 & #4 have both solved problems #0, #2, #3 & #6. As it turns out, it seems that the features (as deduced by the system) have given a higher weightage to one or more of these problems, which is why users #1 & #4 have been clustered together.

Tuesday, December 28, 2010

SVD 101

I've been trying to find out more about SVD (Singular Value Decomposition) in the last few days, and here is what I have come up with (links marked with a star[*] MUST be READ). I'll try updating it as and when possible.
  1. http://www.funpecrp.com.br/GMR/year2007/vol4-6/xm0017_full_text.html
  2. http://www.cs.um.edu.mt/~csaw/CSAW04/Proceedings/09.pdf
  3. http://www.math.umn.edu/~lerman/math5467/svd.pdf
  4. http://www.seobook.com/lsi/tdm.htm
  5. http://web.mit.edu/snikolov/www/topicweb_verbose.pdf
  6. http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/svd.html
  7. http://en.wikipedia.org/wiki/Latent_semantic_indexing
  8. * http://web.ics.purdue.edu/~park283/wp-content/uploads/2010/10/Singular_Value_Decomposition_Tutorial.pdf
  9. http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
  10. http://www.miislita.com/information-retrieval-tutorial/singular-value-decomposition-fast-track-tutorial.pdf
  11. http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-1-understanding.html
  12. http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-2-computing-singular-values.html
  13. http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-3-full-svd.html
  14. * http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html
  15. * http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-5-lsi-keyword-research-co-occurrence.html (explains the significance of the U & V matrices)
  16. * http://www.dcs.shef.ac.uk/~genevieve/lsa_tutorial.htm (Thanks to Shashank Singh for this link)
SVD can be used to:
  1. Determine Document Similarity
  2. Determine User similarity in a competition (or in a rating-based system such as movie recommendations)
  3. Perform Image compression
  4. Perform Principal Component Analysis
  5. and much more...
You can use the following tools to compute the SVD of a matrix:
  1. SciLab
  2. NumPy (SciPy)

Saturday, December 25, 2010

Suggest more recipients (in a chat conference)

If you don't already know, Gmail has had a feature called Suggest more recipients for a while now.

From the link above, "Gmail will suggest people you might want to include based on the groups of people you email most often. So if you always email your mom, dad, and sister together, and you start composing a message to your mom and dad, Gmail will suggest adding your sister. Enter at least two recipients and any suggestions will show up like this."

I am planning on extending this even for the chat conversation start page in the chat client that I'm working on. So, if you start entering JIDs of people you want to chat with, the client should automatically suggest more participants.

Thanks to Sandy for providing the link to the Gmail blog!!

Update (02/01/2011): This feature is now implemented and part of the dev. versions.

Thursday, December 23, 2010

Debugging PHP's $_REQUEST

The $_REQUEST page in the PHP manual says that "$_REQUEST is an associative array that by default contains the contents of $_GET, $_POST and $_COOKIE. The variables in $_REQUEST are provided to the script via the GET, POST, and COOKIE input mechanisms and therefore could be modified by the remote user and cannot be trusted. The presence and order of variables listed in this array is defined according to the PHP variables_order configuration directive."

I was under the impression that $_REQUEST would only contain data from the $_GET and the $_POST variables. In my application, I was setting cookies that had the same key as one of the GET parameters that I was expecting. The script was hence getting data set by the cookie and not the GET parameter!! I went half mad trying to debug this and was suspecting some caching issue. Things got queerer when changing some of the GET parameters worked, but others refused to budge from their original value!! You can imagine the perplexing situation I was in. Thankfully, though, it dawned on me to check the PHP manual to see what the $_REQUEST was supposed to contain and I was able to fix my code to use $_GET.

Since then, I have come across this very useful page which mentions the exact problem I had and also says why one should avoid using $_REQUEST.

Shortest Slowest path

A few weeks ago, I was presented with a question: "You are given an n X n matrix. Each cell has a non-negative integer that denotes the cost of landing on that cell. Start from the top-left corner and compute the cost of the least-cost path to reach the bottom-right corner of the matrix such that the sum of all costs incurred during the traversal is the least. Additionally, you are allowed to move only down or right."

Of course, this is just a slight variation of Dijkstra's Shortest Path problem, but I was surprised at the first solution that came to mind. I was trying really hard to not use Dijkstra's shortest path algorithm to solve this problem. I landed up linearizing all the n² cells into n² vertices in a graph and applied the Floyd-Warshall algorithm on the resulting matrix (of side n²). Hence, the total complexity shot up to a whopping O(n⁶) (n being the side of the original matrix). If I had used a standard BFS or Dijkstra's algorithm, I would have been able to solve it with a runtime complexity of O(n² log n).
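
For reference, a minimal sketch of the grid Dijkstra route mentioned above (not the Floyd-Warshall composition I actually used) could look like this:

import heapq

def min_cost_path(grid):
    """Dijkstra on the n x n cost grid; moves are restricted to down and right,
    so each cell has at most 2 outgoing edges and the total cost is O(n^2 log n)."""
    n = len(grid)
    INF = float("inf")
    dist = [[INF] * n for _ in range(n)]
    dist[0][0] = grid[0][0]
    heap = [(grid[0][0], 0, 0)]                    # (cost so far, row, col)
    while heap:
        cost, r, c = heapq.heappop(heap)
        if (r, c) == (n - 1, n - 1):
            return cost
        if cost > dist[r][c]:
            continue                               # stale queue entry
        for nr, nc in ((r + 1, c), (r, c + 1)):    # down or right only
            if nr < n and nc < n and cost + grid[nr][nc] < dist[nr][nc]:
                dist[nr][nc] = cost + grid[nr][nc]
                heapq.heappush(heap, (dist[nr][nc], nr, nc))
    return dist[n - 1][n - 1]

print(min_cost_path([[1, 3, 1],
                     [1, 5, 1],
                     [4, 2, 1]]))                  # cheapest monotone path costs 7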

However, this was a very interesting exercise in functional programming since I was able to piece together a solution by chaining solutions to existing problems. The composition looks something like this:
Solution(matrix) = FloydWarshall(Linearize(matrix))

Dyslexia Friendly

I made my home page Dyslexia Friendly today. Click here to find out more.

Friday, December 17, 2010

Help on selected text

Today, I've added this new feature that I call "Help on selected text."

If you select any word or phrase on my blog, a small window will pop up that will ask you if you want more information on the selected phrase. If you click on the More Info link, you will be shown some more information if it is available. Pictures, if available, will also be shown. I am using Duck Duck Go's API to fetch this information. If nothing is found, an appropriate message will be displayed. If you select a block of text that is greater than 60 characters in length, it will be ignored and no prompt for more information will be shown.

Let me know if you find it useful or want to integrate it on your site!!

Update: Sandy suggested making the opacity of the box depend upon the distance of the mouse pointer from the info-box. Implemented this.

Update: A separate page documenting this widget, providing samples and examples has been put up here.

Update: My homepage: http://dhruvbird.com/ is now enabled with the zero-click plugin!!

Thursday, December 16, 2010

TDDB: What I would do differently if I were to re-implement it

Working on TDDB was a lot of fun (for all of us) and we learnt a lot!! Probably much more than what we initially thought. From not knowing anything about databases to implementing our own meant that we really hit the jackpot in terms of learning stuff and being hands-on. Watching theory and practice collide head-on is something we experienced then!!

If I were to work on TDDB from scratch all over again, these are the things I would do differently:
  1. Use HTTP for the transport layer. We implemented our own protocol for the transport. We got to learn a lot while doing so. However, since there's such a huge eco-system already built around HTTP, it would make sense to use it. Another advantage of using HTTP is that making clients in various languages (and even a client for the browser) is much less of an issue.
  2. Use sqlite/MySQL/postgreSQL/BerkeleyDB or some other underlying DB for the actual storage. We implemented our own disk format and transaction subsystem instead of using something already available. We learnt a HELLUVA LOT while doing so. However, if I were to do it again, I'd go with something I can just build on. I would need to implement the query parser, the optimizer, the execution engine and the lock manager though since for a distributed DB, these things would define the system.
  3. In general, use as many existing systems as possible and build on them rather than implementing your own. While the latter approach means that one can learn much more, it also means that some of the "interesting" things won't get as much time and thought as they should.
  4. Research the common use-cases more thoroughly. I think we just scratched the surface of use-cases when it came to distributed query processing and I think that there is a much wider audience for such a system. Exploring this in more detail is something I hope to do in the future.

Interesting problem (string based)

Find the longest prefix of a string that is a palindrome.
O(n²) is easy, but you need to come up with something that's O(n).

Hint: KMP
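
If you want the spoiler, here is a minimal sketch of one standard way to use the hint: build s + '#' + reverse(s) and read off the KMP failure function at the very end.

def longest_palindromic_prefix(s):
    """The failure value at the end of t = s + '#' + reverse(s) is the length of
    the longest prefix of s that equals a suffix of reverse(s), i.e. the longest
    prefix of s that reads the same backwards."""
    t = s + "#" + s[::-1]                 # '#' must not occur in s
    fail = [0] * len(t)                   # KMP failure (prefix) function
    for i in range(1, len(t)):
        k = fail[i - 1]
        while k > 0 and t[i] != t[k]:
            k = fail[k - 1]
        if t[i] == t[k]:
            k += 1
        fail[i] = k
    return s[:fail[-1]]

print(longest_palindromic_prefix("abacdef"))   # prints "aba"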

The psychological effects of a bad API

It makes everyone sick: your developers lose their creativity and end up grappling with the API rather than solving interesting problems and innovating.

Anyone can code, but only the fearless can be great

I personally think that every programming task needs to be approached from the perspective of being completely bug-free. When you write code, think that it is going to be deployed on a rocket or a space-craft and that the lives of a few hundred depend on the robustness of the code. Of course, you can cut yourself some slack by making practical simplifying assumptions about the specific task at hand, but I'd rather programmers started with the strictest requirements and then loosened them rather than the other way around (which seems to be happening more often these days).

Library v/s Framework

I sort of dislike frameworks (except for the ones that actually make your life easier). Libraries on the other hand are really great things!! Frameworks say: "This is all you are allowed to do and these are the ways you could do it. Additionally, I allow you the freedom of doing this insignificant operation in any way you see fit." On the other hand, libraries say: "If this is the input you feed me, this is the output you shall get. It's up to you to feed me the input and then you are free to do whatever you wish with the output I give you."

This change of mindset makes a world of difference while writing applications that do different things or just things differently from what the "framework" architect thought to be "the" way of doing things.

Django is a GOOD framework. It goes a long way to obviate all the mundane tasks, but at the same time provides a rich library-like interface to many convenience functions. IMHO, the Library way is the functional way where you can compose a complex operation from many small operations chained or composed together. A framework on the other hand reminds me of OO and something like the template method pattern. Now, I really like the template method pattern when it is used minimally and with care. However, using this pattern for everything you do is just brainless. In the same way, frameworks aren't the solution to all design related problems.

jQuery is an awesome library. It's been true to its slogan of "Write less, do more". It's literally packed with a lot of convenience functions that let you create really nice looking and functional UIs and web-applications without breaking a sweat. I think it gels in very well with the javascript and DOM eco-system. It doesn't try to be different and OOish (like Dojo) which is where I think it scores.

Frameworks cramp creativity whereas Libraries liberate the artist within the programmer. I hear a lot of people say "tell, don't ask", but I would say "ask and you shall be told."

Tuesday, November 16, 2010

SQL Query optimization - Part-7

(The you need referential integrity for more than referential integrity edition)

Why do we tag our blog posts with, well, tags? So that people can glean meaningful information from these terse blocks of text. A post tagged with vacation or tutorial says so much about the blog post. Similarly, hinting to the DB that a certain column has a similar looking entry in some other table means a lot to the execution planner. Yes, I'm talking about Foreign Key references.

If it finds such a hint, it can infer that:
  1. Every value in this column has some unique value in the column that it refers to.
  2. Since it is pointing to a UNIQUE column, joining on this column will match exactly 1 row from the table to which it refers. Hence, while NATURAL JOINing the 2 tables, the execution engine can stop when it finds one match.
  3. A Foreign Key constraint forces you to make the referenced column UNIQUE. This means that a WHERE clause on that column will never match more than one row and the optimizer can take advantage of that. If such a query occurs as a subquery, then the optimizer can evaluate the sub-query first and replace the sub-query with a single constant which is the result of the inner query's execution.

All these optimizations can significantly reduce your query's execution time.

Monday, November 15, 2010

SQL Query optimization - Part-6 (How NATURAL JOINs work)

(The devil lies in the details edition)

To show how JOINing in databases works, we'll take up a very simple example and make some simplifying assumptions (which are true) about databases & indexes so that we can reason clearly about certain concepts and yet still understand the key insights required for comprehending how NATURAL JOINs work.

Let's assume that we have 2 tables X & Y that we want to join on a column named 'c'.
SELECT * FROM X JOIN Y ON X.c=Y.c WHERE <condition>

For the sake of simplicity, we shall draw parallels between the tables X & Y and 2 in-memory arrays AX & AY. Also, let the sizes of the tables X & Y be sX & sY. We shall consider a few cases and try to analyze how JOINing would work in these scenarios:
  1. No index on 'c' in either table: The DB is forced to perform a O(sX.sY) operation by running code that looks like a nested loop. Hence, this type of join is called a nested loop join. This is the most expensive type of join and must be avoided if the tables X & Y are very large.
  2. B-Tree index on 'c' only in table X: (This case is symmetrical to a B-Tree index being present on 'c' only in table Y -- but not really). This case is akin to one of the arrays being sorted and the other being unsorted. Here, the execution engine can loop over the table that doesn't have an index (the unsorted array in our analogy) and for each element, lookup the corresponding value in the table with the index (sorted array lookup using binary search). The cost for joining is O(sX.log sY) or O(sY.log sX). This is why which table has the index is important and the 2 cases aren't exactly symmetrical. It is advisable to have an index on the larger table -- to keep the linear component in the complexity as low as possible.
  3. Hash index on 'c' only in table X: (This case is symmetrical to a Hash index being present on 'c' only in table Y -- but not really). Again, this case is similar to the one above and depending upon whether table X or table Y has an index, the JOIN complexity will be either O(sY) or O(sX) respectively. Again, it is advisable to have an index on the larger table.
  4. Hash index on 'c' in both tables: This case isn't much different from the previous case except for the fact that the execution planner can decide whether to iterate over table X or table Y depending upon which table has fewer rows.
  5. B-Tree index on 'c' in both tables: This is a very interesting and common case and is similar to the problem of trying to find the intersection of 2 sorted arrays AX & AY. It should be clear that the complexity of doing so is O(sX + sY). This bound is NOT tight and may be as low as Θ(min(sX, sY)). Hence, it should be clear that some of the join strategies above can (in theory) be faster than this one. However, the execution planner may not always choose to use both indexes if it has heuristic data that suggests that iterating over one of the tables would be much cheaper (say because it contains only 2 rows). This join is most commonly referred to as the Merge Join because it looks like you are merging 2 arrays (see the sketch after this list).
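
To make case 5 concrete, here is a minimal sketch of a merge join in Python, with plain sorted lists of (c, row) pairs standing in for the two B-Tree indexes:

def merge_join(ax, ay):
    """Merge join of two inputs sorted on the join column 'c'.
    ax and ay are lists of (c, row) pairs; both lists are walked once,
    so the cost is O(sX + sY) comparisons."""
    out, i, j = [], 0, 0
    while i < len(ax) and j < len(ay):
        if ax[i][0] < ay[j][0]:
            i += 1
        elif ax[i][0] > ay[j][0]:
            j += 1
        else:
            c, i0, j0 = ax[i][0], i, j
            while i < len(ax) and ax[i][0] == c:
                i += 1
            while j < len(ay) and ay[j][0] == c:
                j += 1
            # emit the cross product of the two runs of equal join keys
            for _, xrow in ax[i0:i]:
                for _, yrow in ay[j0:j]:
                    out.append((xrow, yrow))
    return out

print(merge_join([(1, 'x1'), (2, 'x2'), (4, 'x3')],
                 [(2, 'y1'), (2, 'y2'), (3, 'y3'), (4, 'y4')]))
# [('x2', 'y1'), ('x2', 'y2'), ('x3', 'y4')]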

SQL Query optimization - Part-5 (Motivation)

(The better late than never edition)

I'm sure many of you have worked on web-apps (you know, stuff like wikipedia, gmail, facebook, codechef, reddit, digg, duckduckgo or even your personal web-site which has some dynamic/interactive content). I'm sure that out of these, many of you have worked on sites that get traffic to the tune of a few hundred visitors per second. I'm sure you know how painstaking it is to run such a site and keep it up, running and responsive for all its users. Lots of things like the web-server, DB, file system, OS, etc... need to be tweaked, tuned and retro-fitted to suit the needs of the things around them.

The one place (in my experience) that can easily introduce bottlenecks is the DB (database). That's because it is something that is a real work-horse and is programmed by you (the application writer), who may not know what food it likes. It's very easy to write a query that runs 10x slower than what it really should be running and not notice this for a very long time.

Imagine the time you get slashdotted and need to support 5000 requests/second instead of the customary 50. Don't you wish your DB wouldn't just die due to the load?? Well, if you are one such person who is pedantic about the performance of your database, you are thinking right!

You will mostly learn how to optimize your DB through experimentation, but since others have been burned before, you might as well stay and take advantage of their experiences.

SQL Query optimization - Part-4 (Myths)

  1. If I have an index on every column of my table, all my queries will run fast. This is the equivalent of saying that adding all the good ingredients in a bowl and mixing away to glory will result in the world's best dish. Doesn't happen. The sooner you snap out of it, the better it is for you.
  2. Joins are expensive. Everything should be completely denormalized. This isn't an SQL optimization myth, just a SQL myth in general. Loosely speaking though, if your joins are all natural joins and the join columns are mostly unique and all your filter conditions are connected with a conjunction (AND clause) and you don't go crazy with GROUP BY, ORDER BY and HAVING, you will have made yourself a pretty fast query. The key to remember is to try and (anyhow) maximize the ratio of Rows returned to Rows visited by the DB engine to produce the result set.
  3. If I have an index on a column that I want to ORDER BY then that index will be used and the DB won't use a temporary table to sort the result set. Not really. If the JOIN is not on the first column that you want to ORDER BY, then the DB will most likely use a temporary table and sort. Having said that, even if you join on a column that you will later ORDER BY, the execution planner might decide to not order intermediate results by that column because it thinks that that would be the most efficient query execution plan. This can happen when exactly one table in the JOIN doesn't have an index on the join column and that will be the column that the DB will decide to iterate over.
  4. Streaming result sets. I really don't know how many DBs actually do this, but it is a killer feature to have. Of course, you need to craft your queries very smartly to take advantage of this.

If you've gotten this far and don't feel lost, you are into this shit big time, and it's time to read some stuff by the big boys (here and here).

Sunday, November 14, 2010

SQL Query optimization - Part-3

Now for some trivial stuff:
  1. Make sure that every column you query on has an (appropriate) INDEX. Creating a full-text index on a column that you will only do exact matching on won't help. Neither will creating a HASH index on a column on which you intend doing < and > comparisons
  2. Avoid complex mathematical expressions in your WHERE clauses if you want indexes to be used. Databases are pretty dumb that way, and practice and theory are still some way apart in this area (which is why I want to bridge this gap!!). Don't do stuff like: WHERE 2*col1+3 < col2. This will probably NOT be optimized by the execution engine. Of course, if there is no other way out, then you must do it. I suggest you create another column (col3) which always gets the value 2*col1+3, and query using col3 whenever you need this expression
  3. (This is an ugly condition and probably won't be optimized). Suppose you have 2 columns (col1 & col2) on the same table and want to select all rows WHERE col1 < col2. This will most likely result in a full table scan even if all columns have an index on them. Try to create a new column to flag this condition and select on that column if you need speed. Update: I can definitely see a way in which databases could use an index on (col1, col2) or (col2, col1) (if one exists), but you'll have to check with your particular database if that is the case (by doing an EXPLAIN or its equivalent).

I just noticed

This has been my most active year of blogging. 36 posts including this one till now. (see the bar on the right). Which probably means that I haven't been working as much this year as compared to the last few years.

SQL Query optimization - Part-2

Today we'll talk about whether to use JOINs or nested queries. TDDB (my db engine :D) doesn't support nested queries so I don't talk about them much, but that's just selfish of me (actually even MySQL didn't support them for a long time), but I've grown out of that, so here it goes. Nested queries can actually speed up your query by a HUGE amount.

Consider you have say 6 tables which you are joining and only 2 of them are actually involved in any logical reasoning like generating unique rows. The rest of the tables are vestiges of an over-enthusiastic DB architect or a graduate fresh out of college who thinks that everything should be in BCNF (been there; done that).

Now the other 4 tables are being used just to provide data for the purpose of projection and really have no reason being part of the JOIN. Hence, we refactor the query and create a temporary relation from the 2 tables that actually mean something. Most real queries are interested in the top 10 or 20 results only (ordered by some criteria) so we do whatever we want to on the join of these 2 tables.

This temporary relation is then joined with the other 4 relations and voila! This recipe serves 100 queries for the cost of 1 ;)

This is possible since all the extraneous tables have been avoided in the join that involved ordering, grouping and potentially many other expensive operations.

This can be optimized even more by using a materialized view for the inner query. However, that will increase the insertion time for those tables, and should be used with care. Besides, I don't think all DB engines support materialized views.

SQL Query optimization - Part-1

I'm planning to write some random posts on query optimization. Here's the first one. I'll get straight to the point 'cause I don't like beating about the bush (well not usually), so here it goes.

This post will mainly talk about optimizing queries that use the LIKE keyword.
  1. Try to avoid it if you can. Databases are generally bad at optimizing LIKE queries no matter how good they may be. More often than not, a LIKE query will result in a full-table-scan
  2. If your LIKE query looks something like this: colname LIKE 'something%' then there is some good news for you. You can convert this to colname >= 'something' AND colname <= 'somethingz' (I'm assuming that colname contains only ASCII text. If it contains unicode, use the last UNICODE character in place of 'z')
  3. No matter how hard you pray, colname LIKE '%something' will NOT be optimized by your favourite DB engine. You will HAVE to reverse the contents of colname, store them in colname_reversed and query on that as colname_reversed LIKE 'gnihtemos%'
  4. colname LIKE '%something%' is just asking for trouble
  5. And so is stuff like LOWER(colname) = 'something'. This is because anything that transforms the column data can NOT use the index since the index indexes the ORIGINAL value, not the transformed value
  6. If you are using MySQL, the LIKE operator is case-insensitive so you can always replace all statements that look like LOWER(colname) = 'something' with colname LIKE 'something' and it will always work. Silly but useful
  7. To get a more detailed understanding of how indexes work, read up on B-Trees

Undecidable

Many times have I come across concepts that mimic the concept of an Undecidable problem. So, for all you people who don't know what an Undecidable (or partially decidable -- please pardon the terminology) problem is, you can check out the wiki page here or read the layman's definition below:

If a solution to the problem has a "YES" answer then it WILL be found, but if it has a "NO" answer, you'll never know, because you'll keep running the algorithm in the hope of finding the "YES" answer.

Now, it turns out that in the last few months I've come across many instances of such "undecidable" problems. I'll list out a few here:
  1. If you have good marks on a test, you are a good student, but if you don't then it doesn't mean that you are a bad student
  2. If you have lots of fans on your facebook page, then you are famous but if you don't, it doesn't necessarily mean that you aren't

So, the next time I have an argument/discussion with anyone and I mention "this is undecidable", I'll try to give them a link to this post as well.

Friday, October 29, 2010

Why Mongrel2 Rocks

I used Mongrel2 for my latest project lib-face and it was actually quite a pleasure to work with. Without wasting any more of your time, here are the reasons:
  1. It's fast
  2. It's simple to install and use. Though getting all dependencies is a bit time consuming
  3. It's very well documented
  4. The documentation isn't outdated
  5. I mailed the author asking for some clarifications and got a reply pretty fast. This means that if I get stuck using it, there is someone out there to help me till a community develops
  6. It gives out meaningful error messages that actually help
  7. Writing async handling code is very very simple because you need to explicitly call a function to return. This means that you can hold on to a request for as long as you want
  8. Mongrel2 uses ZeroMQ as the transport. This means that I can potentially implement some other (non-HTTP) protocol and have my code handle requests for that protocol since it is now only talking to ZeroMQ
  9. It's simple yet powerful
  10. It supports a lot of languages -- remember it's language agnostic!!

Wednesday, September 15, 2010

A very fast approach to auto complete (or search suggestions)

You've seen search engines suggest queries when you begin typing the first few letters of your search string. This is being done by Duck Duck Go as well as Google (to name a few). This is typically done by maintaining a list of past queries and/or important strings that the search engine thinks are worthy of being suggested to a user that is trying to find something similar. These suggestions are effective only if the search engine spits them out very fast since these should show up on the screen before the user has finished typing what he/she wanted to type. Hence the speed with which these suggestions are made is very critical to the usefulness of this feature.

Let us consider a situation (and a possible way of approaching this problem) in which when a user enters the first few letters of a search query, he/she is presented with some suggestions that have as their prefix, the string that the user has typed. Furthermore, these suggestions should be ordered by some score that is associated with each such suggestion.

Approach-1:

Our first attempt at solving this would probably involve keeping the initial list of suggestions sorted in lexicographic order so that a simple binary search can give us the 2 ends of the list of strings that serve as candidate suggestions. These are all the strings that have the user's search query as a prefix. We now need to sort all these candidates by their associated score in non-increasing order and return the first 6 (say). We will always return a very small subset (say 6) of the candidates because it is not feasible to show all candidates since the user's screen is of bounded size and we don't want to overload the user with too many options. The user will get better suggestions as he/she types in more letters into the query input box.

We immediately notice that if the candidate list (for small query prefixes, say of length 3) is large (a few thousand), then we will be spending a lot of time sorting these candidates by their associated score. The cost of sorting is O(n log n) since the candidate list may be as large as the original list in the worst case. Hence, this is the total cost of the approach. Apache's solr uses this approach. Even if we keep the scores bound within a certain range and use bucket sort, the cost is still going to be O(n). We should definitely try to do better than this.
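
A minimal sketch of this approach, assuming the suggestions live in a sorted Python list with a parallel list of scores:

import bisect

def suggest_naive(phrases, scores, prefix, k=6):
    """Binary search for the candidate range, then sort the candidates by score."""
    lo = bisect.bisect_left(phrases, prefix)
    hi = bisect.bisect_left(phrases, prefix + "\uffff")   # assumes '\uffff' never appears in the data
    best = sorted(range(lo, hi), key=lambda i: scores[i], reverse=True)[:k]
    return [phrases[i] for i in best]

phrases = ["bat", "batman", "battery", "battle", "cat"]
scores  = [10, 80, 30, 55, 5]
print(suggest_naive(phrases, scores, "bat"))   # ['batman', 'battle', 'battery', 'bat']

The sort over the candidate range is exactly the O(n log n) step we want to eliminate.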


Approach-2:

One way of speeding things up is to use a Trie and store (pointers or references to) the top 6 suggestions at or below that node in the node itself. This idea is mentioned here. This results in O(m) query time, where m is the length of the prefix (or user's search query).

However, this results in too much wasted space because:
  1. Tries are wasteful of space and
  2. You need to store (pointers or references to) 6 suggestions at each node which results in a lot of redundancy of data

We can mitigate (1) by using Radix (or Patricia) Trees instead of Tries.


Approach-3:

There are also other approaches to auto-completion, such as prefix expansion, that are used by systems such as redis. However, these approaches use up memory proportional to the square of the size of each suggestion (string). The easy way to get around this is to store all the suggestions as one linear string (buffer) and represent each suggestion as an (offset, length) pair into that buffer (a small packing sketch follows the example below). For example, suppose you have the strings:
  1. hello world
  2. hell breaks lose
  3. whiteboard
  4. blackboard

Then your buffer would look like this:
hello worldhell breaks losewhiteboardblackboard
And the 4 strings are actually represented as:
(0,11), (11,16), (27,10), (37,10)

Similarly, each prefix of the suggestion "whiteboard" is:
  1. w => (27,1)
  2. wh => (27,2)
  3. whi => (27,3)
  4. whit => (27,4)
  5. white => (27,5)
  6. whiteb => (27,6)
  7. and so on...
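
A tiny sketch of this packing (the pack helper is made up for illustration):

def pack(strings):
    """Concatenate all suggestions into one buffer and return (offset, length) pairs."""
    buf, pairs, pos = [], [], 0
    for s in strings:
        buf.append(s)
        pairs.append((pos, len(s)))
        pos += len(s)
    return "".join(buf), pairs

buf, pairs = pack(["hello world", "hell breaks lose", "whiteboard", "blackboard"])
print(pairs)        # [(0, 11), (11, 16), (27, 10), (37, 10)]
# A length-l prefix of suggestion i is just (pairs[i][0], l): no new string is stored.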

Approach-4:

We can do better by using Segment (or Interval) Trees. The idea is to keep the suggestions sorted (as in approach-1), but have an additional data structure called a Segment Tree which allows us to perform range searches very quickly. You can query a range of elements in Segment Tree very efficiently. Typically queries such as min, max, sum, etc... on ranges in a segment tree can be answered in O(log n) where n is the number of leaf nodes in the Segment Tree. So, once we have the 2 ends of the candidate list, we perform a range search to get the element with the highest score in the list of candidates. Once we get this node, we insert this range (with the maximum score in that range as the key) into the priority queue. The top element in the queue is popped and split at the location where the element with the highest score occurs and the scores for the 2 resulting ranges are computed and pushed back into the priority queue. This continues till we have popped 6 elements from the priority queue. It is easy to see that we will have never considered more than 2k ranges (here k = 6).

Hence, the complexity of the whole process is the sum of:
  1. The complexity for the range calculation: O(log n) (omitting prefix match cost) and
  2. The complexity for a range search on a Segment Tree performed 2k times: O(2k log n) (since the candidate list can be at most 'n' in length)

Giving a total complexity of:
O(log n) + O(2k log n)
which is: O(k log n)
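
Here is a minimal, self-contained sketch of this scheme (an illustrative toy, not the code of the actual server mentioned in the update below): a max segment tree over the scores plus a priority queue of ranges that get split around each winner.

import bisect
import heapq

class MaxSegmentTree(object):
    """Iterative max segment tree; query(lo, hi) returns the best (score, index)
    over the half-open range [lo, hi)."""
    def __init__(self, scores):
        n = self.n = len(scores)
        self.t = [(float("-inf"), -1)] * (2 * n)
        for i, s in enumerate(scores):
            self.t[n + i] = (s, i)
        for i in range(n - 1, 0, -1):
            self.t[i] = max(self.t[2 * i], self.t[2 * i + 1])

    def query(self, lo, hi):
        best = (float("-inf"), -1)
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                best = max(best, self.t[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                best = max(best, self.t[hi])
            lo //= 2
            hi //= 2
        return best

def top_k(phrases, scores, tree, prefix, k=6):
    """phrases is sorted lexicographically, scores[i] belongs to phrases[i], and
    tree = MaxSegmentTree(scores) is built once up front."""
    lo = bisect.bisect_left(phrases, prefix)
    hi = bisect.bisect_left(phrases, prefix + "\uffff")   # assumes '\uffff' never appears
    out, heap = [], []
    if lo < hi:
        score, idx = tree.query(lo, hi)
        heapq.heappush(heap, (-score, idx, lo, hi))
    while heap and len(out) < k:
        _, idx, l, h = heapq.heappop(heap)
        out.append(phrases[idx])
        for a, b in ((l, idx), (idx + 1, h)):              # split the range around the winner
            if a < b:
                s, i = tree.query(a, b)
                heapq.heappush(heap, (-s, i, a, b))
    return out

phrases = ["bat", "batman", "battery", "battle", "cat"]
scores  = [10, 80, 30, 55, 5]
print(top_k(phrases, scores, MaxSegmentTree(scores), "bat", k=2))   # ['batman', 'battle']

Each suggestion returned costs at most two O(log n) range queries, which is where the O(k log n) bound above comes from.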

Update (29th October, 2010): I have implemented the approach described above in the form of a stand-alone auto-complete server using Python and Mongrel2. You can download it from here. (11th February, 2012): lib-face is now called cpp-libface and has a new home at github!

Some links I found online:
LingPipe - Text Processing and Computational Linguistics Tool Kit
What is the best open source solution for implementing fast auto-complete?
Suggest Trees (SourceForge): Though these look promising, treaps in practice can degenerate into a linear list.
Autocomplete Data Structures
Why is the autocomplete for Quora so fast?
Autocomplete at Wikipedia
How to implement a simple auto-complete functionality? (Stack Overflow)
Auto Compete Server (google code)
Three Autocomplete implementations compared
Trie Data Structure for Autocomplete?
What is the best autocomplete/suggest algorithm,datastructure [C++/C] (Stack Overflow)
Efficient auto-complete with a ternary search tree

Wednesday, August 18, 2010

What you've always wanted to know about GRE preparation

When I was preparing for the General GRE, it was a completely mad rush, with me having to ask anyone and everyone for help and advice. So, the conglomeration of all that information is what guided me throughout. Today, I am going to put that all down (combined with my own experiences) so that others don't have to grapple with the same questions and uncertainties and can really concentrate on studying for the exam.
  1. Don't fall into the trap of not booking a date!! You will find it hard to study if you haven't booked a date
  2. Book a date sufficiently into the future (say 2-3 months) and work towards it as if the date can not be postponed. I've seen many people postpone the date just because they can. Don't fall into that trap
  3. I used the Barron's book for everything
  4. Don't take the quant section lightly. Even though it is easy, it is time critical and there are many trick questions
  5. Even though the word list will seem daunting at first, keep at it and try to do as much of it as you can. I must have landed up knowing close to 3000 (out of the 3500) words in the Barron's word list
  6. Attack the High Frequency Words first (about 350). Make sure you use the list from both Barron and Kaplan. This file contains all such words and words that I found hard. Use the Barron's and Kaplan's list from here and make your own list of "hard" words (everyone will find a different set hard)
  7. Then do the words list-by-list. There are about 50 word lists in the Barron's book. Each word list can take you anywhere from 1 to 3 hours to do if you are doing it the first time. Towards the end, you should be able to do all the word lists in 2-3 days (assuming 5-6 working hours each day)
  8. Solve as many exercise problems in the Barron's book as you can in the quant section
  9. For quant practice, use The Nova GRE prep. book. The quant in this book is easier than what you will get in the GRE, so use this as your first round of preparation
  10. Take the 5 sample tests given at the end of the Barron's book. They are very close to the actual GRE test in terms of difficulty
  11. The Kaplan tests too are of GRE difficulty level (I would say a tiny bit more difficult, but that shouldn't be something you should factor in). The 6 practice exercises in Kaplan are way too easy (for quant) and the 30 min you get to solve them is sufficient. The quant in the 3 sample tests is harder and is of GRE difficulty level
  12. POWERPREP (the official GRE prep. software) is something you must do before the test. They give you 2 sample tests. Take them 1-3 days before the real test when you are sudo prepared for the test. These will give you a great indicator of what score you might land up getting. If you do badly in some section, try to spruce up your act in that section. The drawback of taking it so late (just before the test) is that you don't give yourself enough time to rectify your weak areas. However, I wouldn't want anyone to "waste" these tests before they are fully prepared for the simple reason that these tests are really good indicators of how much you might land up getting
  13. None of the other tests I took were of the GRE difficulty level, so make sure that you take all of the tests above before you take the GRE
  14. On the day of the test, make sure you arrive at the test center on time and have all the required documents
  15. That's about it and all the best!!

Wednesday, August 11, 2010

How things go wrong (reprise)

About 3 years ago, I wrote a post on How things go wrong. I should have made everyone I know read it since we just ran into an issue that bit us in the ass... Yes, it was caused as a result of:
  • Lack of return value checking from system calls and
  • Lack of sufficient error checking/handling code
and further complicated by a lack of logging in the application.

I can't begin to say how important each of them is. Oh well, I only have 2 things to say:
  • You need to learn for yourself
  • You can never be too careful
  • You can't learn from other's mistakes (yes, this and the 1st one are the same thing)

Thursday, August 05, 2010

Using Titanium Desktop to test web sites

I have been using Titanium Desktop for a while now at work, and am not at all impressed with the implementation or stability of the product and wouldn't want to see it being used in any serious project. HOWEVER, it is a brilliant concept which if done well can be humongously successful.

I was faced with the daunting task of testing a web-site written in Drupal the other day. Now, I don't know Drupal, but have been told that it's got these magic form elements which it uses for validation. That, coupled with cookies, meant that I would have to do a lot of work to test even slightly complex flows using python/perl/ruby::Mechanize. Apart from that, I could NEVER simulate the javascript and other asset loads that a normal user would exercise. For complete end-to-end testing, I couldn't find anything that I would like using.

Here is where Titanium comes in. It doesn't have the sandboxing rules that normal browsers have, so you can load pretty much any page in a Titanium IFrame and do pretty much anything you want with it!!!!

I have given here an example of using Titanium to search for some text on Duck Duck Go.

To run the example, start Titanium Desktop as kboot --query "Your Search String"

Enjoy!!

Update: Here is a video showing how the automation works.

Saturday, July 03, 2010

Aaloo, mutter and tamater sabzi

This is the recipe for my current favourite vegetable. Translated to english, it means Potato, peas and tomatoes. (makes sabzi to serve 4). Dedicated to my mom since she makes this dish in the most awesome way I have ever tasted it to be and to sandy since he is going to learn it this time ;)



Ingredients:

Medium sized aaloo(potatoes); boiled: about 3 nos
Fresh green peas: about 100g
Medium sized tomatoes: about 4 nos.
Clarified Butter (Ghee): 1 Tbsp
Hing (asafoetida): 1/2 tsp
Cumin seeds (jeera): 1 1/2 tsp
Turmeric (haldi): 1/4 tsp
Salt (namak): to taste
Red chilli powder (this is mainly used for the red color): 1/4 tsp (or to taste)
Fresh green chilli (cut and seeded): 1/2 nos.
Dhania and jeera powder: 1 tsp
Fresh coriander leaves: (to garnish)




Procedure:
  1. You need not skin the boiled potatoes. Just cut them into about 1/2" X 1/2" cubes
  2. Heat the ghee and when it is hot, add the hing, jeera and green chilli to it. When the jeera starts popping, add water to it
  3. Add peas to the water, and boil them till they are soft
  4. Now, cut the tomatoes (unskinned) into small pieces and add it to the boiling water and peas. Keep boiling till the tomatoes have softened
  5. Now, add the cut potatoes into the mixture and add red chilli powder, dhania and jeera powder, turmeric and salt
  6. Boil for about 2 more minutes till the potatoes have soaked the gravy
  7. Garnish with fresh coriander leaves and cover or serve

Monday, June 28, 2010

Actions speak...

... louder than words

Thursday, June 24, 2010

Guess what?

            _________
            |*      |
            |       |
    (------------   |-------)
    (               |       )
    (       |--------       )
    (       |               )
     -------|   ------------
            |       |
            |      *|
            ---------

Or this (if you are having a problem with the one above ;))
            _________
            I *     |
            I       |
    (------------   |-------)
    (               |       )
    (       |--------       )
    (       |               )
     -------|   ------------
            |       I
            |       I
            |_____*_I


+5 to the first one who guesses correctly

Murphy's Law (for Python programmers)

Download it here
Please rename the file to murphy.py and from your interpreter:
import murphy

Wednesday, June 23, 2010

YASNOSC

Yet another SQL and NoSQL comparison
  1. Scalability != NoSQL

  2. Performance != NoSQL [not always]

  3. Scalability == Good schema design

  4. Performance == Good schema design

But...
  1. Performance != Scalability

  2. Scalability != Performance

(but we should also note that NoSQL isn't "No SQL", but "Not Only SQL")

Tuesday, June 22, 2010

A new water pump for our washing machine

(This blog post has the potential to be discursive and go in no particular direction, so I'll try my best to remain as unwavering as I can)

It all started off when our washing machine konked off and decided to stop working. My grandma sent it off to the repairers who fixed it and then had the brilliant idea to suggest attaching a water pump so that water from the tank above would flow in much faster. Now, anyone who has played with a syringe would know that no matter how hard you push (or pull), you just can NOT for the love of anything (or anyone) get more than a certain amount of liquid through the syringe (or in this case the pipe).

Now, our pipes are really old and are rusting from the inside. This is the reason that the flow of water has gone down as compared to the time when they were first installed. Even though I suggested getting the pipes changed for the above mentioned reason (since that is the real problem), my grandma would hear nothing from me... Oh well... so be it...

The brilliant engineer's suggestion came as a massive shock to me and it almost felt funny. Anyways, the pump has been installed and guess what... The flow of water in the pipe is still the same!! I wonder how that happened. Isn't the pump working (scratch head, etc...)

The rate of flow of water in a pipe depends upon many things, one of them being the inner diameter of the pipe and another being the friction offered by the sides of the pipe. I don't know the exact formula, but I am guessing it is easy to come by. The pressure exerted by the water in the tank is more than sufficient to max out the flow rate in the pipe, and I would have guessed those engineers to know that. I guess it's back to the drawing board for them ;)

Friday, June 11, 2010

How to write bug-free code

Listen hard and listen fast. Listen once; I won't repeat myself...

While writing code, if you ever say to yourself:
  • Oh! but this will hardly ever happen OR

  • The user will never do this OR

  • What are the odds of this happening??

Please please please go ahead and write some code and comments to handle that situation. You can't ever be too careful.

Signing off,
Captain S.K.Eptic

Tuesday, June 01, 2010

Python generators and the dict.items() function

During the course of yesterday's session, there was an example of iterating over a dictionary (dict) in Python. The code looks something like this:

d = {
    "name": "chuck norris",
    "age": "positive infinity",
}
for k,v in d.items():
    print "Key: " + str(k) + ", Value: " + str(v)

The d.items() function returns a list which will take up a non-trivial amount of memory for large dicts that we wish to iterate over. For simplicity's sake, if we assume that each pointer takes up 8 bytes and the Python tuple type takes up 32 bytes (say) and there are 1 million (10⁶) entries in the dict (d), then we land up with 10⁶ x 56 bytes (two 8-byte pointers, one each to the key and the value objects, 8 bytes for a pointer to the tuple and 32 bytes for the tuple object itself), which is about 56MB of memory used just for iterating over the dictionary.

If, however, we use generators, then we can save all that memory.
One thing d.items() does give us is that you can directly print its result, whereas printing a generator object doesn't do anything spectacular, so we will need to create a proxy object that prints the result of generation if string coercion is requested. The code for doing so is shown below. Notice that we use the .keys() function within our custom generator. We wouldn't need to use it if we really were the dict object and had handles to the internal variables.

d = {
    "name": "chuck norris",
    "age": "positive infinity",
}

def dict_items(d):
    """
    Returns an object which can be iterated over in a for loop
    as well as printed. This is achieved by returning an object
    which is both iterable as well as convertible to a string.
    However, iteration involves using a generator for fetching
    the next value in the sequence
    """
    def dict_generator():
        dkeys = d.keys()
        for k in dkeys:
            yield (k, d[k])
    dg = dict_generator()

    def dgen_to_str():
        return str(d.items())

    class Dummy(object):
        def __getattr__(self, attr):
            return getattr(dg, attr)
        def __str__(self):
            return dgen_to_str()
        def __iter__(self):
            return dg

    proxy = Dummy()
    return proxy

diter = dict_items(d)
print diter

for k,v in diter:
    print "Key: " + str(k) + ", Value: " + str(v)


However, the major difference comes when you try to do something like this:

for k,v in d.items():
    d.clear()
    print "Key: " + str(k) + ", Value: " + str(v)

The code above will work as expected and print all the elements of the dict(d)


for k,v in diter:
    d.clear()
    print "Key: " + str(k) + ", Value: " + str(v)

However, the code above will throw an exception since our generator caches all the keys and dereferences the dict(d) to get the value for the corresponding key.

So, applications that mutate the dict while iterating over it won't quite work as expected.

Python Interest Group(PIG) Act-1; Scene-1

We had the first session of the Python interest group (affectionately called PIG) at Directi yesterday (1st June, 2010).

Here are the links to the presentation in:

You will need python 2.5 to run the sample programs.
The documentation for python 2.5 is available here

Happy Hacking!!

5 months of salad... and counting...

I started having salad for lunch daily sometime early January, 2010. Since then, it's been fairly smooth sailing and I've added a few ingredients to the mix.

A lot of people have asked me if I got bored of the monotony of salad, but I highly disagree with them all. The deal with salad is that you get so many flavours in every bite that it's really hard to get bored of it. Besides, it's so juicy and crunchy that you enjoy having it -- really!!

Either ways, I have added a few extra ingredients over the months, so would like to share those with the very few and faithful readers of this blog.
  • Hazelnuts (halves)

  • Cashews (halves)

  • Pomegranates (in season)

  • Dates

  • Dried Anjeer (figs)

  • I've pretty much knocked off the mixed herbs in the dressing and just stick to Olive Oil and Honey

  • The baby corn is fresh baby corn, which needs to be stripped of its covering, like you would for normal corn. The baby corn which results is really so sweet that you can eat it as-is!!

Enjoy your salad people!!

Monday, May 31, 2010

Can't come to terms with inheritance

I've been racking my brains over inheritance for a while now, but am still not completely able to get around it.

For example, the other day I was thinking about relating an Infallible Human and a Fallible Human. Let's first define the two:
  • Infallible Human: A human that can never make a mistake. Its do_task() method will never throw an exception

  • Fallible Human: A human that will occasionally make a mistake. Its do_task() method may occasionally throw an ErrorProcessingRequest exception

The question was:
IS an infallible human A fallible human OR IS a fallible human AN infallible human?

The very nice answer I received was in the form of a question (I love these since it gives me rules to answer future questions I may have).

"Can you pass an infallible human where a fallible human is expected OR can you pass a fallible human where an infallible human is expected?"

It seems apparent that you can pass an infallible human where a fallible human is expected, but not the other way around. I guess that answered my question.
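
A minimal sketch of that rule in code (the class and exception names follow the example above; the contract-checking caller is just illustrative):

class ErrorProcessingRequest(Exception):
    pass

class FallibleHuman(object):
    """May occasionally fail, so callers must be prepared for the exception."""
    def do_task(self):
        raise ErrorProcessingRequest("oops")

class InfallibleHuman(FallibleHuman):
    """Never fails -- a strictly stronger guarantee than its base class makes."""
    def do_task(self):
        return "done"

def run(human):
    # Written against FallibleHuman's contract, so it guards against failure;
    # an InfallibleHuman simply never triggers the except branch.
    try:
        return human.do_task()
    except ErrorProcessingRequest:
        return "recovered"

print(run(FallibleHuman()))      # "recovered"
print(run(InfallibleHuman()))    # "done"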

However, it still feels funny saying "An infallible human is a fallible human". Does anyone else feel queasy when they say it? It almost feels as if speaking out inheritance trees is like reading out statements from propositional calculus in plain English (the if/then implication connectives don't mean the same as that in spoken English). Does anyone else feel the same?

update: This thread on stackoverflow discusses the same issue.

Saturday, May 29, 2010

A few rules to live by

  • you can only lie to others, never to yourself
  • don't hurt others -- it will only come back to you
  • you can justify anything -- we all can -- but don't try too hard. you might end up lying
  • python rocks
  • java bites shite
  • java has resulted in more deaths and cases of insanity than road accidents world-wide
  • the average lifespan of a java programmer is 10 years less than that of a python programmer and 7 years less than the average for a human
  • what can kill chuck norris -- programming in java
  • listen to music you like -- don't pretend
  • eat food you like -- don't pretend
  • wear what you like (if you want to wear anything that is) -- don't pretend
  • criticize -- don't be diplomatic when you don't need to
  • praise -- it spreads good vibes -- but don't praise falsely -- it doesn't help at all
  • music will set you free ;)
  • do what you want to, not what you have to or need to -- sometimes you may "need/have" to do the things though; but don't stretch it too much
  • listen to "sunscreen" if you haven't yet
  • don't recommend it to others because it's "cool", but because you think it's worth doing
  • don't do anything "just" because it's "cool"
  • don't not do anything just because it's not cool
  • the arts are purgative and cathartic; write, act, sing, cook, code... anything is art and art is everything...
  • you may need to twist your finger to get the fat out (translated from a Hindi proverb); but do it only if it's absolutely necessary
  • talk is cheap; actions are not
  • neither is code
  • show me the code you piece of shite

Saturday, May 22, 2010

We aren't responsible

I was reading the paper and The Mumbai Mirror's headlines a few days ago went something like this "Criminal Negligence Leaves WR Commuter Critical". The article was with reference to a commuter who had been injured by a plank that fell on him because of negligence on part of the workers working at Churchgate station. This injury unfortunately has left him comatose and he is currently in a critical state in hospital. I wish him a speedy recovery.

Coming from a software centric world where EULAs and licenses accompany any piece of code that is shipped, it seems only natural that no one is really responsible for anything that they do and that any action on their behalf may have negative influences on others. Others need to watch out for anything that may hurt or otherwise injure them. It is the user's and only the user's sole responsibility to ensure his/her fitness and propriety, and the user must account for anything that may go wrong during the course of using the product/software mentioned.

I have yet to see any license that says that they take sole responsibility for their product and that they could be held responsible for any damage arising due to negligence on their behalf. The closest I have come to it is Knuth's exponentially increasing reward for finding bugs in TeX. As far as free software is concerned, we shouldn't expect it. However in case we are paying for the software, shouldn't we demand it? I mean we are literally shelling out hard cash for that piece of code. We seem to have a right to want it to be bug-free...

Either way, it seems no one wants to take responsibility for their code, so why not make it a de-facto standard? Instead, if someone out there is willing to say that they will take responsibility for screw ups (if any) then they should include it in the license agreement.

It starts off with software saying that it may not work well and ends up at coffee cups and hot dogs saying that their contents may be too hot for consumption.

What next? From what I see, there may now be notices on every pavement saying that people should walk on it at their own risk. Each lift will say that it could crash any minute and that if it does, even due to a manufacturing defect or installation glitch, the company providing it should not be held responsible. Airplanes and trains will warn travelers before they enter saying "Please account for the fact that you may not make it if you step in". Books may say "Read at your own peril. Your eyes may get gouged out while reading this book, but don't blame us!!". When will all this end???

Saturday, May 15, 2010

Events: How to use them to your advantage

A very interesting conversation between a colleague and me resulted in some equally interesting learnings as far as I was concerned.

The topic of discussion was whether certain functionality in a software system should be implemented via events or via function calls made directly to the concerned component. More concretely, should the coupling be via events as in a pubsub based system, or should it be via known function end-points on specific objects?

If the event based approach is used, there is loose coupling, and the coupling is on events. Using the other approach, there is strong coupling and one component needs to know about the interfaces of the other component.

Let's take a toy example to analyze this situation.

Consider an interior stylist that is developing an integrated home theatre system. Say there are multiple components involved:
1. Speaker system
2. Television
3. CD player
4. Satellite system

For all practical purposes, users should not need to turn on each of the separate components separately (as they do today). They need not even know what the individual constituents of the system are. For that purpose, having a separate button that turns on the Television, one that turns on the speakers and one that controls the CD player seems overkill. Just the press of a single "start CD" button should do the needful.

Let's discuss various ways in which we can achieve the desired effect:
  1. Strongly couple all the components together so that starting the CD player also triggers the TV and the speaker system.

    What happens if the satellite system is turned on? It must also turn on the TV and speaker system. Tomorrow if there is a new device "USB photo viewer" that needs to use the display that the Television provides, it will also need to do the needful.

    This method seems to waste a lot of developer time and seems to be prone to errors. What if someone forgets to turn on the TV? Also each component needs to handle the case of when the TV/speaker is already on, etc...

    The CD player will work only with a TV set and speaker system that speak a language it understands (tied to an interface). Vendors will try to tie users to their interface in the hope of selling them all the components.

  2. Instead, if we just decouple all the systems and couple them on events, things become much clearer and easier to work with.

    The "button.enterteinment.cdplayer", "button.enterteinment.satellite" and "button.enterteinment.usbdevice" events should be listened to by each of the components and they should react accordingly.

    Another thing to remember is how to name the events. We should NOT name events as "start.entertainment.cdplayer". That sounds very imperative. Naming plays a very important role in how the rest of our system is built around these events. Wrongly named events can cause a lot of confusion and screw up the system even more than you thought possible!!! Event names should be suggestive of "something happening" and not of "something being asked for".

    Accordingly, events should not be named "do-this" and "do-that". Instead, prefer names like "this-happened" and "that-happened" so that listeners can react to these events in the way they deem most fit.

    By naming events imperatively, we immediately constrict their use cases to what we think they should be; defeating the original purpose of using events -- which is to free event raisers from having to think about what is to be done in reaction to a raised event.
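
To see what coupling on events could look like, here is a small Python sketch of a toy pubsub bus wired to the home theatre components above (the EventBus and the class names are invented for illustration, not taken from any real library):

class EventBus(object):
    def __init__(self):
        self.listeners = {}

    def subscribe(self, event_name, callback):
        self.listeners.setdefault(event_name, []).append(callback)

    def publish(self, event_name):
        # Tell everyone who cares that "something happened".
        for callback in self.listeners.get(event_name, []):
            callback()

class Television(object):
    def __init__(self, bus):
        # The TV decides for itself which happenings it reacts to;
        # the CD player never needs to know the TV exists.
        bus.subscribe("button.entertainment.cdplayer", self.turn_on)
        bus.subscribe("button.entertainment.satellite", self.turn_on)

    def turn_on(self):
        print("TV: switching on")

class SpeakerSystem(object):
    def __init__(self, bus):
        bus.subscribe("button.entertainment.cdplayer", self.turn_on)
        bus.subscribe("button.entertainment.satellite", self.turn_on)

    def turn_on(self):
        print("Speakers: switching on")

bus = EventBus()
tv = Television(bus)
speakers = SpeakerSystem(bus)

# The CD player's start button merely announces that it was pressed.
bus.publish("button.entertainment.cdplayer")

Adding the "USB photo viewer" button later is just one more publish() call; none of the existing components need to change.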

Sunday, May 09, 2010

One line to teach you Object Oriented Programming (OOP)

"When modeling objects, model behaviour, not attributes"
Courtesy: Ramki
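
A tiny, made-up Python sketch of what that one line means (the account classes here are just an illustration I cooked up, not Ramki's example):

# Attribute thinking: the object is a bag of data and everyone
# else does the work.
class AccountRecord(object):
    def __init__(self):
        self.balance = 0

# Behaviour thinking: the object is asked to do things; the data
# behind it stays an implementation detail.
class Account(object):
    def __init__(self):
        self._balance = 0

    def deposit(self, amount):
        self._balance += amount

    def withdraw(self, amount):
        if amount > self._balance:
            raise ValueError("insufficient funds")
        self._balance -= amount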

Probably the hottest summer I've seen in Mumbai

This is by far the hottest and stickiest summer I have seen in Mumbai since my stint on this planet.

Sunday, May 02, 2010

Protocol Buffers v/s HTTP

I was discussing serialization costs with my immediate superior, Ramki, when one thing led to another and I landed up wanting to compare the serialization and deserialization costs of protocol buffers (a binary format) to those of HTTP (a text based protocol) in python.

After performing some tests (in python), I observed that protobuf was taking almost 4 times the amount of time to deserialize data as compared to the simplistic HTTP based header parsing. Of course, these 2 are meant for different purposes and Ramki mentioned that a fixed format protocol would save data on the wire since the attribute names (header names in HTTP) need not be sent on the wire; just the values (header values in HTTP) are sufficient.

Also, a binary protocol should be much faster as far as serialization and deserialization are concerned, but we found out that either python's pack and unpack are a bit slow, or python is blazingly fast at doing string operations.
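
One way to sanity-check that claim in isolation is to time just the two primitive operations against each other -- a quick sketch using only the standard library (the specific prefixes here are invented for the comparison, and the numbers will obviously vary by machine):

import struct
import timeit

blob = struct.pack("H", 42)   # two-byte binary length prefix
text = "42\r\n"               # text length prefix, HTTP-style

def read_binary():
    # What the binary parsers below do per field/object.
    return struct.unpack("H", blob)[0]

def read_text():
    # What the HTTP-style parser below does per header block.
    return int(text.split("\r\n", 1)[0])

print(timeit.timeit(read_binary, number=100000))
print(timeit.timeit(read_text, number=100000))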

Here is a representative output from one such run of the program below:

ramki serialization time: 0.125000 seconds
ramki deserialization time: 0.156000 seconds
protobuf serialization time: 0.453000 seconds
protobuf deserialization time: 0.453000 seconds
http serialization time: 0.047000 seconds
http deserialization time: 0.125000 seconds

The code:

import sys
import message_pb2
import time
import struct

def multi_serialize(o, n):
    fragments = []
    for i in xrange(0, n):
        data = o.SerializeToString()
        fragments.append("%d\r\n" % len(data))
        fragments.append(data)
    return "".join(fragments)

def multi_parse(input, n):
    il = len(input)
    start = 0
    objects = []
    for i in xrange(0, n):
        rnPos = input.find("\r\n", start)
        if rnPos == -1:
            print "Premature end of input. Terminating..."
            return None
        lenStr = input[start:rnPos]
        start = rnPos + 2
        lenInt = int(lenStr)
        # Read off lenInt bytes off the stream
        data = input[start:start + lenInt]
        start += lenInt
        obj = message_pb2.Header()
        obj.ParseFromString(data)
        objects.append(obj)
    return objects


def http_header_create(request, headers):
    line1 = "GET %s HTTP/1.1" % request
    hLines = [line1]
    for k,v in headers.items():
        hLines.append(k + ": " + v)
    return "\r\n".join(hLines) + "\r\n\r\n"

def http_header_parse(input):
    parts = input.split("\r\n")
    line1 = tuple(parts[0].split())
    headers = { }
    for i in xrange(1, len(parts)):
        h = parts[i].split(": ")
        if len(h) == 2:
            k,v = h
            headers[k] = v
    return (line1, headers)

def http_multi_serialize(request, headers, n):
    fragments = []
    for i in xrange(0, n):
        fragments.append(http_header_create(request, headers))
    return "".join(fragments)

def http_multi_parse(input, n):
    il = len(input)
    start = 0
    objects = []
    for i in xrange(0, n):
        delimPos = input.find("\r\n\r\n", start)
        if delimPos == -1:
            print "Premature end of input. Terminating..."
            return None
        headerString = input[start:delimPos]
        headerObject = http_header_parse(headerString)
        objects.append(headerObject)
        start = delimPos + 4
    return objects

def ramki_serialize(obj):
    totalLength = 0
    attrs = [ ]
    for k,v in obj.__dict__.items():
        totalLength += (2 + len(v))
        attr = struct.pack("H", len(v)) + v
        attrs.append(attr)
    attrs.insert(0, struct.pack("H", totalLength))
    return "".join(attrs)

class RamkiDummy(object):
    pass

shortStruct = struct.Struct("H")

def ramki_deserialize(input):
    # For now, we lose attribute names
    d = RamkiDummy()
    packetLength = shortStruct.unpack(input[0:2])[0]
    s = 2
    ctr = 0
    while s < packetLength + 2:
        # print "CTR: " + str(ctr)
        attrLength = shortStruct.unpack(input[s:s+2])[0]
        s += 2
        # Read attrLength bytes of data
        attrValue = input[s:s+attrLength]
        s += attrLength
        setattr(d, "attr" + str(ctr), attrValue)
        ctr += 1
    return d

def ramki_multi_serialize(obj, n):
    stream = []
    for i in xrange(0, n):
        stream.append(ramki_serialize(obj))
    return "".join(stream)

def ramki_multi_deserialize(input, n):
    objects = []
    s = 0
    for i in xrange(0, n):
        objectLength = shortStruct.unpack(input[s:s+2])[0] + 2
        obj = ramki_deserialize(input[s:s+objectLength])
        s += objectLength
        objects.append(obj)
    return objects

def main():
    class Dummy(object):
        pass

    d = Dummy()
    d.request = "GET"
    d.resource = "/user/ramki/getVcard/"
    d.version = "1.1"
    d.destination = "localhost:8080"
    d.custom1 = "434552"
    d.custom2 = "no"

    s = time.time()
    input = ramki_multi_serialize(d, 10000)
    print "ramki serialization time: %f seconds" % (time.time() - s)

    s = time.time()
    ramki_multi_deserialize(input, 10000)
    print "ramki deserialization time: %f seconds" % (time.time() - s)

    h = message_pb2.Header()
    h.request = "GET"
    h.resource = "/user/ramki/getVcard/"
    h.version = "1.1"
    h.destination = "localhost:8080"
    h.custom1 = "434552"
    h.custom2 = "no"

    s = time.time()
    stream = multi_serialize(h, 10000)
    print "protobuf serialization time: %f seconds" % (time.time() - s)

    s = time.time()
    objs = multi_parse(stream, 10000)
    print "protobuf deserialization time: %f seconds" % (time.time() - s)

    hh = { "Host": "localhost",
           "X-MessageID": "33",
           "X-ACKMessage": "100",
         }

    s = time.time()
    stream = http_multi_serialize("/user/ramki/getVcard/", hh, 10000)
    print "http serialization time: %f seconds" % (time.time() - s)

    s = time.time()
    objs = http_multi_parse(stream, 10000)
    print "http deserialization time: %f seconds" % (time.time() - s)

    return 0

sys.exit(main())


This page claims that protobuf deserialization takes 3478ns, whereas I am seeing about 45,300ns per message (0.453s for 10,000 messages), which is expected since we are running on different hardware.

From here, it seems as if protobuf is good at packing ints but not strings.

In fact, none of the deserialization times there even come close to the 12,500ns per header (0.125s for 10,000 headers) that I see for http parsing. This is mainly because those methods do type conversion, which the HTTP parser is not doing.
If that is introduced into the mix, I guess those costs will add up too. However, the application that I want this for doesn't really need it, and there will be many applications that don't.

In the link above, for many of the methods serialization takes more time than deserialization, which is slightly curious.

Thursday, April 22, 2010

Inheritance and extension

I have been stumped by this issue for a while, so I decided to write about it.

Say you have an element container called SimpleList which exposes the following methods (apart from others of course):
1. add(element): Adds 'element' to the SimpleList
2. addAll(elements): Adds 'elements' to the SimpleList

These methods are extensible (can be overridden and hence extended).

They are implemented as follows in SimpleList.

# Contract: Will add 'element' to SimpleList or throw an
# exception in case of an error.
method add(element)
    # Adds element to the underlying data store/array

# Contract: Will add 'elements' to SimpleList or throw an
# exception in case of an error. This method may throw an
# exception after adding 'some' of the elements from 'elements'.
method addAll(elements)
    for element in elements
        this.add(element)


Fair enough till now.

Up comes someone who decides they can do better and they extend SimpleList and create ABetterSimpleList!!

They are implemented as follows in ABetterSimpleList.

# Contract: Will add 'element' to ABetterSimpleList or throw
# an exception in case of an error.
method add(element)
    # Adds element to an underlying cache since SimpleList
    # may take a while to add() this element.
    if cache.size() > 20:
        this.addAll(cache)
        cache.clear()

# Contract: Will add 'elements' to ABetterSimpleList or throw
# an exception in case of an error. This method may throw an
# exception after adding 'some' of the elements from 'elements'.
method addAll(elements)
    # Just call the base class' addAll method

All contracts have been satisfied, but can you spot the subtle bug?

Yes, there is an infinite recursion here!! Calling add() on ABetterSimpleList will add elements to the cache till the cache grows past 20 elements in length. Once that happens, it will call addAll() which will call the base class' addAll() which will call (what it thinks is) its add() method, except that it has been overridden by us to call addAll()!!
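
Here is a self-contained Python sketch of the same trap (the names are just invented to mirror the pseudocode above); pushing the cache past 20 elements blows the stack:

class SimpleList(object):
    def __init__(self):
        self.items = []

    def add(self, element):
        self.items.append(element)

    def addAll(self, elements):
        # Implemented in terms of the overridable add().
        for element in elements:
            self.add(element)

class ABetterSimpleList(SimpleList):
    def __init__(self):
        SimpleList.__init__(self)
        self.cache = []

    def add(self, element):
        self.cache.append(element)
        if len(self.cache) > 20:
            self.addAll(self.cache)   # base addAll() calls our add() again...
            self.cache = []

    # addAll() is inherited unchanged from SimpleList.

lst = ABetterSimpleList()
try:
    for i in range(30):
        lst.add(i)
except RuntimeError as e:   # RecursionError in newer Pythons
    print("infinite recursion: %s" % e)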

Well, how do you solve this?? I don't have a concrete answer myself, but just scratching the surface has led me to believe that there are 2 types of class extension:
1. Extending for the base class (I call this framework extension)
2. Extending for users/subclasses (I call this normal extension)

In case [1], it is expected that the base class will consume your interfaces so you program for the base class. However, in case [2], your consumers are externals and not internals like your base class. Hence, you should now program for them.

An important side effect of this is that classes that are meant to be consumed by subclasses (case [2]) should NOT use their own overridable methods in their own implementation or else they CAN NOT give any guarantees to their consumers (subclasses in this case). However, framework extended classes (case [1]) SHOULD use their own public interfaces since that IS the purpose of them allowing method overriding on those interfaces. They are expected to be the consumers of their subclass' functionality.

Examples of each type of class:
1. Framework Extension: Java's Runnable, C++'s streambuf
2. Normal Extension: Java's LinkedList, AbstractList. Typically anything that is layered falls into this category

Patterns for Concurrent, Parallel, and Distributed Systems

I happened to chance upon this great site which describes Patterns for Concurrent, Parallel, and Distributed Systems.

I landed up here searching for the proactor and reactor patterns, both of which are described very well.

http://www.cs.wustl.edu/~schmidt/patterns-ace.html

Tuesday, February 02, 2010

An experiment with carbs

About a month ago, Appu, Abbas and myself met and Appu happened to mention that an increased intake of carbs. generally lent itself to hunger pangs if one did not eat for a while.
For the last month, I've been having only salad for lunch, and daal, vegetables and chapati (3 nos.) for dinner, and have also been limiting my carb. intake by way of reducing the amount of sugar I add to my hot beverages.
However, yesterday I had 4 liquor chocolates (which Devdas had so generously brought back from New Zealand), 5 milk pedhas (courtesy Jaineev) and 2 Kaju Katris (courtesy Mukesh), which considerably increased my carb intake for that day. Furthermore, I had 5 chapatis and extra helpings of other stuff yesterday for dinner. I don't know if it had any direct link to the carb. intake, but I will be monitoring this more closely now.

Moments

I visited Sandeep in Bangalore when I had gone for the Agile 2010 conference in Jan 2010. Good times followed ;)

Monday, February 01, 2010

Vanilla Milk using natural vanilla extract

I started making vanilla extract in July 2009 using adaptations of the innumerable recipes available online. It's now been 7 months since I started so I decided to try out the extract for making vanilla milk (which btw I absolutely love and up to this point made using pulverized vanilla beans).
  • Attempt-1: Measure about 1 cup (approx. 200 ml) of milk in a glass, pour it in a vessel and heat. Add 1/2 tsp vanilla extract to the glass (which has traces of cool milk). The last bit caused the trace amounts of milk in the glass to curdle. However, pouring the heated milk and sugar in the glass did not curdle the rest of it. The whole drink smelt of alcohol though.

  • Attempt-2: This time, I decided to not leave trace amounts of milk before adding the extract to the glass. Again however, I did get some smell of alcohol.

  • Attempt-3: This time, I decided to do some research online and figure out if anyone else was facing the problem of an alcoholic smell in their extract. As it turns out, it is expected to be this way for a real extract!! (feature, not a bug ;) ). I also happened to read that adding sugar reduces the alcoholic smell, and that when this extract is used in cooking, most of the alcohol will in fact evaporate and hence not leave behind that alcoholic smell. So, this time, I added some sugar to the glass before adding the vanilla extract, and after pouring in the hot milk, I stirred for about 4-5 minutes before consuming the drink. Voila!! No smell now!!


Another nice link explaining a lot of things.

Friday, January 29, 2010

Eggless Brownies Recipe

Makes 1kg Eggless Brownies



Ingredients:

  • Flour (sifted): 195g
  • Dark Chocolate: 160g
  • Fat (Butter + Oil): 120g (80g + 40g)
  • Water: 255ml
  • Cocoa powder: 2-3 tsp
  • Sugar (castor or powdered sugar): 2 cups
  • Salt (if not using salted butter): 1/2 tsp
  • Vanilla extract/essence: 1 1/2 - 1 4/5 tsp
  • Baking powder: 1 1/5 tsp (for gooey brownies) - 1 1/2 tsp (for cakey/fluffy brownies)
  • Chopped walnuts: 40g




Procedure:



  1. Grease and dust an 8" X 9" pan (basically 72 square inches in area), and pre-heat the oven to 180 degrees Celsius.


  2. Heat the water on a low flame till it is warm/mildly hot and add to it 1/3rd of the flour (65g). Do not remove the utensil from the flame, and continue stirring till the mixture thickens.


  3. Add the oil to the chocolate and melt the chocolate either in a microwave oven or on a double boiler.


  4. For making cakey (not gooey) brownies, mix the sugar with the butter and whisk/beat till the mixture is light and fluffy. For making gooey brownies, just mix the sugar with the flour and water mixture you just made.


  5. Mix the baking powder, cocoa powder and salt (if using unsalted butter) with the rest of the flour and sift it again a couple of times to mix them uniformly well.


  6. Add the vanilla extract to the flour and water mixture and mix it well.


  7. Now, mix all the portions that you have with each other and add the walnuts to this mixture. Make sure you mix it well. Perform this step quickly and briskly so that the baking powder doesn't lose its potency by the time you start baking.


  8. Transfer the contents to the greased and dusted pan, and bake in the oven for 45-55 minutes (or till you think it is done).


  9. Remove from the oven and wait for it to cool a bit (about 30-45 minutes), cut into pieces and enjoy!!




Saturday, January 16, 2010

One week of salad

I started having salad last week for lunch every day, and it's going pretty well as of now. Bought supplies for the next week today.



Ingredients:

  • Iceberg lettuce

  • Lettuce

  • Purple/Green cabbage

  • Broccoli

  • Red/Green/Yellow Capsicum

  • Tomatoes

  • Cherry Tomatoes

  • Baby corn

  • Fresh Green Peas

  • Finely chopped carrots

  • Green/Black Olives

  • Soaked and sprouted moong and moth

  • Extra Virgin olive Oil (for the dressing)

  • Honey (for the dressing)

  • Salt (for the dressing, added just before eating)

  • Mixed Herbs (for the dressing)


Procedure:

  1. Cut and mix whatever you want to have in the salad. I generally choose about 6-8 items every day, and mix and match in any order I feel fit.

  2. Introduce the dressing (generally just olive oil and honey) except for the salt, since the salt will make the vegetables dehydrate and lose their water. I take this tossed salad to office, so about 3 hours pass between the time I mix it and the time I consume it.


Saturday, January 09, 2010

It's funny to know....

.... that something you have thought of doing and have done has been thought of by someone else across the world and done in almost exactly the same way as you have....

It's saddening and encouraging at the same time. Saddening because it's not something novel any more. Someone has "been there.. and done that..". Encouraging because you know that you are not having any insanely weird and impractical thoughts and ideas and that you are solving some real problem out there.

This is the second time that this has happened to me, and it is a real confidence booster I should say!!

The project in question is StickyLinks. The exact same thing was done in 1997 and is called WebCite. The funny thing is that it was made for the exact same problem that I was trying to solve, which probably explains why it turned out to be so similar.

The thing that still gets to me is that even the layout of the version page (view where a version of a certain page is displayed) is almost identical. Both projects use frames. Both projects have a bookmarklet.

However, there are ways I plan to extend StickyLinks, so stay posted!!

ps. My next idea: To develop a search engine that will let you search for already existing ideas/concepts/implementations of thoughts that you may have and may think are novel. I use the internet (esp. the search engines) a lot to search for projects that I may want to undertake, just to check if they have already been done by someone else. Most of the time I have to search repeatedly, tweaking my search keywords ever so slightly in the hope of a match. Many a time, I fail to chance upon the right set of keywords (the words that the original author used to describe his/her idea/project) and am under the impression that the idea I have is novel/un-thought of. Of course, many times it does happen that such a thing does not exist. However, it is only time that lets you ascertain this (if a similar concept exists, you hear of it sooner rather than later, assuming you have shared it with friends).

Chocolate Rum Balls

This has been pending for a while now.
Barbara asked me how to make these last year!!!! (realistically, about a month ago) and I haven't yet gotten back to her. That is pretty bad of me.
Here goes nothing....



Ingredients:

  • Chocolate sponge cake: about 250g
  • Dark chocolate: about 250g (preferably ≥ 40% cocoa)
  • Fresh Cream: about 100g
  • Walnuts: about 100g (or to your taste)
  • Rum: to taste





Procedure:


  1. Crumble the cake well into.... well crumbles.... Use very soft hands to do this.

  2. Mix the rum with the crumbled cake, but don't form a paste.

  3. Melt the chocolate (always over steam).

  4. Heat the cream in a saucepan till it is simmering and mix it well with the melted chocolate.

  5. Make small pieces of the walnuts and add them to the chocolate and cream mixture and mix well again.

  6. Empty the cake crumbs into the mixture above and mix till it forms a solid paste which can be rolled into balls. Note: You may need more/less of the cake crumbs depending on how watery/viscous the chocolate and cream mixture is.

  7. Roll them into balls!!!! You can put a thin layer of melted butter or ghee (clarified butter) on your palms so that the stuff doesn't stick to your palms while mixing.





Merry Christmas, and have a Great New Year!! :) :)

Gaajar Halwa

I met Appu and Abbas today, and we had this nice talk about carbs. and how the fewer there are around and within us, the better it is for our kind of lifestyle. I have steadily reduced the amount of sugar I have with my cup of tea or ukada (or ukado, since it is a Gujarati recipe) and am okay with it.
The recipe below is in complete defiance of the statements above. However in my defense, I used much less sugar than I usually would have.



Ingredients:

  • Grated fresh red carrots (gaajar): about 1kg
  • Cow's Milk (preferably full fat containing ≥ 9% SNF): about 2ltr
  • Sugar: to your taste, but you could start off with 200g
  • Cow's Ghee (clarified butter): 2-3 Tbsp
  • Elaichi: seeds of 4-5 pods, well separated
  • Elaichi powder: 1 tsp (or to taste)





Procedure:


  1. Take the Ghee in a saucepan and add the elaichi seeds, and heat the ghee till it is hot, and just till you hear a pop sound.



  2. Add the grated carrots and coat every bit of them with the Ghee (on a slow flame).




  3. Heat the milk in a saucepan till it is hot.




  4. Pour some (about 500ml) of the hot milk into the carrot and ghee mixture, turn up the flame to medium and stir continuously.



  5. The milk will come to a boil.



  6. We want to make mawa (or khoa) from the 1.5ltr of remaining milk. You can find out how this is done. Basically it involves simmering the milk till almost all the water evaporates. The semi-solid mass that is left will be used later in the making of our halwa.


  7. Once the mawa is made, mix it with the rest of the carrot mixture (which should have considerably thickened by now. I hope you were continuously stirring it all this while!!).


  8. Also add the sugar. The sugar will immediately melt and release a lot of water. We need to boil off all this water.



  9. Stir continuously till the water evaporates and we are left with a semi-solid mass (more liquidy though).



  10. Evenly sprinkle the elaichi powder over this mass and stir continuously.



  11. Keep stirring till the mixture thickens to a consistency of your choice.



  12. Let it cool, and enjoy!!!! :)



btw, this is the 100th post on this blog!!