Monday, November 15, 2010

SQL Query optimization - Part-6 (How NATURAL JOINs work)

(The devil lies in the details edition)

To show how JOINing in databases works, we'll take up a very simple example and make some simplifying assumptions (which largely hold in practice) about databases & indexes so that we can reason clearly about certain concepts and yet still understand the key insights required for comprehending how NATURAL JOINs work.

Let's assume that we have 2 tables X & Y that we want to join on a column named 'c'.
SELECT * FROM X JOIN Y ON X.c=Y.c WHERE <condition>

For the sake of simplicity, we shall draw parallels between the tables X & Y and 2 in-memory arrays AX & AY. Also, let the sizes of tables X & Y be sX & sY. We shall consider a few cases and try to analyze how JOINing would work in each scenario (a sketch of these strategies, using the array analogy, follows the list):
  1. No index on 'c' in either table: The DB is forced to perform an O(sX.sY) operation by running code that looks like a nested loop. Hence, this type of join is called a nested loop join. This is the most expensive type of join and must be avoided if the tables X & Y are very large.
  2. B-Tree index on 'c' only in table X: (This case is symmetrical to a B-Tree index being present on 'c' only in table Y -- but not really). This case is akin to one of the arrays being sorted and the other being unsorted. Here, the execution engine can loop over the table that doesn't have an index (the unsorted array in our analogy) and for each element, lookup the corresponding value in the table with the index (sorted array lookup using binary search). The cost for joining is O(sX.log sY) or O(sY.log sX). This is why which table has the index is important and the 2 cases aren't exactly symmetrical. It is advisable to have an index on the larger table -- to keep the linear component in the complexity as low as possible.
  3. Hash index on 'c' only in table X: (This case is symmetrical to a Hash index being present on 'c' only in table Y -- but not really). Again, this case is similar to the one above and, depending upon whether table X or table Y has the index, the JOIN complexity will be either O(sY) or O(sX) respectively. Again, it is advisable to have an index on the larger table.
  4. Hash index on 'c' in both tables: This case isn't much different from the previous one except that the execution planner can decide whether to iterate over table X or table Y depending upon which table has fewer rows.
  5. B-Tree index on 'c' in both tables: This is a very interesting and common case and is similar to the problem of trying to find the intersection of 2 sorted arrays AX & AY. It should be clear that the complexity of doing so is O(sX + sY). This bound is NOT tight though -- if one table is much smaller than the other, iterating over the smaller one and probing the other's index costs only O(min(sX, sY) . log max(sX, sY)). Hence, it should be clear that some of the join strategies above can (in theory) be faster than this one. However, the execution planner may not always choose to use both indexes if it has heuristic data that suggests that iterating over one of the tables would be much cheaper (say because it contains only 2 rows). This join is most commonly referred to as the Merge Join because it looks like you are merging 2 arrays.
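To make the cases above concrete, here is a minimal Python sketch using the array analogy (AX & AY hold the values of column 'c'). This is purely illustrative -- the function names are made up, real engines work on disk pages rather than Python lists, and the sketch assumes values of 'c' are unique within each table:

from bisect import bisect_left

def nested_loop_join(AX, AY):
    # Case 1: no index on either side -- O(sX.sY) comparisons.
    return [(x, y) for x in AX for y in AY if x == y]

def index_lookup_join(AX, AY_sorted):
    # Case 2: index only on Y (the sorted array AY_sorted plays the role of
    # the B-Tree). Loop over the un-indexed side and binary-search the
    # indexed side: O(sX.log sY).
    out = []
    for x in AX:
        i = bisect_left(AY_sorted, x)
        if i < len(AY_sorted) and AY_sorted[i] == x:
            out.append((x, AY_sorted[i]))
    return out

def hash_lookup_join(AX, Y_hash_index):
    # Cases 3/4: hash index on Y (modelled as a dict) -- O(sX) probes.
    return [(x, Y_hash_index[x]) for x in AX if x in Y_hash_index]

def merge_join(AX_sorted, AY_sorted):
    # Case 5: index on both sides -- a single O(sX + sY) merge pass.
    out, i, j = [], 0, 0
    while i < len(AX_sorted) and j < len(AY_sorted):
        if AX_sorted[i] < AY_sorted[j]:
            i += 1
        elif AX_sorted[i] > AY_sorted[j]:
            j += 1
        else:
            out.append((AX_sorted[i], AY_sorted[j]))
            i += 1
            j += 1
    return out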

SQL Query optimization - Part-5 (Motivation)

(The better late than never edition)

I'm sure many of you have worked on web-apps (you know, stuff like wikipedia, gmail, facebook, codechef, reddit, digg, duckduckgo or even your personal web-site which has some dynamic/interactive content). I'm sure many of you have worked on sites that get traffic to the tune of a few hundred visitors per second. I'm sure you know how painstaking it is to keep such a site up, running and responsive for all its users. Lots of things like the web-server, DB, file system, OS, etc... need to be tweaked, tuned and retro-fitted to suit the needs of the things around them.

The one place (in my experience) that can easily introduce bottlenecks is the DB (database). That's because it is a real work-horse and is programmed by you (the application writer), who may not know what food it likes. It's very easy to write a query that runs 10x slower than it really should and not notice this for a very long time.

Imagine the time you get slashdotted and need to support 5000 requests/second instead of the customary 50. Don't you wish your DB wouldn't just die under the load?? Well, if you are one such person who is pedantic about the performance of your database, you are thinking right!

You will mostly learn how to optimize your DB through experimentation, but since others have been burned before, you might as well stay and take advantage of their experiences.

SQL Query optimization - Part-4 (Myths)

  1. If I have an index on every column of my table, all my queries will run fast. This is the equivalent of saying that adding all the good ingredients in a bowl and mixing away to glory will result in the world's best dish. Doesn't happen. The sooner you snap out of it, the better it is for you.
  2. Joins are expensive. Everything should be completely denormalized. This isn't an SQL optimization myth, just a SQL myth in general. Loosely speaking though, if your joins are all natural joins and the join columns are mostly unique and all your filter conditions are connected with a conjunction (AND clause) and you don't go crazy with GROUP BY, ORDER BY and HAVING, you will have made yourself a pretty fast query. The key to remember is to try and (anyhow) maximize the ratio of Rows returned to Rows visited by the DB engine to produce the result set.
  3. If I have an index on a column that I want to ORDER BY then that index will be used and the DB won't use a temporary table to sort the result set. Not really. If the JOIN is not on the first column that you want to ORDER BY, then the DB will most likely use a temporary table and sort. Having said that, even if you join on a column that you will later ORDER BY, the execution planner might decide not to order intermediate results by that column because it thinks that that would be the most efficient query execution plan. This can happen when exactly one table in the JOIN doesn't have an index on the join column and that is the table that the DB decides to iterate over.
  4. Streaming result sets. I really don't know how many DBs actually do this, but it is a killer feature to have. Of course, you need to craft your queries very smartly to take advantage of this.

If you've gotten this far and don't feel lost, you are into this shit big time, and it's time to read some stuff by the big boys (here and here).

Sunday, November 14, 2010

SQL Query optimization - Part-3

Now for some trivial stuff:
  1. Make sure that every column you query on has an (appropriate) INDEX. Creating a full-text index on a column that you will only do exact matching on won't help. Neither will creating a HASH index on a column on which you intend to do < and > comparisons
  2. Avoid complex mathematical expressions in your WHERE clauses if you want indexes to be used. Databases are pretty dumb that way and practice and theory are still some way apart in this area (which is why I want to bridge this gap!!). Don't do stuff like: WHERE 2*col1+3 < col2. This will probably NOT be optimized by the execution engine. Of course, if there is no other way out, then you must do it. I suggest you create another column (col3) which always gets the value 2*col1+3, and query using col3 whenever you need this expression
  3. (This is an ugly condition and probably won't be optimized). Suppose you have 2 columns (col1 & col2) on the same table and want to select all rows WHERE col1 < col2. This will most likely result in a full table scan even if all columns have an index on them. Try to create a new column to flag this condition and select on that column if you need speed. Update: I can definitely see a way in which databases could use an index on (col1, col2) or (col2, col1) (if one exists), but you'll have to check with your particular database if that is the case (by doing an EXPLAIN or its equivalent).

I just noticed

This has been my most active year of blogging. 36 posts including this one till now. (see the bar on the right). Which probably means that I haven't been working as much this year as compared to the last few years.

SQL Query optimization - Part-2

Today we'll talk about whether to use JOINs or nested queries. TDDB (my db engine :D) doesn't support nested queries so I don't talk about them much -- but that's just selfish of me (actually, even MySQL didn't support them for a long time). I've grown out of that, so here goes: nested queries can actually speed up your query by a HUGE amount.

Suppose you have, say, 6 tables which you are joining and only 2 of them are actually involved in any logical reasoning like generating unique rows. The rest of the tables are vestiges of an over-enthusiastic DB architect or a graduate fresh out of college who thinks that everything should be in BCNF (been there; done that).

Now the other 4 tables are being used just to provide data for the purpose of projection and really have no reason to be part of the JOIN. Hence, we refactor the query and create a temporary relation from the 2 tables that actually mean something. Most real queries are interested in only the top 10 or 20 results (ordered by some criteria), so we do whatever we want on the join of these 2 tables.

This temporary relation is then joined with the other 4 relations and voila! This recipe serves 100 queries for the cost of 1 ;)

This is possible since all the extraneous tables have been avoided in the join that involved ordering, grouping and potentially many other expensive operations.

This can be optimized even more by using a materialized view for the inner query. However, that will increase the insertion time for those tables, and should be used with care. Besides, I don't think all DB engines support materialized views.

SQL Query optimization - Part-1

I'm planning to write some random posts on query optimization. Here's the first one. I'll get straight to the point 'cause I don't like beating about the bush (well not usually), so here it goes.

This post will mainly talk about optimizing queries that use the LIKE keyword.
  1. Try to avoid it if you can. Databases are generally bad at optimizing LIKE queries no matter how good they may be. More often than not, a LIKE query will result in a full-table-scan
  2. If your LIKE query looks something like this: colname LIKE 'something%' then there is some good news for you. You can convert this to colname ≥ 'something' AND colname ≤ 'somethingz' (I'm assuming that colname contains only ASCII text. If it contains unicode, use the last UNICODE character in place of 'z'). A sketch of why this works appears after this list
  3. No matter how hard you pray, colname LIKE '%something' will NOT be optimized by your favourite DB engine. You will HAVE to reverse the contents of colname, store them in colname_reversed and query on that as colname_reversed LIKE 'gnihtemos%'
  4. colname LIKE '%something%' is just asking for trouble
  5. And so is stuff like LOWER(colname) = 'something'. This is because anything that transforms the column data can NOT use the index since the index indexes the ORIGINAL value, not the transformed value
  6. If you are using MySQL, the LIKE operator is case-insensitive so you can always replace all statements that look like LOWER(colname) = 'something' with colname LIKE 'something' and it will always work. Silly but useful
  7. To get a more detailed understanding of how indexes work, read up on B-Trees
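Here is a tiny Python sketch (illustrative only, not database code) of why the rewrite in point 2 works: an index keeps values in sorted order, so all strings sharing a prefix sit in one contiguous range that two binary searches can locate.

from bisect import bisect_left, bisect_right

def prefix_range(sorted_values, prefix):
    # All entries of the sorted list that start with 'prefix'.
    lo = bisect_left(sorted_values, prefix)
    hi = bisect_right(sorted_values, prefix + '\uffff')   # upper sentinel, like the 'z' trick above
    return sorted_values[lo:hi]

names = sorted(["someday", "something", "somethingness", "somewhat", "sword"])
print(prefix_range(names, "something"))   # ['something', 'somethingness']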

Undecidable

Many times have I come across concepts that mimic the concept of an Undecidable problem. So, for all you people who don't know what an Undecidable (or partially decidable -- please pardon the terminology) problem is, you can check out the wiki page here or read the layman's definition below:

If a solution to the problem has a "YES" answer then it WILL be found, but if it has a "NO" answer, you'll never know, because you'll keep running the algorithm in the hope of finding the "YES" answer.

Now, it turns out that in the last few months I've come across many instances of such "undecidable" problems. I'll list out a few here:
  1. If you have good marks on a test, you are a good student, but if you don't then it doesn't mean that you are a bad student
  2. If you have lots of fans on your facebook page, then you are famous but if you don't, it doesn't necessarily mean that you aren't

So, the next time I have an argument/discussion with anyone and I mention "this is undecidable", I'll try to give them a link to this post as well.

Friday, October 29, 2010

Why Mongrel2 Rocks

I used Mongrel2 for my latest project lib-face and it was actually quite a pleasure to work with. Without wasting any more of your time, here are the reasons:
  1. It's fast
  2. It's simple to install and use. Though getting all dependencies is a bit time consuming
  3. It's very well documented
  4. The documentation isn't outdated
  5. I mailed the author asking for some clarifications and got a reply pretty fast. This means that if I get stuck using it, there is someone out there to help me till a community develops
  6. It gives out meaningful error messages that actually help
  7. Writing async handling code is very very simple because you need to explicitly call a function to return. This means that you can hold on to a request for as long as you want
  8. Mongrel2 uses ZeroMQ as the transport. This means that I can potentially implement some other (non-HTTP) protocol and have my code handle requests for that protocol since it is now only talking to ZeroMQ
  9. It's simple yet powerful
  10. It supports a lot of languages -- remember it's language agnostic!!

Wednesday, September 15, 2010

A very fast approach to auto complete (or search suggestions)

You've seen search engines suggest queries when you begin typing the first few letters of your search string. This is being done by Duck Duck Go as well as Google (to name a few). This is typically done by maintaining a list of past queries and/or important strings that the search engine thinks are worthy of being suggested to a user that is trying to find something similar. These suggestions are effective only if the search engine spits them out very fast since these should show up on the screen before the user has finished typing what he/she wanted to type. Hence the speed with which these suggestions are made is very critical to the usefulness of this feature.

Let us consider a situation (and a possible way of approaching this problem) in which when a user enters the first few letters of a search query, he/she is presented with some suggestions that have as their prefix, the string that the user has typed. Furthermore, these suggestions should be ordered by some score that is associated with each such suggestion.

Approach-1:

Our first attempt at solving this would probably involve keeping the initial list of suggestions sorted in lexicographic order so that a simple binary search can give us the 2 ends of the list of strings that serve as candidate suggestions. These are all the strings that have the user's search query as a prefix. We now need to sort all these candidates by their associated score in non-increasing order and return the first 6 (say). We will always return a very small subset (say 6) of the candidates because it is not feasible to show all candidates since the user's screen is of bounded size and we don't want to overload the user with too many options. The user will get better suggestions as he/she types in more letters into the query input box.

We immediately notice that if the candidate list (for small query prefixes, say of length 3) is large (a few thousand), then we will be spending a lot of time sorting these candidates by their associated score. The cost of sorting is O(n log n) since the candidate list may be as large as the original list in the worst case, and hence this is the total cost of the approach. Apache's solr uses this approach. Even if we keep the scores bound within a certain range and use bucket sort, the cost is still going to be O(n). We should definitely try to do better than this.
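Here is a rough Python sketch of approach-1 (the names and the data layout are made up for illustration): two binary searches find the candidate range, and the candidates are then ranked by their score.

from bisect import bisect_left, bisect_right
import heapq

def suggest(entries, prefix, k=6):
    # entries is a list of (suggestion, score) pairs kept sorted by suggestion.
    lo = bisect_left(entries, (prefix, float('-inf')))
    hi = bisect_right(entries, (prefix + '\uffff', float('inf')))
    candidates = entries[lo:hi]                    # may be thousands of items
    # Sorting them all would be O(n log n); even picking just the top k is O(n log k).
    return heapq.nlargest(k, candidates, key=lambda e: e[1])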


Approach-2:

One way of speeding things up is to use a Trie and store (pointers or references to) the top 6 suggestions at or below that node in the node itself. This idea is mentioned here. This results in O(m) query time, where m is the length of the prefix (or user's search query).

However, this results in too much wasted space because:
  1. Tries are wasteful of space and
  2. You need to store (pointers or references to) 6 suggestions at each node which results in a lot of redundancy of data

We can mitigate (1) by using Radix (or Patricia) trees instead of Tries.
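A rough Python sketch of approach-2 (hypothetical code, not any particular library): every trie node stores references to the best k suggestions found in its subtree, so answering a query is a plain O(m) walk down the trie.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.top = []                    # up to k (score, suggestion) pairs for this subtree

def insert(root, suggestion, score, k=6):
    node = root
    for ch in suggestion:
        node = node.children.setdefault(ch, TrieNode())
        node.top.append((score, suggestion))
        node.top.sort(reverse=True)      # keep the best-scored entries first...
        del node.top[k:]                 # ...and only k of them (this is the redundancy in point 2)

def suggest(root, prefix):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return [s for _, s in node.top]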


Approach-3:

There are also other approaches to auto-completion, such as prefix expansion, that are used by systems such as redis. However, these approaches use up memory proportional to the square of the length of each suggestion (string). The easy way to get around this is to store all the suggestions in one linear string (buffer) and represent each suggestion as an (index, length) pair into that buffer. A short sketch follows the worked example below. For example, suppose you have the strings:
  1. hello world
  2. hell breaks lose
  3. whiteboard
  4. blackboard

Then your buffer would look like this:
hello worldhell breaks losewhiteboardblackboard
And the 4 strings are actually represented as:
(0,11), (11,16), (27,10), (37,10)

Similarly, each prefix of the suggestion "whiteboard" is represented as:
  1. w => (27,1)
  2. wh => (27,2)
  3. whi => (27,3)
  4. whit => (27,4)
  5. white => (27,5)
  6. whiteb => (27,6)
  7. and so on...
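A tiny Python sketch of this trick (the variable names are just for illustration): the suggestions live in one flat buffer and every string -- and every expanded prefix -- is a (start, length) pair into it instead of a separate copy.

suggestions = ["hello world", "hell breaks lose", "whiteboard", "blackboard"]

buffer = "".join(suggestions)            # "hello worldhell breaks losewhiteboardblackboard"
spans, start = [], 0
for s in suggestions:
    spans.append((start, len(s)))        # (index into buffer, length)
    start += len(s)
print(spans)                             # [(0, 11), (11, 16), (27, 10), (37, 10)]

def prefix_spans(start, length):
    # Every prefix of the suggestion stored at (start, length), without copying it.
    return [(start, i) for i in range(1, length + 1)]

print(prefix_spans(27, 10)[:3])          # [(27, 1), (27, 2), (27, 3)]
print(buffer[27:27 + 3])                 # 'whi'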

Approach-4:

We can do better by using Segment (or Interval) Trees. The idea is to keep the suggestions sorted (as in approach-1), but have an additional data structure called a Segment Tree which allows us to perform range searches very quickly. Typical queries such as min, max, sum, etc... on ranges in a segment tree can be answered in O(log n), where n is the number of leaf nodes in the Segment Tree. So, once we have the 2 ends of the candidate list, we perform a range search to get the element with the highest score in the list of candidates. Once we get this element, we insert its range (with the maximum score in that range as the key) into a priority queue. The top element in the queue is popped and split at the location where the element with the highest score occurs, and the maximum scores for the 2 resulting ranges are computed and pushed back into the priority queue. This continues till we have popped 6 elements from the priority queue. It is easy to see that we will never have considered more than 2k ranges (here k = 6).

Hence, the complexity of the whole process is the sum of:
  1. The complexity for the range calculation: O(log n) (omitting prefix match cost) and
  2. The complexity for a range search on a Segment Tree performed 2k times: O(2k log n) (since the candidate list can be at most 'n' in length)

Giving a total complexity of:
O(log n) + O(2k log n)
which is: O(k log n)
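Below is a condensed Python sketch of approach-4 (illustrative only -- this is not cpp-libface's code, and in a real server the tree would be built once up front rather than per query). A segment tree over the score array answers "index of the maximum score in [lo, hi)" in O(log n), and a max-heap of sub-ranges then pops the top k suggestions, splitting each popped range at its maximum exactly as described above.

import heapq

def build(scores):
    # Iterative segment tree; internal nodes store the index of the larger score.
    n = len(scores)
    tree = [0] * (2 * n)
    for i in range(n):
        tree[n + i] = i
    for i in range(n - 1, 0, -1):
        l, r = tree[2 * i], tree[2 * i + 1]
        tree[i] = l if scores[l] >= scores[r] else r
    return tree

def range_argmax(tree, scores, lo, hi):
    # Index of the maximum score in the half-open range [lo, hi).
    n, best = len(scores), None
    lo += n; hi += n
    while lo < hi:
        if lo & 1:
            if best is None or scores[tree[lo]] > scores[best]:
                best = tree[lo]
            lo += 1
        if hi & 1:
            hi -= 1
            if best is None or scores[tree[hi]] > scores[best]:
                best = tree[hi]
        lo //= 2; hi //= 2
    return best

def top_k(strings, scores, lo, hi, k=6):
    # Top-k scored strings among the candidates at positions [lo, hi),
    # where lo & hi come from the binary search of approach-1.
    if lo >= hi:
        return []
    tree, out, heap = build(scores), [], []
    m = range_argmax(tree, scores, lo, hi)
    heapq.heappush(heap, (-scores[m], m, lo, hi))
    while heap and len(out) < k:
        _, m, l, h = heapq.heappop(heap)
        out.append(strings[m])
        for a, b in ((l, m), (m + 1, h)):          # split the popped range at its maximum
            if a < b:
                j = range_argmax(tree, scores, a, b)
                heapq.heappush(heap, (-scores[j], j, a, b))
    return out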

Update (29th October, 2010): I have implemented the approach described above in the form of a stand-alone auto-complete server using Python and Mongrel2. You can download it from here. (11th February, 2012): lib-face is now called cpp-libface and has a new home at github!

Some links I found online:
LingPipe - Text Processing and Computational Linguistics Tool Kit
What is the best open source solution for implementing fast auto-complete?
Suggest Trees (SourceForge): Though these look promising, treaps in practice can degenerate into a linear list.
Autocomplete Data Structures
Why is the autocomplete for Quora so fast?
Autocomplete at Wikipedia
How to implement a simple auto-complete functionality? (Stack Overflow)
Auto Compete Server (google code)
Three Autocomplete implementations compared
Trie Data Structure for Autocomplete?
What is the best autocomplete/suggest algorithm,datastructure [C++/C] (Stack Overflow)
Efficient auto-complete with a ternary search tree

Wednesday, August 18, 2010

What you've always wanted to know about GRE preparation

When I was preparing for the General GRE, it was a completely mad rush, with me having to ask anyone and everyone for help and advice. So, the conglomeration of all that information is what guided me throughout. Today, I am going to put that all down (combined with my own experiences) so that others don't have to grapple with the same questions and uncertainties and can really concentrate on studying for the exam.
  1. Don't fall into the trap of not booking a date!! You will find it hard to study if you haven't booked a date
  2. Book a date sufficiently into the future (say 2-3 months) and work towards it as if the date can not be postponed. I've seen many people postpone the date just because they can. Don't fall into that trap
  3. I used the Barron's book for everything
  4. Don't take the quant section lightly. Even though it is easy, it is time critical and there are many trick questions
  5. Even though the word list will seem daunting at first, keep at it and try to do as much of it as you can. I must have landed up knowing close to 3000 (out of the 3500) words in the Barron's word list
  6. Attack the High Frequency Words first (about 350). Make sure you use the list from both Barron and Kaplan. This file contains all such words and words that I found hard. Use the Barron's and Kaplan's list from here and make your own list of "hard" words (everyone will find a different set hard)
  7. Then do the words list-by-list. There are about 50 word lists in the Barron's book. Each word list can take you anywhere from 1 to 3 hours to do if you are doing it the first time. Towards the end, you should be able to do all the word lists in 2-3 days (assuming 5-6 working hours each day)
  8. Solve as many exercise problems in the Barron's book as you can in the quant section
  9. For quant practice, use The Nova GRE prep. book. The quant in this book is easier than what you will get in the GRE, so use this as your first round of preparation
  10. Take the 5 sample tests given at the end of the Barron's book. They are very close to the actual GRE test in terms of difficulty
  11. The Kaplan tests too are of GRE difficulty level (I would say a tiny bit more difficult, but that shouldn't be something you should factor in). The 6 practice exercises in Kaplan are way too easy (for quant) and the 30 min you get to solve them is sufficient. The quant in the 3 sample tests is harder and is of GRE difficulty level
  12. POWERPREP (the official GRE prep. software) is something you must do before the test. They give you 2 sample tests. Take them 1-3 days before the real test when you are sudo prepared for the test. These will give you a great indicator of what score you might land up getting. If you do badly in some section, try to spruce up your act in that section. The drawback of taking it so late (just before the test) is that you don't give yourself enough time to rectify your weak areas. However, I wouldn't want anyone to "waste" these tests before they are fully prepared for the simple reason that these tests are really good indicators of how much you might land up getting
  13. None of the other tests I took were of the GRE difficulty level, so make sure that you take all of the tests above before you take the GRE
  14. On the day of the test, make sure you arrive at the test center on time and have all the required documents
  15. That's about it and all the best!!

Wednesday, August 11, 2010

How things go wrong (reprise)

About 3 years ago, I wrote a post on How things go wrong. I should have made everyone I know read it since we just ran into an issue that bit us in the ass... Yes, it was caused as a result of:
  • Lack of return value checking from system calls and
  • Lack of sufficient error checking/handling code
and further complicated by a lack of logging in the application.

I can't begin to say how important each of them is. Oh well, I only have 2 things to say:
  • You need to learn for yourself
  • You can never be too careful
  • You can't learn from others' mistakes (yes, this and the 1st one are the same thing)

Thursday, August 05, 2010

Using Titanium Desktop to test web sites

I have been using Titanium Desktop for a while now at work, and am not at all impressed with the implementation or stability of the product and wouldn't want to see it being used in any serious project. HOWEVER, it is a brilliant concept which if done well can be humongously successful.

I was faced with the daunting task of testing a web-site written in Drupal the other day. Now, I don't know Drupal but have been told that it's got these magic form elements which it uses for validation. That, coupled with cookies, meant that I would have to do a lot of work to test even slightly complex flows using python/perl/ruby::Mechanize. That apart, I can NEVER simulate the javascript and other asset loads that a normal user would exercise. For complete end-to-end testing, I couldn't find anything that I would like using.

Here is where Titanium comes in. It doesn't have the sandboxing rules that normal browsers have, so you can load pretty much any page in a Titanium IFrame and do pretty much anything you want with it!!!!

I have given here an example of using Titanium to search for some text on Duck Duck Go.

To run the example, start Titanium Desktop as kboot --query "Your Search String"

Enjoy!!

Update: Here is a video showing how the automation works.

Saturday, July 03, 2010

Aaloo, mutter and tamater sabzi

This is the recipe for my current favourite vegetable dish. Translated to English, it means potato, peas and tomatoes. (Makes sabzi to serve 4.) Dedicated to my mom, since she makes this dish in the most awesome way I have ever tasted, and to sandy, since he is going to learn it this time ;)



Ingredients:

Medium sized aaloo(potatoes); boiled: about 3 nos
Fresh green peas: about 100g
Medium sized tomatoes: about 4 nos.
Clarified Butter (Ghee): 1 Tbsp
Hing (asafoetida): 1/2 tsp
Cumin seeds (jeera): 1 1/2 tsp
Turmeric (haldi): 1/4 tsp
Salt (namak): to taste
Red chilli powder (this is mainly used for the red color): 1/4 tsp (or to taste)
Fresh green chilli (cut and seeded): 1/2 nos.
Dhania and jeera powder: 1 tsp
Fresh coriander leaves: (to garnish)




Procedure:
  1. You need not skin the boiled potatoes. Just cut them into about 1/2" X 1/2" cubes
  2. Heat the ghee and when it is hot, add the hing, jeera and green chilli to it. When the jeera starts popping, add water to it
  3. Add peas to the water, and boil them till they are soft
  4. Now, cut the tomatoes (unskinned) into small pieces and add them to the boiling water and peas. Keep boiling till the tomatoes have softened
  5. Now, add the cut potatoes into the mixture and add red chilli powder, dhania and jeera powder, turmeric and salt
  6. Boil for about 2 more minutes till the potatoes have soaked the gravy
  7. Garnish with fresh coriander leaves and cover or serve

Monday, June 28, 2010

Actions speak...

... louder than words

Thursday, June 24, 2010

Guess what?

            _________
            |*      |
            |       |
    (------------   |-------)
    (               |       )
    (       |--------       )
    (       |               )
     -------|   ------------
            |       |
            |      *|
            ---------

Or this (if you are having a problem with the one above ;))
            _________
            I *     |
            I       |
    (------------   |-------)
    (               |       )
    (       |--------       )
    (       |               )
     -------|   ------------
            |       I
            |       I
            |_____*_I


+5 to the first one who guesses correctly

Murphy's Law (for Python programmers)

Download it here
Please rename the file to murphy.py and from your interpreter:
import murphy

Wednesday, June 23, 2010

YASNOSC

Yet another SQL and NoSQL comparison
  1. Scalability != NoSQL

  2. Performance != NoSQL [not always]

  3. Scalability == Good schema design

  4. Performance == Good schema design

But...
  1. Performance != Scalability

  2. Scalability != Performance

(but we should also note that NoSQL isn't "No SQL", but "Not Only SQL")

Tuesday, June 22, 2010

A new water pump for our washing machine

(This blog post has the potential to be discursive and go in no particular direction, so I'll try my best to remain as unwavering as I can)

It all started off when our washing machine conked out and decided to stop working. My grandma sent it off to the repairers, who fixed it and then had the brilliant idea to suggest attaching a water pump so that water from the tank above would flow in much faster. Now, anyone who has played with a syringe would know that no matter how hard you push (or pull), you just can NOT for the love of anything (or anyone) get more than a certain amount of liquid through the syringe (or in this case the pipe).

Now, our pipes are really old and are rusting from the inside. This is the reason that the flow of water has gone down as compared to the time when they were first installed. Even though I suggested getting the pipes changed for this very reason (since that is the real problem), my grandma would hear nothing of it from me... Oh well... so be it...

The brilliant engineer's suggestion came as a massive shock to me and it almost felt funny. Anyways, the pump has been installed and guess what... The flow of water in the pipe is still the same!! I wonder how that happened. Isn't the pump working? (scratch head, etc...)

The rate of flow of water in a pipe depends upon many things, one of them being the inner diameter of the pipe and another being the friction offered by the sides of the pipe. I don't know the exact formula, but I am guessing it is easy to come by. The pressure exerted by the water in the tank is more than sufficient to max out the flow rate in the pipe, and I would have guessed those engineers to know that. I guess it's back to the drawing board for them ;)

Friday, June 11, 2010

How to write bug-free code

Listen hard and listen fast. Listen once; I won't repeat myself...

While writing code, if you ever say to yourself:
  • Oh! but this will hardly ever happen OR

  • The user will never do this OR

  • What are the odds of this happening??

Please please please go ahead and write some code and comments to handle that situation. You can't ever be too careful.

Signing off,
Captain S.K.Eptic