Tuesday, November 16, 2010

SQL Query optimization - Part-7

(The "you need referential integrity for more than referential integrity" edition)

Why do we tag our blog posts with, well, tags? So that people can glean meaningful information from these terse blocks of text. A post tagged with vacation or tutorial says a lot about the blog post. Similarly, hinting to the DB that a certain column has a matching entry in some other table means a lot to the execution planner. Yes, I'm talking about Foreign Key references.

If it finds such a hint, it can infer that:
  1. Every value in this column matches some unique value in the column that it references.
  2. Since it points to a UNIQUE column, each row of this table will join with exactly 1 row from the referenced table. Hence, while NATURAL JOINing the 2 tables, the execution engine can stop scanning as soon as it finds one match.
  3. A Foreign Key constraint forces the referenced column to be UNIQUE. This means that a WHERE clause on that column will never match more than one row, and the optimizer can take advantage of that. If such a query occurs as a subquery, the optimizer can evaluate the sub-query first and replace it with a single constant which is the result of the inner query's execution.

All these optimizations can significantly reduce your query's execution time.
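
As a concrete example, here is a minimal sketch (the users and orders tables are made up for illustration) of the kind of schema hint the planner can exploit:

    CREATE TABLE users (
        id   INT PRIMARY KEY,   -- the referenced column must be UNIQUE or a PRIMARY KEY
        name VARCHAR(100)
    );

    CREATE TABLE orders (
        id      INT PRIMARY KEY,
        user_id INT NOT NULL REFERENCES users(id)   -- the hint
    );

    -- Because users.id is unique, each order joins with exactly one user,
    -- so the engine can stop probing after the first match:
    SELECT o.id, u.name FROM orders o JOIN users u ON u.id = o.user_id;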

Monday, November 15, 2010

SQL Query optimization - Part-6 (How NATURAL JOINs work)

(The devil lies in the details edition)

To show how JOINing in databases works, we'll take up a very simple example and make some simplifying assumptions about databases & indexes (assumptions which do hold in practice) so that we can reason clearly about certain concepts and still grasp the key insights required for understanding how NATURAL JOINs work.

Let's assume that we have 2 tables X & Y that we want to join on a column named 'c'.
SELECT * FROM X JOIN Y ON X.c=Y.c WHERE <condition>

For the sake of simplicity, we shall draw parallels between the tables X & Y and 2 in-memory arrays AX & AY. Also, let the sizes of tables X & Y be sX & sY respectively. We shall consider a few cases and try to analyze how JOINing would work in each scenario:
  1. No index on 'c' in either table: The DB is forced to perform a O(sX.sY) operation by running code that looks like a nested loop. Hence, this type of join is called a nested loop join. This is the most expensive type of join and must be avoided if the tables X & Y are very large.
  2. B-Tree index on 'c' only in table X: (The case of a B-Tree index on 'c' only in table Y is analogous -- but, as we shall see, not quite symmetrical.) This case is akin to one of the arrays being sorted and the other being unsorted. Here, the execution engine can loop over the table that doesn't have an index (the unsorted array in our analogy) and, for each element, look up the corresponding value in the table with the index (a sorted-array lookup using binary search). The cost of joining is O(sY.log sX) if the index is on X, or O(sX.log sY) if it is on Y. This is why it matters which table has the index, and why the 2 cases aren't exactly symmetrical. It is advisable to have the index on the larger table -- to keep the linear component of the complexity as low as possible.
  3. Hash index on 'c' only in table X: (The case of a Hash index on 'c' only in table Y is analogous -- again, not quite symmetrical.) This case is similar to the one above: the engine loops over the table without the index and probes the hash index in O(1) per row, so the JOIN complexity is O(sY) if the index is on X, or O(sX) if it is on Y. Again, it is advisable to have the index on the larger table.
  4. Hash index on 'c' in both tables: This case isn't much different from the previous one, except that the execution planner can decide whether to iterate over table X or table Y depending on which table has fewer rows.
  5. B-Tree index on 'c' in both tables: This is a very interesting and common case, and is similar to the problem of finding the intersection of 2 sorted arrays AX & AY. It should be clear that the complexity of doing so is O(sX + sY). This bound is NOT tight, and the cost may be as low as Θ(min(sX, sY)) (which is what case 4 achieves). Hence, some of the join strategies above can, in theory, be faster than this one. However, the execution planner may not always choose to use both indexes if it has heuristic data suggesting that iterating over one of the tables would be much cheaper (say, because it contains only 2 rows). This join is most commonly referred to as the Merge Join, because it looks like you are merging 2 sorted arrays.
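
If you want to see which of these strategies your planner actually picks, create the indexes for the case you care about and ask the database for its plan. A minimal sketch (the exact EXPLAIN syntax and output vary by database):

    CREATE INDEX idx_x_c ON X(c);   -- case 2: B-Tree index on 'c' only in X
    CREATE INDEX idx_y_c ON Y(c);   -- add this as well and you are in case 5
    EXPLAIN SELECT * FROM X JOIN Y ON X.c = Y.c;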

SQL Query optimization - Part-5 (Motivation)

(The better late than never edition)

I'm sure many of you have worked on web-apps (you know, stuff like wikipedia, gmail, facebook, codechef, reddit, digg, duckduckgo, or even your personal web-site with some dynamic/interactive content). I'm sure that many of you have worked on sites that get traffic to the tune of a few hundred visitors per second. And I'm sure you know how painstaking it is to keep such a site up, running, and responsive for all its users. Lots of things like the web-server, DB, file system, OS, etc. need to be tweaked, tuned, and retro-fitted to suit the needs of the things around them.

The one place (in my experience) that can easily introduce bottlenecks is the DB (database). That's because it is a real work-horse, and it is programmed by you (the application writer), who may not know what food it likes. It's very easy to write a query that runs 10x slower than it should and not notice this for a very long time.

Imagine the day you get slashdotted and need to support 5,000 requests/second instead of the customary 50. Don't you wish your DB wouldn't just die under the load? Well, if you are one such person who is pedantic about the performance of your database, you are thinking right!

You will mostly learn how to optimize your DB through experimentation, but since others have been burned before you, you might as well take advantage of their experiences.

SQL Query optimization - Part-4 (Myths)

  1. If I have an index on every column of my table, all my queries will run fast. This is the equivalent of saying that adding all the good ingredients in a bowl and mixing away to glory will result in the world's best dish. Doesn't happen. The sooner you snap out of it, the better it is for you.
  2. Joins are expensive. Everything should be completely denormalized. This isn't an SQL optimization myth, just an SQL myth in general. Loosely speaking, though, if your joins are all natural joins, the join columns are mostly unique, all your filter conditions are connected with a conjunction (AND clause), and you don't go crazy with GROUP BY, ORDER BY and HAVING, you will have made yourself a pretty fast query. The key thing to remember is to maximize the ratio of rows returned to rows visited by the DB engine to produce the result set.
  3. If I have an index on a column that I want to ORDER BY, then that index will be used and the DB won't use a temporary table to sort the result set. Not really. If the JOIN is not on the first column that you want to ORDER BY, then the DB will most likely use a temporary table and sort. Having said that, even if you join on a column that you will later ORDER BY, the execution planner might decide not to keep intermediate results ordered by that column, because it thinks some other plan would be more efficient. This can happen when exactly one table in the JOIN doesn't have an index on the join column, and that is the table the DB decides to iterate over (see the sketch after this list).
  4. Streaming result sets: I really don't know how many DBs actually implement this, but it is a killer feature to have. Of course, you need to craft your queries very smartly to take advantage of it.
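
Here is a sketch of point 3 (hypothetical posts and users tables; whether the sort is actually skipped depends on your database, so check with EXPLAIN):

    CREATE INDEX idx_posts_created ON posts(created_at);

    -- The planner can walk the index in order and skip the sort:
    SELECT * FROM posts ORDER BY created_at LIMIT 10;

    -- After a join, the intermediate result may no longer be ordered by
    -- created_at, so the planner may fall back to a temporary table + sort:
    SELECT p.*, u.name
    FROM posts p JOIN users u ON u.id = p.user_id
    ORDER BY p.created_at LIMIT 10;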

If you've gotten this far and don't feel lost, you are into this shit big time, and it's time to read some stuff by the big boys (here and here).

Sunday, November 14, 2010

SQL Query optimization - Part-3

Now for some trivial stuff:
  1. Make sure that every column you query on has an (appropriate) INDEX. Creating a full-text index on a column that you will only ever match exactly won't help. Neither will creating a HASH index on a column on which you intend to do < and > comparisons.
  2. Avoid complex mathematical expressions in your WHERE clauses if you want indexes to be used. Databases are pretty dumb that way, and practice still lags theory in this area (which is why I want to bridge this gap!!). Don't do stuff like: WHERE 2*col1+3 < col2. This will probably NOT be optimized by the execution engine. Of course, if there is no other way out, then you must do it. I suggest you create another column (col3) which always gets the value 2*col1+3, and query using col3 whenever you need this expression (see the sketch after this list).
  3. (This is an ugly condition and probably won't be optimized.) Suppose you have 2 columns (col1 & col2) on the same table and want to select all rows WHERE col1 < col2. This will most likely result in a full table scan even if all columns have an index on them. Try to create a new column to flag this condition and select on that column if you need speed. Update: I can definitely see a way in which databases could use an index on (col1, col2) or (col2, col1) (if one exists), but you'll have to check with your particular database whether that is the case (by doing an EXPLAIN or its equivalent).
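
A minimal sketch of the workaround from point 2 (the table t and the constant are made up; note that the expression is compared against a constant here, so the index can actually help -- comparing col3 against col2 would run into the problem from point 3):

    ALTER TABLE t ADD COLUMN col3 INT;
    UPDATE t SET col3 = 2*col1 + 3;          -- keep col3 in sync on every write
    CREATE INDEX idx_t_col3 ON t(col3);

    -- Instead of this (most likely a full table scan):
    SELECT * FROM t WHERE 2*col1 + 3 < 100;
    -- ...query the precomputed column (an index range scan):
    SELECT * FROM t WHERE col3 < 100;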

I just noticed

This has been my most active year of blogging. 36 posts including this one till now. (see the bar on the right). Which probably means that I haven't been working as much this year as compared to the last few years.

SQL Query optimization - Part-2

Today we'll talk about whether to use JOINs or nested queries. TDDB (my db engine :D) doesn't support nested queries, so I don't talk about them much -- but that's just selfish of me (actually, even MySQL didn't support them for a long time). I've grown out of that, so here goes: nested queries can actually speed up your query by a HUGE amount.

Suppose you have, say, 6 tables that you are joining, and only 2 of them are actually involved in any logical reasoning, like generating unique rows. The rest of the tables are vestiges of an over-enthusiastic DB architect or a graduate fresh out of college who thinks that everything should be in BCNF (been there; done that).

Now the other 4 tables are being used just to provide data for projection and really have no business being part of the main JOIN. Hence, we refactor the query and create a temporary relation from the 2 tables that actually mean something. Most real queries are interested in only the top 10 or 20 results (ordered by some criteria), so we do all the ordering, grouping, and filtering on the join of these 2 tables.

This temporary relation is then joined with the other 4 relations and voila! This recipe serves 100 queries for the cost of 1 ;)

This is possible since all the extraneous tables have been avoided in the join that involved ordering, grouping and potentially many other expensive operations.
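
Schematically, the refactored query looks something like this (all table and column names are made up for illustration):

    SELECT t.id, t.title, t.score, u.name, c.label, s.status, r.region
    FROM (
        -- The expensive part (ordering, limiting) touches only the 2 tables
        -- that actually determine the result:
        SELECT p.id, p.title, v.score, p.user_id, p.cat_id, p.status_id, p.region_id
        FROM posts p JOIN votes v ON v.post_id = p.id
        ORDER BY v.score DESC
        LIMIT 20
    ) AS t
    JOIN users u      ON u.id = t.user_id    -- the remaining 4 tables only
    JOIN categories c ON c.id = t.cat_id     -- supply columns for projection,
    JOIN statuses s   ON s.id = t.status_id  -- and now each joins against
    JOIN regions r    ON r.id = t.region_id; -- just 20 rows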

This can be optimized even more by using a materialized view for the inner query. However, that will increase the insertion time for those tables, and should be used with care. Besides, I don't think all DB engines support materialized views.

SQL Query optimization - Part-1

I'm planning to write some random posts on query optimization. Here's the first one. I'll get straight to the point 'cause I don't like beating about the bush (well not usually), so here it goes.

This post will mainly talk about optimizing queries that use the LIKE keyword.
  1. Try to avoid it if you can. Databases are generally bad at optimizing LIKE queries, no matter how good they may be otherwise. More often than not, a LIKE query will result in a full-table-scan.
  2. If your LIKE query looks something like this: colname LIKE 'something%' then there is some good news for you. You can convert this to colname >= 'something' AND colname < 'somethinh' -- the upper bound is just the prefix with its last character incremented ('g' becomes 'h') -- which is a range scan that a B-Tree index can satisfy. (I'm assuming that colname contains only ASCII text; if it contains Unicode, increment the last code point instead.)
  3. No matter how hard you pray, colname LIKE '%something' will NOT be optimized by your favourite DB engine. You will HAVE to reverse the contents of colname, store them in colname_reversed, and query on that as colname_reversed LIKE 'gnihtemos%'.
  4. colname LIKE '%something%' is just asking for trouble.
  5. And so is stuff like LOWER(colname) = 'something'. This is because anything that transforms the column data can NOT use the index, since the index indexes the ORIGINAL value, not the transformed value.
  6. If you are using MySQL, the LIKE operator is case-insensitive (with the default collations), so you can always replace statements that look like LOWER(colname) = 'something' with colname LIKE 'something' and it will work. Silly but useful.
  7. To get a more detailed understanding of how indexes work, read up on B-Trees.
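
A quick sketch of points 2 and 3 (hypothetical table docs with a column title; the reversed column must be kept in sync by the application or a trigger):

    -- Point 2: the prefix match rewritten as a range scan on a B-Tree index:
    SELECT * FROM docs WHERE title >= 'abc' AND title < 'abd';  -- same rows as title LIKE 'abc%'

    -- Point 3: a suffix match served by a reversed shadow column:
    CREATE INDEX idx_docs_title_rev ON docs(title_reversed);
    SELECT * FROM docs WHERE title_reversed LIKE 'cba%';        -- same rows as title LIKE '%abc'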

Undecidable

Many times have I come across concepts that mimic the concept of an Undecidable problem. So, for all you people who don't know what an Undecidable (or partially decidable -- please pardon the terminology) problem is, you can check out the wiki page here or read the layman's definition below:

If a solution to the problem has a "YES" answer then it WILL be found, but if it has a "NO" answer, you'll never know, because you'll keep running the algorithm in the hope of finding the "YES" answer.

Now, it turns out that in the last few months I've come across many instances of such "undecidable" problems. I'll list out a few here:
  1. If you have good marks on a test, you are a good student, but if you don't then it doesn't mean that you are a bad student
  2. If you have lots of fans on your facebook page, then you are famous but if you don't, it doesn't necessarily mean that you aren't

So, the next time I have an argument/discussion with anyone and I mention "this is undecidable", I'll try to give them a link to this post as well.