Thursday, August 02, 2012

SQL Query Execution Optimization opportunities missed: Part-1

There are some common queries that are used almost everywhere, but they seem to be not executed very cleverly by most (all??) SQL execution engines.

An example of one such query is the quintessential pagination query. Assume you have a schema for comments that looks like this:


CREATE TABLE COMMENTS(post_id INT, 
  comment_id INT, 
  comment_text TEXT,
  poster_id INT,
  posted_at TIMESTAMP,
  PRIMARY KEY (comment_id),
  UNIQUE KEY (post_id, posted_at, comment_id)
);


Now, you display the first page (say the most recent 10 comments) for a post (with id=33) using a query such as:

SELECT * FROM COMMENTS 
  WHERE post_id=33 
  ORDER BY posted_at DESC
  LIMIT 0,10;


The LIMIT 0,10 syntax fetches at most 10 comments from offset 0 in the result set.

The most natural way to fetch the 2nd page would be to say:

SELECT * FROM COMMENTS 
  WHERE post_id=33 
  ORDER BY posted_at DESC
  LIMIT 10,10;


Similarly, you can get the 3rd page by saying:

SELECT * FROM COMMENTS 
  WHERE post_id=33 
  ORDER BY posted_at DESC
  LIMIT 20,10;


and so on...

You must have noticed that in case LIMIT 1000,10 is specified, a naive execution of the query plan involves fetching all the 1000 rows before the 10 interesting rows are returned by the database. In fact, most databases, will do just that. There is in fact a better way to execute such queries.

(Let's leave aside the discussion of whether pagination queries should be implemented this way. The observant reader will notice that we have an index on (post_id, posted_at, comment_id), and that can be used to fetch the next 10 results given the post_id, posted_at, and comment_id of the immediately preceding comment).

We know that most (of not all) databases use B-Trees to store indexed information. It is easy to augment the B-Tree to hold information such as "how many elements under this node (including the node itself)?" This information alone will let us very quickly (in O(log n) I/O look-ups) locate the node of interest. Suppose we want the entry from the COMMENTS table at offset 1000 with post_id=33, we will first perform a look-up for the first entry with post_id=33. Once we find this key, we can quickly (in O(1) time) determine how many entries are less than the located entry (since we already have this information cached at each node). Let the found node be at offset OFF1. Subsequently, we can query the B-Tree to find the entry that has an offset of OFF1 + 1000 (since we have the cumulative counts cached for every value in the B-Tree node!).

For example, consider the Binary Tree shown below:



The alphabets in the top half of every node are the keys and the number in the bottom half of the node is the count of the number of nodes in the sub-tree rooted at that node.

We can answer the following query on this tree in O(log n) time:

SELECT KEY FROM BINARY_TREE
  WHERE KEY > D
  ORDER BY KEY
  LIMIT 3,2;


i.e. We want to fetch the 5 smallest keys greater than 'D', but we want only the last 2 from this set. i.e. The 2 greatest keys that are a subset of the set containing the 5 smallest keys greater than 'D' (read that a few times to get a hang of it).

We proceed as follows:
  1. Find D
  2. Find Rank(D) = 4
  3. Find the KEY such that Rank(K) = Rank(D)+3 = 7. This happens to be the KEY 'G'
  4. Perform an in-order traversal to find the in-order successor of the node that we are on till we either run out of nodes or we have walked 2 nodes (whichever comes first)
Note: To compute the rank of a node we use the following formula:

Rank(Node) = SubtreeSize(Node->Left) + 1 [if Node->Left exists]
Rank(Node) = 1                           [otherwise]


Hence, we can see that we are able to satisfy this query in time O(log n + Result_Set_Size), where n is the number of elements in the Binary Tree.

2 comments:

nomind said...

You know of any DB which provide these features with their indexes or is it a new proposal. Is this what you have doing in Fb, if of course it is not confidential :P.

BTW is it not true that MySQL creates a temporary table for order by clause? What about other Db's?

Dhruv Matani said...

> You know of any DB which provide these features
> with their indexes or is it a new proposal.

This is a new proposal.

> Is this what you have doing in Fb, if of course it
> is not confidential :P.

All opinions mentioned on this blog are my own, and not of my employers. Further, I can neither confirm nor deny that :-p

> BTW is it not true that MySQL creates a temporary table for
> order by clause? What about other Db's?

MySql is quite a dumb SQL execution engine, and is not able to perform many optimizations. For example, if you do a group by with an order by (even on a proper prefix of the same field), it will use filesort and a temporary even thought it doesn't need to if the column you are trying to sort & group by is already indexed. See this page on how MySql uses internal temporary tables for an enumerated set of situations where MySql may use a temporary table.