Saturday, July 7, 2012

From Months to Seconds with Subquery Materialization


In an earlier blog post, I showed how optimizer improvements in MySQL 5.6 gave better performance for several of the queries in the DBT-3 benchmark.
However, for one of the queries, Query 18, I was not able to give exact numbers for the improvement since the query took very long in MySQL 5.5. I decided to try to find out exactly how long the query would take, but when the query had run for one month, I gave up. How can a query take so long? Especially, when I had set up InnoDB with a buffer pool that should be large enough to hold the entire database. Let's have a look at the query:

select c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)
from customer, orders, lineitem
where o_orderkey in (
                select l_orderkey
                from lineitem
                group by l_orderkey
                having sum(l_quantity) > 313
  )
  and c_custkey = o_custkey
  and o_orderkey = l_orderkey
group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
order by o_totalprice desc, o_orderdate
LIMIT 100;

This query will find the orders from customers that have placed big orders. The reason that this takes so long in MySQL 5.5, is that the subquery in the WHERE clause will be executed for each processed row of the table for which this subquery is part of the WHERE predicate. Let's look at the EXPLAIN output for this query:
+----+--------------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
| id | select_type        | table    | type  | possible_keys                              | key                   | key_len | ref                     | rows    | Extra                           |
+----+--------------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
|  1 | PRIMARY            | customer | ALL   | PRIMARY                                    | NULL                  | NULL    | NULL                    |  150000 | Using temporary; Using filesort | 
|  1 | PRIMARY            | orders   | ref   | PRIMARY,i_o_custkey                        | i_o_custkey           | 5       | dbt3.customer.c_custkey |       7 | Using where                     | 
|  1 | PRIMARY            | lineitem | ref   | PRIMARY,i_l_orderkey,i_l_orderkey_quantity | i_l_orderkey_quantity | 4       | dbt3.orders.o_orderkey  |       2 | Using index                     | 
|  2 | DEPENDENT SUBQUERY | lineitem | index | NULL                                       | PRIMARY               | 8       | NULL                    | 6001215 | NULL                            | 
+----+--------------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+

The select_type for the subquery is DEPENDENT SUBQUERY. Since the left hand side of the IN-expression is a field from the orders table, the subquery will be executed for each row processed from this table. This implies that the index scan of the lineitem table will be performed more than one million times given the rows estimates in the EXPLAIN output (150000*7). I have measured that one execution of the sub-query takes about 3.5 seconds when all the data is in memory. Hence, it will take more than 40 days to execute it one million times.

In MySQL 5.6, Subquery Materialization may be used to avoid the repeated execution of subqueries. This implies that the subquery is executed once and the result stored (materialized) in a temporary table. Then, for each row where the subquery was earlier executed, a hash-based look-up into the temporary table will be made instead to check whether there is a match. The EXPLAIN output for Query 18, looks like this in MySQL 5.6:

+----+-------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
| id | select_type | table    | type  | possible_keys                              | key                   | key_len | ref                     | rows    | Extra                           |
+----+-------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
|  1 | PRIMARY     | customer | ALL   | PRIMARY                                    | NULL                  | NULL    | NULL                    |  150000 | Using temporary; Using filesort | 
|  1 | PRIMARY     | orders   | ref   | PRIMARY,i_o_custkey                        | i_o_custkey           | 5       | dbt3.customer.c_custkey |       7 | Using where                     | 
|  1 | PRIMARY     | lineitem | ref   | PRIMARY,i_l_orderkey,i_l_orderkey_quantity | i_l_orderkey_quantity | 4       | dbt3.orders.o_orderkey  |       2 | Using index                     | 
|  2 | SUBQUERY    | lineitem | index | NULL                                       | PRIMARY               | 8       | NULL                    | 6001215 | NULL                            | 
+----+-------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+

The only difference is that select_type of the subquery now is SUBQUERY. In other words, it is no longer dependent on the preceding tables, and can be executed only once. In my run of DBT-3 with MySQL 5.6, Query 18 takes only 6.8 seconds. Hence, we have gone from more than a month to just a few seconds! My colleague Guilhem has earlier written about another DBT-3 query that is improved with Subquery Materialization.