Monday, September 29, 2014

Speaking at MySQL Central @ Open World

I am speaking at two sessions at Oracle Open World this week; one tutorial and one hands-on-lab:
I can also recommend the other sessions by members of the MySQL Optimizer Team:
Looking forward to meeting and talking with you at Oracle Open World!

Friday, July 18, 2014

MySQL Webinar: MySQL EXPLAIN, explained

Some time ago, Matt Lord and I delivered a webinar on the MySQL EXPLAIN feature.  This webinar is available for on-demand access here.  Based on the questions we got during the webinar, I want to emphasize that EXPLAIN does not execute the query, it only determines the query plan for the query.  Hence, EXPLAIN will not be able to evaluate how good the chosen query plan actually is.

If you have questions on this topic after listening to this webinar, feel free to ask questions; either as comments on this blog or at the MySQL Optimizer Forum.

You can also access other webinars on MySQL. New webinars will be announced here.

Monday, July 14, 2014

A Multi-Table Trick to Speed up Single-Table UPDATE/DELETE Statements

This post appeared first on mysqlserverteam.com

In MySQL, query optimization of single-table UPDATE/DELETE statements is more limited than for SELECT statements. I guess the main reason for this is to limit the optimizer overhead for very simple statements. However, this also means that optimization opportunities are sometimes missed for more complex UPDATE/DELETE statements.

Example

Using the DBT-3 database, the following SQL statement will increase prices by 10% on parts from suppliers in the specified country:
UPDATE part
SET p_retailprice = p_retailprice*1.10
WHERE p_partkey IN
     (SELECT ps_partkey
      FROM partsupp JOIN supplier
      ON ps_suppkey = s_suppkey
      WHERE s_nationkey = 4);

Visual EXPLAIN in MySQL Workbench shows that the optimizer will choose the following execution plan for this UPDATE statement:

update-subquery

That is, for every row in the part table, MySQL will check if this part is supplied by a supplier of the requested nationality. Consider the following similar SELECT statement:
SELECT * FROM part 
WHERE p_partkey IN 
     (SELECT ps_partkey 
      FROM partsupp JOIN supplier 
      ON ps_suppkey = s_suppkey 
      WHERE s_nationkey = 4);

In MySQL 5.6, the query optimizer will apply semi-join transformation to this query. Hence, the execution plan is quite different from the similar UPDATE statement:

select-semijoin

As you can see, there is no sub-query in this plan. The query has been transformed into a three-way join. The great advantage of this semi-join transformation is that the optimizer is now free to re-arrange the order of the tables to be joined. Instead of having to go through all 179,000 parts, it will now start with the estimated 414 suppliers from the given country and find all parts supplied by them. This is obviously more efficient, and it would be good if MySQL would use the same approach for the UPDATE statement.

The Multi-Table Trick

Unlike single-table UPDATE statements, the MySQL Optimizer will use all available optimizations for multi-table UPDATE statements. This means that by rewriting the query as follows, the semi-join optimizations will apply:
UPDATE part, (SELECT 1) dummy 
SET p_retailprice = p_retailprice*1.10
WHERE p_partkey IN 
     (SELECT ps_partkey 
      FROM partsupp JOIN supplier 
      ON ps_suppkey = s_suppkey 
      WHERE s_nationkey = 4);

Notice the extra dummy table in the first line. Here is what happens when I execute the single-table and multi-table variants on a DBT-3 database (scale factor 1):

mysql> UPDATE part SET p_retailprice = p_retailprice*1.10 WHERE p_partkey IN (SELECT ps_partkey FROM partsupp JOIN supplier ON ps_suppkey = s_suppkey WHERE s_nationkey = 4);
Query OK, 31097 rows affected, 28003 warnings (2.63 sec)
Rows matched: 31097  Changed: 31097  Warnings: 28003

mysql> ROLLBACK;
Query OK, 0 rows affected (0.20 sec)

mysql> UPDATE part, (SELECT 1) dummy SET p_retailprice = p_retailprice*1.10 WHERE p_partkey IN (SELECT ps_partkey FROM partsupp JOIN supplier ON ps_suppkey = s_suppkey WHERE s_nationkey = 4); 
Query OK, 31097 rows affected, 28003 warnings (0.40 sec)
Rows matched: 31097  Changed: 31097  Warnings: 28003

As you can see, execution time is reduced from 2.63 seconds to 0.40 seconds by using this trick. (I had executed both statements several times before, so the reported execution times are for a steady state with all accessed data in memory.)

Multi-Table DELETE

The same trick can be used for DELETE statements. Instead of the single-table variant,
DELETE FROM part
WHERE p_partkey IN
       (SELECT ps_partkey
        FROM partsupp JOIN supplier
        ON ps_suppkey = s_suppkey
        WHERE s_nationkey = 4);

you can use the equivalent multi-table variant:

DELETE part FROM part
WHERE p_partkey IN
       (SELECT ps_partkey
        FROM partsupp JOIN supplier
        ON ps_suppkey = s_suppkey
        WHERE s_nationkey = 4);

This rewrite gives a similar performance improvement as reported for the above UPDATE statement.

Thursday, May 2, 2013

When is a Subquery Executed?

In an earlier blog post, I managed to confuse myself as to when a subquery was executed. It is not very clear from the output of EXPLAIN where the execution of a subquery takes place. Let's take a look at the following example query (Query 17 in the DBT-3 benchmark):

select sum(l_extendedprice) / 7.0 as avg_yearly
from lineitem, part
where p_partkey = l_partkey
  and p_brand = 'Brand#33' and p_container = 'LG CAN'
  and l_quantity < (
 select 0.2 * avg(l_quantity)
 from lineitem
 where l_partkey = p_partkey
);

If you run EXPLAIN on this query, you will see the following execution plan:

+----+--------------------+----------+------+---------------------------------+---------------------+---------+---------------------+---------+-------------+
| id | select_type        | table    | type | possible_keys                   | key                 | key_len | ref                 | rows    | Extra       |
+----+--------------------+----------+------+---------------------------------+---------------------+---------+---------------------+---------+-------------+
|  1 | PRIMARY            | part     | ALL  | PRIMARY                         | NULL                | NULL    | NULL                | 2000000 | Using where |
|  1 | PRIMARY            | lineitem | ref  | i_l_suppkey_partkey,i_l_partkey | i_l_suppkey_partkey | 5       | dbt3.part.p_partkey |      14 | Using where |
|  2 | DEPENDENT SUBQUERY | lineitem | ref  | i_l_suppkey_partkey,i_l_partkey | i_l_suppkey_partkey | 5       | dbt3.part.p_partkey |      14 | NULL        |
+----+--------------------+----------+------+---------------------------------+---------------------+---------+---------------------+---------+-------------+

It is not straightforward to determine from this plan when the subquery will be executed. We see that both tables of the join has "Using where" in the Extra column. In other words, the original WHERE clause has been split between the two tables. The question is: Which part contains the subquery?

In this specific case, the answer is pretty evident when looking closer at the query. The subquery is part of an expression which refers to a column of the lineitem table, l_quantity, and within the subquery, there is also also a reference to p_partkey of the part table. In other words, columns from both tables of the join is needed when executing the subqquery. Hence, the subquery will be part of the where clause attached to the last table of the join.

This means that the above query will, for each row in the part table that satisfies the condition attached to the part table, look up matching rows in the lineitem table, and, for every matching row, execute the subquery. As you probably realize, one will have to know quite a bit about how the optimizer handles WHERE conditions in order to deduce this. But don't despair! An extension to EXPLAIN in MySQL 5.6 make things much clearer.

Structured EXPLAIN

With MySQL 5.6, you can get a structured version of the query plan in JSON format. All you need to do is to write "EXPLAIN FORMAT=JSON" instead of just EXPLAIN. Let's take a look at what the JSON EXPLAIN output for the query discussed above looks like:

{
  "query_block": {
    "select_id": 1,
    "nested_loop": [
      {
        "table": {
          "table_name": "part",
          "access_type": "ALL",
          "possible_keys": [
            "PRIMARY"
          ] /* possible_keys */,
          "rows": 2000000,
          "filtered": 100,
          "attached_condition": "((`dbt3`.`part`.`p_container` = 'LG CAN') and (`dbt3`.`part`.`p_brand` = 'Brand#33'))"
        } /* table */
      },
      {
        "table": {
          "table_name": "lineitem",
          "access_type": "ref",
          "possible_keys": [
            "i_l_suppkey_partkey",
            "i_l_partkey"
          ] /* possible_keys */,
          "key": "i_l_suppkey_partkey",
          "used_key_parts": [
            "l_partkey"
          ] /* used_key_parts */,
          "key_length": "5",
          "ref": [
            "dbt3.part.p_partkey"
          ] /* ref */,
          "rows": 14,
          "filtered": 100,
          "attached_condition": "(`dbt3`.`lineitem`.`l_quantity` < (/* select#2 */ select (0.2 * avg(`dbt3`.`lineitem`.`l_quantity`)) from `dbt3`.`lineitem` where (`dbt3`.`lineitem`.`l_partkey` = `dbt3`.`part`.`p_partkey`)))",
          "attached_subqueries": [
            {
              "dependent": true,
              "cacheable": false,
              "query_block": {
                "select_id": 2,
                "table": {
                  "table_name": "lineitem",
                  "access_type": "ref",
                  "possible_keys": [
                    "i_l_suppkey_partkey",
                    "i_l_partkey"
                  ] /* possible_keys */,
                  "key": "i_l_suppkey_partkey",
                  "used_key_parts": [
                    "l_partkey"
                  ] /* used_key_parts */,
                  "key_length": "5",
                  "ref": [
                    "dbt3.part.p_partkey"
                  ] /* ref */,
                  "rows": 14,
                  "filtered": 100
                } /* table */
              } /* query_block */
            }
          ] /* attached_subqueries */
        } /* table */
      }
    ] /* nested_loop */
  } /* query_block */
}

You probably recognize much of the information from traditional EXPLAIN, but there is also some information that traditional EXPLAIN do not show. The "attached_condition" field shows which parts of the WHERE condition is evaluated at which stage of the query execution. That is, the condition attached to a table will be evaluated for each row that is read from this table. In this specific example, it shows that the subquery is part of the condition attached to the lineitem table, and the "attached_subqueries" field for that table contains the execution plan for the subquery.

Visual EXPLAIN

Structured EXPLAIN, or EXPLAIN in JSON format, is also much more suitable for tools that want to process and display query plans. MySQL Workbench 5.2 uses structural EXPLAIN to visualize MySQL query plans. Below is shown the result of Visual EXPLAIN for the example query used in this blog post:

Note that MySQL Workbench colors the table boxes based on which access method is used. A red box indicates a tables scan, while a green box is used for index look-ups (ref access).

As we all know: "A picture is worth a thousand words", and from the above picture it should be pretty clear when the subquery is executed.

Friday, February 22, 2013

When and How to Take Advantage of New Optimizer Features in MySQL 5.6

A few weeks ago, I made a presentation with the above title at FOSDEM in Brussels. My slides can be found here. The MySQL and Friends devroom had many interesting presentations and the room was full for most of them.

Thursday, September 27, 2012

Speaking at MySQL Connect This Week-end

I will give a talk at MySQL Connect where I present examples of how MySQL 5.6 improves the performance of many of the queries in the DBT-3 benchmark. I will also be giving a brief description of the relevant new optimization techniques and examples of the types of queries that will benefit from these techniques. My presentation is on Sunday (September 30) at 1.15pm.

I will also like to point you to the other presentations made by member of the MySQL Optimizer team:

Olav Sandstå: MySQL Optimizer Overview (Saturday at 11:30am)
Manyi Lu: Overview of New Optimizer Features in MySQL 5.6 (Saturday at 1:00pm)
Evgeny Potemkin: Powerful EXPLAIN in MySQL 5.6 (Sunday at 4:15pm)

We will also be having a BOF on Saturday evening (7:00 am) where we like people to come and give us some input on which query optimizations they would like us to work on for future releases.

Saturday, July 7, 2012

From Months to Seconds with Subquery Materialization


In an earlier blog post, I showed how optimizer improvements in MySQL 5.6 gave better performance for several of the queries in the DBT-3 benchmark.
However, for one of the queries, Query 18, I was not able to give exact numbers for the improvement since the query took very long in MySQL 5.5. I decided to try to find out exactly how long the query would take, but when the query had run for one month, I gave up. How can a query take so long? Especially, when I had set up InnoDB with a buffer pool that should be large enough to hold the entire database. Let's have a look at the query:

select c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)
from customer, orders, lineitem
where o_orderkey in (
                select l_orderkey
                from lineitem
                group by l_orderkey
                having sum(l_quantity) > 313
  )
  and c_custkey = o_custkey
  and o_orderkey = l_orderkey
group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
order by o_totalprice desc, o_orderdate
LIMIT 100;

This query will find the orders from customers that have placed big orders. The reason that this takes so long in MySQL 5.5, is that the subquery in the WHERE clause will be executed for each processed row of the table for which this subquery is part of the WHERE predicate. Let's look at the EXPLAIN output for this query:
+----+--------------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
| id | select_type        | table    | type  | possible_keys                              | key                   | key_len | ref                     | rows    | Extra                           |
+----+--------------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
|  1 | PRIMARY            | customer | ALL   | PRIMARY                                    | NULL                  | NULL    | NULL                    |  150000 | Using temporary; Using filesort | 
|  1 | PRIMARY            | orders   | ref   | PRIMARY,i_o_custkey                        | i_o_custkey           | 5       | dbt3.customer.c_custkey |       7 | Using where                     | 
|  1 | PRIMARY            | lineitem | ref   | PRIMARY,i_l_orderkey,i_l_orderkey_quantity | i_l_orderkey_quantity | 4       | dbt3.orders.o_orderkey  |       2 | Using index                     | 
|  2 | DEPENDENT SUBQUERY | lineitem | index | NULL                                       | PRIMARY               | 8       | NULL                    | 6001215 | NULL                            | 
+----+--------------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+

The select_type for the subquery is DEPENDENT SUBQUERY. Since the left hand side of the IN-expression is a field from the orders table, the subquery will be executed for each row processed from this table. This implies that the index scan of the lineitem table will be performed more than one million times given the rows estimates in the EXPLAIN output (150000*7). I have measured that one execution of the sub-query takes about 3.5 seconds when all the data is in memory. Hence, it will take more than 40 days to execute it one million times.

In MySQL 5.6, Subquery Materialization may be used to avoid the repeated execution of subqueries. This implies that the subquery is executed once and the result stored (materialized) in a temporary table. Then, for each row where the subquery was earlier executed, a hash-based look-up into the temporary table will be made instead to check whether there is a match. The EXPLAIN output for Query 18, looks like this in MySQL 5.6:

+----+-------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
| id | select_type | table    | type  | possible_keys                              | key                   | key_len | ref                     | rows    | Extra                           |
+----+-------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
|  1 | PRIMARY     | customer | ALL   | PRIMARY                                    | NULL                  | NULL    | NULL                    |  150000 | Using temporary; Using filesort | 
|  1 | PRIMARY     | orders   | ref   | PRIMARY,i_o_custkey                        | i_o_custkey           | 5       | dbt3.customer.c_custkey |       7 | Using where                     | 
|  1 | PRIMARY     | lineitem | ref   | PRIMARY,i_l_orderkey,i_l_orderkey_quantity | i_l_orderkey_quantity | 4       | dbt3.orders.o_orderkey  |       2 | Using index                     | 
|  2 | SUBQUERY    | lineitem | index | NULL                                       | PRIMARY               | 8       | NULL                    | 6001215 | NULL                            | 
+----+-------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+

The only difference is that select_type of the subquery now is SUBQUERY. In other words, it is no longer dependent on the preceding tables, and can be executed only once. In my run of DBT-3 with MySQL 5.6, Query 18 takes only 6.8 seconds. Hence, we have gone from more than a month to just a few seconds! My colleague Guilhem has earlier written about another DBT-3 query that is improved with Subquery Materialization.