Friday, January 26, 2018

Query Optimizer Takes Data Buffering into Account

Last month I published a blog post at mysqlserverteam.com about how the Query Optimizer in MySQL 8.0 takes advantage of that InnoDB provides buffer estimates per table and index. In that blog post I showed how we got different query plans for Query 8 of the DBT-3 benchmark depending on whether the data was cached InnoDB's buffer pool or not.

In MySQL 5.6, we got a query plan that was well suited for a disk-bound scenario, while MySQL 5.7 switched to a query plan that reduces the execution time by 90% when all data is in memory. However, that came at the expense of longer execution times when data had to be read from disk. In MySQL 8.0 you get the best of both worlds. One query plan will be used when all data is in memory, and another plan will be used when most data need to be fetched from disk.

DBT-3 Query 21

I earlier blogged about another DBT-3 query, Query 21. This query showed a performance regression from MySQL 5.6 to MySQL 5.7 because the query optimizer overestimated the filtering effect of the WHERE clause. In that blog post, I promised that this can be avoided in MySQL 8.0 since you can create histograms to improve the filtering estimates. It turns out that the problem is avoided even without histograms.

In MySQL 5.7 we got this query plan for Query 21:

As discussed in the previous blog post, in MySQL 5.7, the optimizer chooses to start with table orders because it overestimates the filtering on this table. In MySQL 8.0 we still get the above query plan when all the data needs to be read from disk. However, if all data is in memory we get the following query plan:

As you can see from the diagram, the join order has been reversed. Using this new query plan reduces the execution time by 85% when all data is in memory. However, since this plan uses secondary indexes for two of the tables, it is not a good plan when most data has to be read from disk. In fact, if the optimizer had picked this plan in a disk-bound scenario, the query would have been 2.5 times slower than with the plan from 5.7. Hence, by taking into account whether data is in memory or needs to be read from disk, the optimizer is able to find a good plan for Query 21 in both scenarios.

Caveat

One thing to be aware of with this new optimizer behavior, is that running EXPLAIN on a query after it was executed, will not necessarily show the query plan that was used during execution. The previous execution of the query may have changed the content of the buffer pool. Hence, the optimizer may pick a different query plan for the next execution of the query.