In my previous blog post, I showed an example of how the MySQL Optimizer found a better join order by taking into account the filtering effects of conditions. I also explained that for non-indexed columns the filtering estimate is just a guess, and that there is a risk for non-optimal query plans if the guess is off.
We have received a few bug reports on performance regressions when upgrading from 5.6 to 5.7 that are caused by the optimizer overestimating the filtering effect. In most cases, the cause of the regression is inaccurate filtering estimates for equality conditions on non-indexed columns with low cardinality. In this blog post, I will discuss three ways to handle such regressions:
- Create an index
- Use an optimizer hint to change the join order
- Disable condition filtering
Example: DBT-3 Query 21
We will look at Query 21 in the DBT-3 benchmark:
SELECT s_name, COUNT(*) AS numwait
FROM supplier
JOIN lineitem l1 ON s_suppkey = l1.l_suppkey
JOIN orders ON o_orderkey = l1.l_orderkey
JOIN nation ON s_nationkey = n_nationkey
WHERE o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (SELECT * FROM lineitem l2
WHERE l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey)
AND NOT EXISTS (SELECT * FROM lineitem l3
WHERE l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate > l3.l_commitdate)
AND n_name = 'JAPAN'
GROUP BY s_name ORDER BY numwait DESC, s_name LIMIT 100;
Query 21 is called Suppliers Who Kept Orders Waiting Query. In MySQL 5.7, Visual EXPLAIN shows the following query plan for Query 21:
The four tables of the join are joined from left-to-right, starting
with a full table scan of the orders
table. There are
also two dependent subqueries on the lineitem
table that
will be executed for each row of the outer lineitem
table. The execution time for this query plan is almost 25 seconds on
a scale factor 1 DBT-3 database. This is more than ten times as long
as the query plan used in MySQL 5.6!
The filtered column of tabular EXPLAIN shows the optimizer's estimates for the condition filter effects (some of the columns have been removed to save space):
id | select_type | table | type | key | rows | filtered | Extra |
---|---|---|---|---|---|---|---|
1 | PRIMARY | orders | ALL | NULL | 1500000 | 10.00 | Using where; Using temporary; Using filesort |
1 | PRIMARY | l1 | ref | PRIMARY | 4 | 33.33 | Using where |
1 | PRIMARY | supplier | eq_ref | PRIMARY | 1 | 100.00 | Using index condition |
1 | PRIMARY | nation | ALL | NULL | 25 | 4.00 | Using where; Using join buffer (Block Nested Loop) |
3 | DEPENDENT SUBQUERY | l3 | ref | PRIMARY | 4 | 30.00 | Using where |
2 | DEPENDENT SUBQUERY | l2 | ref | PRIMARY | 4 | 90.00 | Using where |
This shows that the optimizer assumes that the condition
o_orderstatus = 'F'
is satisfied by 10% of the rows in
the orders
table. Hence, the optimizer thinks that it
will be possible to filter out a lot of orders early by starting with
the orders
table. However, the truth is that almost 50%
of the rows have the requested order status. In other words, by
overestimating the filtering effect for orders
, query
plans that start with the orders
table will appear to be
less costly than is actually the case.
We will now look at how we can influence the optimizer to pick a better query plan for this query.
Option 1: Create an Index
As mentioned, the optimizer does not have any statistics on
non-indexed columns. So one way to improve the optimizer's precision
is to create an index on the column. For Query 21, since the filtering
estimate for o_orderstatus
is way off, we can try to see
what happens if we create an index on this column:
CREATE INDEX i_o_orderstatus ON orders(o_orderstatus);
id | select_type | table | type | key | rows | filtered | Extra |
---|---|---|---|---|---|---|---|
1 | PRIMARY | nation | ALL | NULL | 25 | 10.00 | Using where; Using temporary; Using filesort |
1 | PRIMARY | supplier | ref | i_s_nationkey | 400 | 100.00 | NULL |
1 | PRIMARY | l1 | ref | i_l_suppkey | 600 | 33.33 | Using where |
1 | PRIMARY | orders | eq_ref | PRIMARY | 1 | 50.00 | Using where |
3 | DEPENDENT SUBQUERY | l3 | ref | PRIMARY | 4 | 30.00 | Using where |
2 | DEPENDENT SUBQUERY | l2 | ref | PRIMARY | 4 | 90.00 | Using where |
We see from the EXPLAIN output that the estimated filtering effect for
orders
is now 50%. Given that, the optimizer prefers a
different join order, starting with the nation
table. This is the same join order as one got in MySQL 5.6, and the
execution time with this plan is 2.5 seconds. Instead of accessing 50% of all
orders, the query will now just access orders for suppliers in Japan.
However, this improvement comes at the cost of having to maintain an
index that will probably never be used!
Looking at Query 21, there is also an equality condition on another column
without an index; n_name
of the nation
table. For this column, 10% is actually a too high estimate. There
are 25 nations in the table. Hence, the correct estimate should be
4%. What if we, instead, create an index on this column?
DROP INDEX i_o_orderstatus ON orders;
CREATE INDEX i_n_name ON nation(n_name);
id | select_type | table | type | key | rows | filtered | Extra |
---|---|---|---|---|---|---|---|
1 | PRIMARY | nation | ref | i_n_name | 1 | 100.00 | Using index; Using temporary; Using filesort |
1 | PRIMARY | supplier | ref | i_s_nationkey | 400 | 100.00 | NULL |
1 | PRIMARY | l1 | ref | i_l_suppkey | 600 | 33.33 | Using where |
1 | PRIMARY | orders | eq_ref | PRIMARY | 1 | 10.00 | Using where |
3 | DEPENDENT SUBQUERY | l3 | ref | PRIMARY | 4 | 30.00 | Using where |
2 | DEPENDENT SUBQUERY | l2 | ref | PRIMARY | 4 | 90.00 | Using where |
In this case, our new index is actually used! Since scanning a table with 25 rows takes a neglible part of the total execution time, the savings for Query 21 are insignificant, but there might be other queries where such an index could be more useful.
Option 2: Join Order Hint
Instead of trying to improve statistics to get a better query plan, we
can use hints to influence the optimizer's choice of query plan. The
STRAIGHT_JOIN
hint can be used to change the join order.
It comes in two flavors:
STRAIGHT_JOIN
right afterSELECT
STRAIGHT_JOIN
used as a join operator
SELECT STRAIGHT_JOIN … FROM t1, t2, t3
Joins tables in the order specified in FROM clause (i.e., t1
→ t2 → t3
) … FROM t1 STRAIGHT_JOIN t2 …
Left-hand table should be processed before right-hand table(i.e.,
… t1 → … → t2 …
)
nation
should be processed before orders
:
SELECT s_name, COUNT(*) AS numwait
FROM supplier
JOIN lineitem l1 ON s_suppkey = l1.l_suppkey
JOIN nation ON s_nationkey = n_nationkey
STRAIGHT_JOIN orders ON o_orderkey = l1.l_orderkey
WHERE o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (SELECT * FROM lineitem l2
WHERE l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey)
AND NOT EXISTS (SELECT * FROM lineitem l3
WHERE l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate > l3.l_commitdate)
AND n_name = 'JAPAN'
GROUP BY s_name ORDER BY numwait DESC, s_name LIMIT 100;
nation
comes before orders
, and the
resulting query plan is the "good one":
id | select_type | table | type | key | rows | filtered | Extra |
---|---|---|---|---|---|---|---|
1 | PRIMARY | nation | ALL | NULL | 25 | 10.00 | Using where; Using temporary; Using filesort |
1 | PRIMARY | supplier | ref | i_s_nationkey | 400 | 100.00 | NULL |
1 | PRIMARY | l1 | ref | i_l_suppkey | 600 | 33.33 | Using where |
1 | PRIMARY | orders | eq_ref | PRIMARY | 1 | 10.00 | Using where |
3 | DEPENDENT SUBQUERY | l3 | ref | PRIMARY | 4 | 30.00 | Using where |
2 | DEPENDENT SUBQUERY | l2 | ref | PRIMARY | 4 | 90.00 | Using where |
In order to user STRAIGHT_JOIN
we had to rearrange the
tables in the FROM
clause. This is a bit cumbersome, and
to avoid this, we have in MySQL 8.0 introduced new join order hints
that uses the new optimizer hint syntax. Using this syntax, we can
add hints right after SELECT
and avoid editing the rest
of the query. In the case of Query 21, we can add hints like
SELECT /*+ JOIN_PREFIX(nation) */ …
or
SELECT /*+ JOIN_ORDER(nation, orders)
*/ …
to achieve the desired query plan.
Option 3: Disable Condition Filtering
Many optimizer features can be disabled by setting theoptimizer_switch
variable. The following statement will
make the optimizer not use condition filtering estimates:
SET optimizer_switch='condition_fanout_filter=off';
id | select_type | table | type | key | rows | filtered | Extra |
---|---|---|---|---|---|---|---|
1 | PRIMARY | nation | ALL | NULL | 25 | 100.00 | Using where; Using temporary; Using filesort |
1 | PRIMARY | supplier | ref | i_s_nationkey | 400 | 100.00 | NULL |
1 | PRIMARY | l1 | ref | i_l_suppkey | 600 | 100.00 | Using where |
1 | PRIMARY | orders | eq_ref | PRIMARY | 1 | 100.00 | Using where |
3 | DEPENDENT SUBQUERY | l3 | ref | PRIMARY | 4 | 100.00 | Using where |
2 | DEPENDENT SUBQUERY | l2 | ref | PRIMARY | 4 | 100.00 | Using where |
Note that you can set optimizer_switch
at session level.
Hence, it is possible to disable condition filtering for
individual queries. However, this requires extra round-trips to the
server to set optimizer_switch
before and after the
execution of the query.
(Option 4: Wait for Histograms)
We are working to improve the statistics available to the optimizer by introducing histograms. A histogram provides more detailed information about the data distribution in a table column. With histograms, the optimizer will be able to estimate pretty accurately the filtering effects also for conditions on non-indexed columns. Until then, you will have to resort to one of options presented above to improve bad query plans caused by inaccurate filtering estimates.