Øystein on MySQL Optimizer: Transformations of Correlated Scalar Subqueries

MySQL 8.0.24 introduces transformation of correlated scalar subqueries to derived tables. In my presentation in the MySQL devroom at FOSDEM earlier this year, I showed how we can manually rewrite such queries. The automatic transformation in MySQL 8.0.24 is off by default, and in this blog post, I will discuss when it makes sense to turn it on. We will also look at how this automatic transformation compares to manual rewritten queries.

Example query

I will use the same query as in my FOSDEM presentation:

SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly 
FROM lineitem JOIN part ON p_partkey = l_partkey 
WHERE p_mfgr = 'Manufacturer#1' 
  AND l_quantity < 
      (SELECT 0.2 * AVG(l_quantity) 
       FROM lineitem 
       WHERE l_partkey = p_partkey);

This query uses a scalar subquery to compute the average quantity that is ordered for a part, and this is used to find the average price per year of small orders (orders for less than 20% of the average quantity). The schema used for this query is from the TPC-H benchmark.

Here is the default query plan for this query as shown by EXPLAIN:

We see that the subquery is a dependent subquery. This means that the subquery will be executed for each combination of rows from part and lineitem tables that satisfies the condition on p_mfgr. According to EXPLAIN, there is on average 29 orders per part, so this means that the subquery will compute the same value 29 times for each part! This does not sound like an optimal plan!

Enabling the query transformation

We can enable the query transformation in MySQL 8.0.24 by turning on the optimizer switch subquery_to_derived:

EXPLAIN now shows that the subquery has been converted to a derived table. This means that the subquery is first materialized in a temporary table, and this table is joined with the tables of the outer query. In other words, we will now only compute the average quantity for each part once. The transformed version of the query is equivalent to the following query:

WITH pq(avg_qty, pk) AS (SELECT 0.2 * AVG(l_quantity), l_partkey 
                         FROM lineitem GROUP BY l_partkey)
SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly 
FROM lineitem JOIN part ON p_partkey = l_partkey 
              JOIN pq ON pq.pk = p_partkey 
WHERE p_mfgr = 'Manufacturer#1' AND l_quantity < pq.avg_qty;

(Note that I, for readability, have used a Common Table Expression (CTE) instead of a derived table, but in this context they will be handled the same way.)

If we compare the performance for the two queries, using a TPC-H scale factor 10 database, we see that the transformed query runs in 5.5 minutes, while the original query takes almost 22 minutes. So the transformation makes the query 4 times faster in this particular case.

Note that while we now only compute the average quantity for each part once, we will compute the average for many parts that we are not interested in since we are only looking for parts from a specific manufacturer. This means that whether the query transformation will speed up the query or not, will depend on how much we save by computing each value only once compared to the extra work needed to compute values we do not need. In this case, 20% of the parts were from the specified manufacturer so the total work were reduced by applying the query transformation.

A slightly different query

Let's see what happens if we make the condition on the part table more selective:

SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly 
FROM lineitem JOIN part ON p_partkey = l_partkey 
WHERE p_brand = 'Brand#11' AND p_container = 'SM CAN' 
  AND l_quantity < 
      (SELECT 0.2 * AVG(l_quantity) 
       FROM lineitem 
       WHERE l_partkey = p_partkey);

This query is Query 17 of the TPC-H benchmark, and here only 0.1% of the rows in the part table satisfies the WHERE condition. In this case, the original query takes 7.3 seconds while the transformed query takes 4.5 minutes. So here we would definitely prefer that the original query was used, and this explains why this transformation is not enabled by default.

Table push-down

As we have seen above, the transformation from scalar subquery to derived table would have been more efficient if we could make sure that the computations of the scalar value are only made for the rows that are relevant to the query. This could be achieved if the conditions from the outer query could be applied in the derived table. For Q17, since the conditions are on the part table, this would require that we could add the part table to the derived table. Since the subquery is correlated with the part table, we can do this and use the correlated condition as the join condition. Note that we can only add the table to the derived table if the correlation between the table and the scalar subquery is on a unique column. Otherwise, the join in the derived table may generate additional rows that may skew the aggregated value^[1]. In our case, since the correlation is on p_partkey which is the primary key of part, it is safe to add part to the derived table. In addition, since the only information the outer query needs from the part table is p_partkey, we can remove part from the outer query:

WITH pq(avg_qty, pk) AS 
         (SELECT 0.2 * AVG(l_quantity), l_partkey 
          FROM lineitem JOIN part ON p_partkey = l_partkey 
          WHERE p_brand = 'Brand#11' AND p_container = 'SM CAN'
          GROUP BY l_partkey) 
SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly 
FROM lineitem JOIN pq ON pq.pk = l_partkey 
WHERE l_quantity < pq.avg_qty;

This query runs in 1.4 seconds, more than 5 times faster than the original query. The query plan is as follows:

The steps of this query plan is:

Full table scan of part
For each row in part that satisfy the conditions on part (estimated 0.1%):
1. Find matching rows in lineitem using the i_l_partkey index
2. Compute the average quantity for matching rows
3. Add the result to the temporary table <derived2>
Scan the temporary table <derived2>
For each row in the <derived2>:
1. Find matching rows in lineitem using the i_l_partkey index
2. Add l_extendedprice to the total sum for rows that satisfy the condition on l_quantity

As we can see from step 2, we will now only do index look-ups in lineitem and compute averages for "interesting" parts.

Lateral derived table

MySQL 8.0.14 added support for lateral derived tables. A lateral derived table differs from an ordinary derived table in that it can refer to columns of other tables in the FROM clause. Rewriting query 17 to use a lateral derived table is straight-forward:

SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly 
FROM lineitem JOIN part ON p_partkey = l_partkey 
JOIN LATERAL (SELECT 0.2 * AVG(l_quantity) 
              FROM lineitem 
              WHERE l_partkey = p_partkey) ldt(avg_qty) 
WHERE p_brand = 'Brand#11' AND p_container = 'SM CAN' 
  AND l_quantity < avg_qty;

This query gets the following query plan:

This plan is basically the same as the query plan for the original query with one exception: We will evaluate the conditions on part before we join with lineitem. This is possible because by using a lateral derived table, we have decoupled the subquery from the condition on l_quantity. One thing to note about the EXPLAIN output is "Rematerialize (<derived2>)" on the first line. This means that for each row from part that qualifies, we will clear the temporary table, and rerun the correlated subquery. In other words, for this query, there will only be one row at a time in the temporary table.

For Query 17, using a lateral derived table is slightly faster than an ordinary derived table with table push-down (1.3 versus 1.4 seconds). For our first variant with a less selective WHERE condition, the difference is more significant (99 seconds versus 120 seconds). Why is the plan with a lateral derived table so much faster when both plans will access the same number of rows? An inspection with perf indicates that the final look-up into lineitem is more efficient for the query that use a lateral derived table. This is probably because we in that case will complete all query operations for one part before moving on to the next. The second look-up for a given part into the lineitem table will then be done right after the first. It is likely that this access pattern will result in less CPU cache misses.

Summary

The below diagram sums up the performance for the variants of the two queries that I have used in this blog post. Q17 is the original TPC-H Q17, while Q17big is the variant with a less selective WHERE clause. (Note that a logarithmic scale is used for the vertical axis to be able to present the results from the queries in the same diagram.)

While the query transformation introduced in MySQL 8.0, gives a good improvement for Q17big, it is disastrous for Q17. As we have seen, the usefulness of this query transformation depends on the selectivity of the WHERE clause. This automatic transformation will probably not be very useful to users unless the query optimizer is able to do a cost-based decision whether to use it or not. Also, for both queries there are manual rewrites that are much better. Particular, the use of lateral derived tables looks very promising, and, as we have seen, it is pretty straight-forward to rewrite a query to use a lateral derived table instead of a scalar correlated subquery.

^[1]If the column is not unique, semi-join could be used. See Dreseler et. al., "Quantifying TPC-H Choke Points and Their Optimizations" for details.

Øystein on MySQL Optimizer

Friday, May 7, 2021

Transformations of Correlated Scalar Subqueries