Wednesday, September 8, 2021

Run ANALYZE TABLE — Do Not Rely on InnoDB's Automatic Recalculation of Statistics

This blog post is inspired by Jesper's recent blog post on how the automatic update of InnoDB persistent statistics may never trigger if servers are restarted frequently. However, the matter is even worse! In this blog post I will show that even when the automatic recalculation is performed, there are no guarantees as to when the query optimizer will see the changes.

Myths about persistent statistics

I must admit that, for a long time, I believed the myths I was told about when updates to InnoDB's index statistics become visible to the query optimizer. There were basically two variants:

  • The updated statistics will only be visible to new connections (i.e., sessions that are started after the statistics were updated).
  • When a new connection accesses the table, the updated statistics will become visible to all connections.
As you may already have guessed, neither of these statements is actually true, and below I will present a small experiment that shows this. The conclusion is that there are really no guarantees as to when the query optimizer will see statistics refreshed by InnoDB, and that you need to run ANALYZE TABLE regularly to make sure that it does. If you are not interested in my proof of this, you may skip to the end of the blog where I provide some recommendations for running ANALYZE TABLE.

When does InnoDB's recalculation of statistics become visible?

We will use the following table for the experiment:
con1> SHOW CREATE TABLE t2\G
*************************** 1. row ***************************
       Table: t2
Create Table: CREATE TABLE `t2` (
  `i` int NOT NULL,
  `j` int DEFAULT NULL,
  PRIMARY KEY (`i`),
  KEY `j` (`j`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
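The table t1 is not shown here; it has the same definition as t2 and holds 100 rows with unique values in both columns. A minimal setup along these lines (my assumption, not the original script) would do:

CREATE TABLE t1 (
  `i` int NOT NULL,
  `j` int DEFAULT NULL,
  PRIMARY KEY (`i`),
  KEY `j` (`j`)
) ENGINE=InnoDB;

-- Fill t1 with 100 rows where both i and j are unique:
INSERT INTO t1
WITH RECURSIVE seq (n) AS (
  SELECT 1 UNION ALL SELECT n + 1 FROM seq WHERE n < 100
)
SELECT n, n FROM seq;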

con1> INSERT INTO t2 SELECT i, j FROM t1;
Query OK, 100 rows affected (0.00 sec)
Records: 100  Duplicates: 0  Warnings: 0
At this point, both tables, t1 and t2, will have 100 rows with unique values for both columns i and j. If we look at the persistent statistics, we see that both of t2's indexes contain 100 unique values:
con1> SELECT index_name, stat_name, stat_value, stat_description
    -> FROM mysql.innodb_index_stats WHERE table_name = 't2';
+------------+--------------+------------+-----------------------------------+
| index_name | stat_name    | stat_value | stat_description                  |
+------------+--------------+------------+-----------------------------------+
| PRIMARY    | n_diff_pfx01 |        100 | i                                 |
| PRIMARY    | n_leaf_pages |          1 | Number of leaf pages in the index |
| PRIMARY    | size         |          1 | Number of pages in the index      |
| j          | n_diff_pfx01 |        100 | j                                 |
| j          | n_diff_pfx02 |        100 | j,i                               |
| j          | n_leaf_pages |          1 | Number of leaf pages in the index |
| j          | size         |          1 | Number of pages in the index      |
+------------+--------------+------------+-----------------------------------+
7 rows in set (0.00 sec)
If we look at the query plan for a join of the two tables, we see that the rows column for t2 contains 1. In other words, for each row in t1, there is one matching row in t2:
con1> EXPLAIN SELECT * FROM t1 STRAIGHT_JOIN t2 ON t1.i=t2.j;
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref       | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | index | PRIMARY       | j    | 5       | NULL      |  100 |   100.00 | Using index |
|  1 | SIMPLE      | t2    | NULL       | ref   | j             | j    | 5       | test.t1.i |    1 |   100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
We will now add 100 more rows to t2 so that there are 2 rows for each value of j:
con1> INSERT INTO t2 SELECT i+100, j FROM t1;
Query OK, 100 rows affected (0.00 sec)
Records: 100  Duplicates: 0  Warnings: 0

con1> SELECT index_name, stat_name, stat_value, stat_description
    -> FROM mysql.innodb_index_stats WHERE table_name = 't2';
+------------+--------------+------------+-----------------------------------+
| index_name | stat_name    | stat_value | stat_description                  |
+------------+--------------+------------+-----------------------------------+
| PRIMARY    | n_diff_pfx01 |        200 | i                                 |
| PRIMARY    | n_leaf_pages |          1 | Number of leaf pages in the index |
| PRIMARY    | size         |          1 | Number of pages in the index      |
| j          | n_diff_pfx01 |        100 | j                                 |
| j          | n_diff_pfx02 |        200 | j,i                               |
| j          | n_leaf_pages |          1 | Number of leaf pages in the index |
| j          | size         |          1 | Number of pages in the index      |
+------------+--------------+------------+-----------------------------------+
7 rows in set (0.00 sec)
We see that the persistent statistics have been updated, but when we look at the query plan, the row estimate for t2 has not changed:
con1> EXPLAIN SELECT * FROM t1 STRAIGHT_JOIN t2 ON t1.i=t2.j;
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref       | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | index | PRIMARY       | j    | 5       | NULL      |  100 |   100.00 | Using index |
|  1 | SIMPLE      | t2    | NULL       | ref   | j             | j    | 5       | test.t1.i |    1 |   100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
The next step is to check what happens if we open another connection to the database:
$dbdeployer use msb_8_0_26
running /home/xxx/sandboxes/msb_8_0_26/ use
/home/xxx/sandboxes/msb_8_0_26/use
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 15
Server version: 8.0.26 MySQL Community Server - GPL

Copyright (c) 2000, 2021, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql [localhost:8026] {msandbox} ((none)) > prompt con2>
PROMPT set to 'con2> '
con2> use test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed

con2> EXPLAIN SELECT * FROM t1 STRAIGHT_JOIN t2 ON t1.i=t2.j;
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref       | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | index | PRIMARY       | j    | 5       | NULL      |  100 |   100.00 | Using index |
|  1 | SIMPLE      | t2    | NULL       | ref   | j             | j    | 5       | test.t1.i |    2 |   100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
We see that the row estimate for t2 has been updated. When we check our first connection, we see that the row estimate has also changed there:
con1> EXPLAIN SELECT * FROM t1 STRAIGHT_JOIN t2 ON t1.i=t2.j;
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref       | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | index | PRIMARY       | j    | 5       | NULL      |  100 |   100.00 | Using index |
|  1 | SIMPLE      | t2    | NULL       | ref   | j             | j    | 5       | test.t1.i |    2 |   100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)

From this experiment it may seem that when a new connection accesses the table, the fresh statistics will be visible to all connections. However, it is not that straightforward. When looking at the source code, we find that all connections share the same cached statistics, but this cache will only be automatically refreshed when a new table object is needed. Since closed table objects are cached for reuse, a query may not need to create a new object. So when we run our join query in the initial connection, we reuse an existing table object, and the cached statistics will not be refreshed.

To improve scalability, the open tables cache is by default partitioned into 16 cache instances. This means that when we connect the second time, our new connection is assigned to a different instance where there are no cached table objects. Hence, a new table object is requested, and the statistics are refreshed in the process. To make all connections share the same open tables cache, we can set the system variable table_open_cache_instances to 1. Note that this is not a dynamic variable, so we will have to set it at startup.
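For example, since the variable is not dynamic, it can be persisted for the next restart with SET PERSIST_ONLY (a sketch; you can of course also put it in the option file instead):

SET PERSIST_ONLY table_open_cache_instances = 1;
-- Then restart the server, e.g. with the RESTART statement if your setup supports it.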

We restart the server with this variable set to 1, and insert 100 more rows:

con1> select @@table_open_cache_instances;
+------------------------------+
| @@table_open_cache_instances |
+------------------------------+
|                            1 |
+------------------------------+
1 row in set (0.00 sec)

con1> INSERT INTO t2 SELECT i+200, j FROM t1;
Query OK, 100 rows affected (0.00 sec)
Records: 100  Duplicates: 0  Warnings: 0

con1> SELECT index_name, stat_name, stat_value, stat_description
    -> FROM mysql.innodb_index_stats WHERE table_name = 't2';
+------------+--------------+------------+-----------------------------------+
| index_name | stat_name    | stat_value | stat_description                  |
+------------+--------------+------------+-----------------------------------+
| PRIMARY    | n_diff_pfx01 |        300 | i                                 |
| PRIMARY    | n_leaf_pages |          1 | Number of leaf pages in the index |
| PRIMARY    | size         |          1 | Number of pages in the index      |
| j          | n_diff_pfx01 |        100 | j                                 |
| j          | n_diff_pfx02 |        300 | j,i                               |
| j          | n_leaf_pages |          1 | Number of leaf pages in the index |
| j          | size         |          1 | Number of pages in the index      |
+------------+--------------+------------+-----------------------------------+
7 rows in set (0.00 sec)
If the cached statistics had been refreshed, we would now see that the row estimate for t2 had increased to 3, but, as expected, it is still 2:
con1> EXPLAIN SELECT * FROM t1 STRAIGHT_JOIN t2 ON t1.i=t2.j;
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref       | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | index | PRIMARY       | j    | 5       | NULL      |  100 |   100.00 | Using index |
|  1 | SIMPLE      | t2    | NULL       | ref   | j             | j    | 5       | test.t1.i |    2 |   100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
This time, when we connect with a second client, the row estimate is 2 here as well:
con2> EXPLAIN SELECT * FROM t1 STRAIGHT_JOIN t2 ON t1.i=t2.j;
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref       | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | index | PRIMARY       | j    | 5       | NULL      |  100 |   100.00 | Using index |
|  1 | SIMPLE      | t2    | NULL       | ref   | j             | j    | 5       | test.t1.i |    2 |   100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
What happened here is that the table object used by the first connection was reused, so the statistics were not updated. However, if we run a query that refers to t2 twice, an extra table object needs to be opened, and the statistics will be refreshed:
con2> EXPLAIN SELECT * FROM t2 AS t1 STRAIGHT_JOIN t2 ON t1.i=t2.j;
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref       | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | index | PRIMARY       | j    | 5       | NULL      |  300 |   100.00 | Using index |
|  1 | SIMPLE      | t2    | NULL       | ref   | j             | j    | 5       | test.t1.i |    3 |   100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)

When a server has been running for a while, there will normally be multiple cached table objects in every cache instance. This means that unless you are close to your previous peak load, no new table objects will be created. In other words, it may take a long time before the query optimizer sees any of the recalculations of index statistics initiated by InnoDB. There is an old bug report that describes this behavior, but it has so far not been fixed.

ANALYZE TABLE

If we run ANALYZE TABLE, the cached statistics will be updated immediately:
con1> INSERT INTO t2 SELECT i+300, j FROM t1;
Query OK, 100 rows affected (0.00 sec)
Records: 100  Duplicates: 0  Warnings: 0

con1> EXPLAIN SELECT * FROM t2 AS t1 STRAIGHT_JOIN t2 ON t1.i=t2.j;
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref       | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | index | PRIMARY       | j    | 5       | NULL      |  300 |   100.00 | Using index |
|  1 | SIMPLE      | t2    | NULL       | ref   | j             | j    | 5       | test.t1.i |    3 |   100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)

con1> ANALYZE TABLE t2;
+---------+---------+----------+----------+
| Table   | Op      | Msg_type | Msg_text |
+---------+---------+----------+----------+
| test.t2 | analyze | status   | OK       |
+---------+---------+----------+----------+
1 row in set (0.00 sec)

con1> EXPLAIN SELECT * FROM t1 STRAIGHT_JOIN t2 ON t1.i=t2.j;
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref       | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | index | PRIMARY       | j    | 5       | NULL      |  100 |   100.00 | Using index |
|  1 | SIMPLE      | t2    | NULL       | ref   | j             | j    | 5       | test.t1.i |    4 |   100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+------+---------+-----------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)

In other words, if you want to guarantee that the query optimizer has fresh statistics, you should run ANALYZE TABLE regularly. Below I will discuss how to best use ANALYZE TABLE.

When Should I run ANALYZE TABLE?

How often you need to run ANALYZE TABLE depends on the type of data in your index. For example, if the average number of orders per day does not change much over time, it is not necessary to refresh the statistics for an index on order_date very often. On the other hand, the same table of orders may have an index on customer_id that will require more frequent updates to the statistics since the number of orders per customer will (hopefully) increase over time. ANALYZE TABLE will update the statistics for all the indexes of a table. Hence, how often an update is needed depends on the index where the average number of rows per value changes most rapidly.

In most cases, running it once per week should be sufficient, but there might be cases where it could be useful to run it more often, so my suggestion is to run it on a daily basis at a time when the load is low. Also, always make sure to run ANALYZE TABLE after bulk inserts/updates/deletes. If you run ANALYZE TABLE on a regular basis, I also suggest you turn off InnoDB's automatic recalculation by setting the system variable innodb_stats_auto_recalc to OFF.
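As a rough sketch of this setup (t2 stands in for whichever of your tables need it; schedule the ANALYZE TABLE statement with whatever scheduler you prefer):

-- Turn off InnoDB's automatic recalculation since we run ANALYZE TABLE ourselves:
SET PERSIST innodb_stats_auto_recalc = OFF;

-- Run daily at a low-load time, and after bulk inserts/updates/deletes:
ANALYZE TABLE test.t2;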

Increase the sampling rate

The number of index pages that are sampled when calculating index statistics is controlled by the system variable innodb_stats_persistent_sample_pages. By default, at most 20 pages are sampled, but this often results in very inaccurate estimates, and I suggest you increase this value to at least 200. If the automatic recalculation is turned off, so that you have control over when the recalculation is performed, it should be OK to set it even higher. On my system, ANALYZE TABLE is able to process 5000 pages per second even when most of the pages have to be read from disk. In other words, if you set innodb_stats_persistent_sample_pages to 1000, you should be able to recalculate the index statistics for 300 indexes in less than a minute!
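For example (a sketch; the global variable is dynamic, and STATS_SAMPLE_PAGES is the per-table equivalent):

-- Raise the default number of sampled pages for all tables:
SET PERSIST innodb_stats_persistent_sample_pages = 200;

-- Or override it for a single, large table and refresh its statistics:
ALTER TABLE test.t2 STATS_SAMPLE_PAGES = 1000;
ANALYZE TABLE test.t2;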

ANALYZE TABLE is no longer blocking

As Jesper wrote in his blog post, it used to be that ANALYZE TABLE would block all new queries until ongoing queries had completed. In order to safely rely on ANALYZE TABLE, you should make sure to use a release where this has been fixed. The issue was first fixed in Percona's releases a few years ago, and at Alibaba we have ported this fix to POLARDB. This year, MySQL 8.0.24 also provided a fix for this issue.

Conclusions

In this blog post I have shown that there is no guarantee that the index statistics are up to date if you solely rely on InnoDB's automatic updates. Instead, you should regularly run ANALYZE TABLE. To ensure more accurate statistics, it is also a good idea to increase the setting for innodb_stats_persistent_sample_pages.

I want to thank my colleague Kaiwang for insights and discussions that helped me understand this issue better.

Wednesday, June 9, 2021

Use Window Functions to Speed Up Correlated Subqueries

In my previous blog post, I discussed different ways to rewrite correlated scalar subqueries to improve performance. In this blog post, I will discuss how we can use window functions for a subset of these queries. Window functions were introduced in MySQL 8.0.

Use case

Often we run queries to find rows within a table with values that are, e.g., smaller than, greater than, or equal to some metric computed over a set of rows from the same table. Some examples are:

  • Find orders for more than 1% of the total revenue for the last week.
  • Find the supplier with the lowest price for a certain part.
  • Find orders for less than 20% of the average ordered quantity of a part.

The latter is a description of TPC-H Q17, which we discussed in my previous blog post:

SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly 
FROM lineitem JOIN part ON p_partkey = l_partkey 
WHERE p_brand = 'Brand#11' AND p_container = 'SM CAN' 
  AND l_quantity < 
      (SELECT 0.2 * AVG(l_quantity) 
       FROM lineitem 
       WHERE l_partkey = p_partkey);

This query refers to the lineitem table twice:

  1. The subquery finds the average ordered quantity of a part
  2. The outer query accesses lineitem to find the price for the parts when less than 20% of the average quantity is ordered.

Rewrite to use window function

If we use window functions, we can compute the average AND collect the interesting information at the same time:

WITH win AS (
    SELECT l_extendedprice, l_quantity,
           AVG(l_quantity) OVER (PARTITION BY p_partkey) avg_l_quantity
    FROM lineitem, part
    WHERE p_partkey = l_partkey
      AND p_brand = 'Brand#11' AND p_container = 'SM CAN'
)	
SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly
FROM win
WHERE l_quantity < 0.2 * avg_l_quantity;
Here the Common Table Expression (CTE) will both fetch the necessary column values and compute the average quantity for each part. Then, the main query can just filter the rows from the CTE where l_quantity is less than 20% of the average.

There are some conditions that need to be satisfied in order to safely apply this transformation:

  • The subquery refers to a subset of the tables in the outer query
  • The subquery uses aggregation and there is no DISTINCT within the aggregation function
  • The aggregation function must have an equivalent window aggregation function
  • If there are multiple tables that are referred to in the FROM clauses of both the subquery and the outer query, they should be joined on the same conditions
  • Any conditions in the WHERE clause should be the same for both subquery and outer query
  • The column(s) from the outer query that are referred to in the subquery, p_partkey in the case of Q17 above, should be unique
  • No ORDER BY or LIMIT in the subquery
There are cases where we could still apply the transformation even if not all of the conditions above are met. However, then there is a risk that the transformed query will be less efficient than the original query.
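To make the pattern more concrete, here is a minimal, made-up example (the customers and orders tables and their columns are hypothetical) that mirrors the structure of Q17: the correlation column, customers.customer_id, is unique, and the subquery refers to a subset (orders) of the outer query's tables:

-- Original: find orders from gold customers that are smaller than
-- 20% of that customer's average order value.
SELECT o.order_id, o.amount
FROM orders o JOIN customers c ON c.customer_id = o.customer_id
WHERE c.segment = 'GOLD'
  AND o.amount < (SELECT 0.2 * AVG(o2.amount)
                  FROM orders o2
                  WHERE o2.customer_id = c.customer_id);

-- Rewritten: compute the per-customer average once with a window function.
WITH win AS (
    SELECT o.order_id, o.amount,
           AVG(o.amount) OVER (PARTITION BY c.customer_id) AS avg_amount
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
    WHERE c.segment = 'GOLD'
)
SELECT order_id, amount
FROM win
WHERE amount < 0.2 * avg_amount;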

Performance

Let us compare the performance of this query to the variants we tested in my previous blog post:

We see that the query that uses window functions is more than 5 times faster than the original query, but it is not faster than the variant that uses a Lateral Derived Table. In fact, it is around 2.5% slower.

How can it be slower when it accesses the lineitem table only once, while the query that uses LATERAL accesses the table twice? The answer is that there is some overhead associated with the window function. Here is a flame graph for the query that uses a Lateral Derived Table:

The left-most and right-most stacks represent the index look-ups into the lineitem table. Notice that the stack to the left is somewhat narrower. As discussed in the previous blog post, this is because there will be fewer cache misses on the second access.

Compare this to the flame graph for the query that uses a window function:

In this case there is only one stack for the function RefIterator<false>::Read, but there is a new stack to the right which represents the calculation by the window function. In other words, we save some, and we lose some, and overall the performance is about the same for both query variants.

TPC-H Query 2

Query 2 of the TPC-H benchmark can also be rewritten to use a window function. This query finds information on the suppliers in Europe with the lowest cost for certain parts:

SELECT s_acctbal, s_name, n_name, p_partkey, p_mfgr, 
       s_address, s_phone, s_comment
FROM part, supplier, partsupp, nation, region
WHERE p_partkey = ps_partkey AND s_suppkey = ps_suppkey
  AND p_size = 27 AND p_type LIKE '%BRASS'
  AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey
  AND r_name = 'EUROPE'
  AND ps_supplycost = (
      SELECT MIN(ps_supplycost)
	  FROM partsupp, supplier, nation, region
	  WHERE p_partkey = ps_partkey AND s_suppkey = ps_suppkey 
        AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey
	    AND r_name = 'EUROPE'
  )
ORDER BY s_acctbal DESC, n_name, s_name, p_partkey
LIMIT 100;
 
In this query there are multiple tables that are referred to in both query blocks. The rewritten query that uses a window function looks like this:
WITH win AS (
    SELECT s_acctbal, s_name, n_name, p_partkey, p_mfgr, 
           s_address, s_phone, s_comment, ps_supplycost,
           MIN(ps_supplycost) OVER (PARTITION BY p_partkey) min_cost
    FROM partsupp, supplier, nation, region, part
    WHERE p_partkey = ps_partkey AND s_suppkey = ps_suppkey 
      AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey
      AND p_size = 27 AND p_type LIKE '%BRASS' AND r_name = 'EUROPE'
)    
SELECT s_acctbal, s_name, n_name, p_partkey, p_mfgr,
       s_address, s_phone, s_comment
FROM win
WHERE ps_supplycost = min_cost
ORDER BY s_acctbal DESC, n_name, s_name, p_partkey
LIMIT 100;

Here is a comparison of the execution time for the different possible query transformation for Q2:

For this query, using a lateral derived table is almost 20% slower than using a window function. This time, both the variant with table push-down and the one with a window function are about 40% faster than the original query.

Conclusions

In this blog post I have shown that we can improve query performance by rewriting queries with correlated scalar subqueries to use window functions. For both example queries presented here, the performance when using a window function is on par with the best alternative way to rewrite the query. Also, each of the other alternatives is slower than using a window function for at least one of the queries.

At Alibaba, we have implemented the query transformation to window functions in our Innovative Version of POLARDB for MySQL. Unlike the subquery transformation introduced in MySQL 8.0.24, which is off by default since it may often significantly reduce the query performance, our transformation can be applied automatically. As long as the query satisfies the conditions listed above, it should not be any slower than the original query. In other words, with POLARDB you will no longer have to manually rewrite your query to get the benefits discussed in this blog post.

Friday, May 7, 2021

Transformations of Correlated Scalar Subqueries

MySQL 8.0.24 introduces transformation of correlated scalar subqueries to derived tables. In my presentation in the MySQL devroom at FOSDEM earlier this year, I showed how we can manually rewrite such queries. The automatic transformation in MySQL 8.0.24 is off by default, and in this blog post, I will discuss when it makes sense to turn it on. We will also look at how this automatic transformation compares to manually rewritten queries.

Example query

I will use the same query as in my FOSDEM presentation:

SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly 
FROM lineitem JOIN part ON p_partkey = l_partkey 
WHERE p_mfgr = 'Manufacturer#1' 
  AND l_quantity < 
      (SELECT 0.2 * AVG(l_quantity) 
       FROM lineitem 
       WHERE l_partkey = p_partkey);
This query uses a scalar subquery to compute the average quantity that is ordered for a part, and this is used to find the average price per year of small orders (orders for less than 20% of the average quantity). The schema used for this query is from the TPC-H benchmark.

Here is the default query plan for this query as shown by EXPLAIN:

We see that the subquery is a dependent subquery. This means that the subquery will be executed for each combination of rows from the part and lineitem tables that satisfies the condition on p_mfgr. According to EXPLAIN, there are on average 29 orders per part, so this means that the subquery will compute the same value 29 times for each part! This does not sound like an optimal plan!

Enabling the query transformation

We can enable the query transformation in MySQL 8.0.24 by turning on the optimizer switch subquery_to_derived:
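For example, for the current session (a sketch; the switch could also be set globally, or per statement with a SET_VAR hint):

SET optimizer_switch = 'subquery_to_derived=on';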

EXPLAIN now shows that the subquery has been converted to a derived table. This means that the subquery is first materialized in a temporary table, and this table is joined with the tables of the outer query. In other words, we will now only compute the average quantity for each part once. The transformed version of the query is equivalent to the following query:
WITH pq(avg_qty, pk) AS (SELECT 0.2 * AVG(l_quantity), l_partkey 
                         FROM lineitem GROUP BY l_partkey)
SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly 
FROM lineitem JOIN part ON p_partkey = l_partkey 
              JOIN pq ON pq.pk = p_partkey 
WHERE p_mfgr = 'Manufacturer#1' AND l_quantity < pq.avg_qty;
(Note that I, for readability, have used a Common Table Expression (CTE) instead of a derived table, but in this context they will be handled the same way.)

If we compare the performance for the two queries, using a TPC-H scale factor 10 database, we see that the transformed query runs in 5.5 minutes, while the original query takes almost 22 minutes. So the transformation makes the query 4 times faster in this particular case.

Note that while we now only compute the average quantity for each part once, we will compute the average for many parts that we are not interested in since we are only looking for parts from a specific manufacturer. This means that whether the query transformation will speed up the query or not will depend on how much we save by computing each value only once compared to the extra work needed to compute values we do not need. In this case, 20% of the parts were from the specified manufacturer, so the total work was reduced by applying the query transformation.

A slightly different query

Let's see what happens if we make the condition on the part table more selective:

SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly 
FROM lineitem JOIN part ON p_partkey = l_partkey 
WHERE p_brand = 'Brand#11' AND p_container = 'SM CAN' 
  AND l_quantity < 
      (SELECT 0.2 * AVG(l_quantity) 
       FROM lineitem 
       WHERE l_partkey = p_partkey);
This query is Query 17 of the TPC-H benchmark, and here only 0.1% of the rows in the part table satisfy the WHERE condition. In this case, the original query takes 7.3 seconds while the transformed query takes 4.5 minutes. So here we would definitely prefer that the original query was used, and this explains why this transformation is not enabled by default.

Table push-down

As we have seen above, the transformation from scalar subquery to derived table would have been more efficient if we could make sure that the computations of the scalar value are only made for the rows that are relevant to the query. This could be achieved if the conditions from the outer query could be applied in the derived table. For Q17, since the conditions are on the part table, this would require that we could add the part table to the derived table. Since the subquery is correlated with the part table, we can do this and use the correlated condition as the join condition. Note that we can only add the table to the derived table if the correlation between the table and the scalar subquery is on a unique column. Otherwise, the join in the derived table may generate additional rows that may skew the aggregated value[1]. In our case, since the correlation is on p_partkey which is the primary key of part, it is safe to add part to the derived table. In addition, since the only information the outer query needs from the part table is p_partkey, we can remove part from the outer query:

WITH pq(avg_qty, pk) AS 
         (SELECT 0.2 * AVG(l_quantity), l_partkey 
          FROM lineitem JOIN part ON p_partkey = l_partkey 
          WHERE p_brand = 'Brand#11' AND p_container = 'SM CAN'
          GROUP BY l_partkey) 
SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly 
FROM lineitem JOIN pq ON pq.pk = l_partkey 
WHERE l_quantity < pq.avg_qty;  
This query runs in 1.4 seconds, more than 5 times faster than the original query. The query plan is as follows:

The steps of this query plan are:

  1. Full table scan of part
  2. For each row in part that satisfies the conditions on part (estimated 0.1%):
    1. Find matching rows in lineitem using the i_l_partkey index
    2. Compute the average quantity for matching rows
    3. Add the result to the temporary table <derived2>
  3. Scan the temporary table <derived2>
  4. For each row in <derived2>:
    1. Find matching rows in lineitem using the i_l_partkey index
    2. Add l_extendedprice to the total sum for rows that satisfy the condition on l_quantity
As we can see from step 2, we will now only do index look-ups in lineitem and compute averages for "interesting" parts.

Lateral derived table

MySQL 8.0.14 added support for lateral derived tables. A lateral derived table differs from an ordinary derived table in that it can refer to columns of other tables in the FROM clause. Rewriting Query 17 to use a lateral derived table is straightforward:

SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly 
FROM lineitem JOIN part ON p_partkey = l_partkey 
JOIN LATERAL (SELECT 0.2 * AVG(l_quantity) 
              FROM lineitem 
              WHERE l_partkey = p_partkey) ldt(avg_qty) 
WHERE p_brand = 'Brand#11' AND p_container = 'SM CAN' 
  AND l_quantity < avg_qty;

This query gets the following query plan:

This plan is basically the same as the query plan for the original query with one exception: We will evaluate the conditions on part before we join with lineitem. This is possible because by using a lateral derived table, we have decoupled the subquery from the condition on l_quantity. One thing to note about the EXPLAIN output is "Rematerialize (<derived2>)" on the first line. This means that for each row from part that qualifies, we will clear the temporary table, and rerun the correlated subquery. In other words, for this query, there will only be one row at a time in the temporary table.

For Query 17, using a lateral derived table is slightly faster than an ordinary derived table with table push-down (1.3 versus 1.4 seconds). For our first variant with a less selective WHERE condition, the difference is more significant (99 seconds versus 120 seconds). Why is the plan with a lateral derived table so much faster when both plans will access the same number of rows? An inspection with perf indicates that the final look-up into lineitem is more efficient for the query that uses a lateral derived table. This is probably because, in that case, we complete all query operations for one part before moving on to the next. The second look-up for a given part into the lineitem table will then be done right after the first. It is likely that this access pattern will result in fewer CPU cache misses.

Summary

The below diagram sums up the performance for the variants of the two queries that I have used in this blog post. Q17 is the original TPC-H Q17, while Q17big is the variant with a less selective WHERE clause. (Note that a logarithmic scale is used for the vertical axis to be able to present the results from the queries in the same diagram.)

While the query transformation introduced in MySQL 8.0.24 gives a good improvement for Q17big, it is disastrous for Q17. As we have seen, the usefulness of this query transformation depends on the selectivity of the WHERE clause. This automatic transformation will probably not be very useful to users unless the query optimizer is able to make a cost-based decision whether to use it or not. Also, for both queries there are manual rewrites that are much better. In particular, the use of lateral derived tables looks very promising, and, as we have seen, it is pretty straightforward to rewrite a query to use a lateral derived table instead of a scalar correlated subquery.


[1] If the column is not unique, semi-join could be used. See Dreseler et al., "Quantifying TPC-H Choke Points and Their Optimizations" for details.

Thursday, June 25, 2020

Memory Usage for MySQL Hash Join

At Percona Live Online my colleague at Alibaba, Jim Tommaney, presented a study of hash join performance in MySQL and Marwho?iaDB. For some of the queries, the performance characteristics for MySQL seemed a bit strange, and I decided to take a closer look at what was going on. As you can read below, I found that the crucial point was the memory usage of hash join in MySQL.

Results presented at Percona Live ONLINE

In Jim's presentation, you can see the following graphs for the performance of his query 2.2 on MySQL and Marwho?iaDB:


We see that for Marwho?iaDB, the performance scales pretty linearly with increasing selectivity factor, while for MySQL there is a "step-wise" increase. The query is run on a scale factor 10 DBT-3 database, and looks like this:

SELECT /*+ NO_INDEX(orders) NO_INDEX(lineitem) */ COUNT(*)
FROM orders
JOIN lineitem ON l_orderkey = o_orderkey
WHERE o_custkey <= 1500000 * @selectivity;

We are joining two tables, orders with 15 million rows and lineitem with 60 million rows. The WHERE condition is on the smaller table, and the query is run with varying selectivity factors. For hash join to be used, we have to disable the indexes on the join keys. Otherwise, the query optimizer would choose to use indexed nested loop join. (I use the new optimizer hints introduced in MySQL 8.0.20 for this.)

The impact of join buffer size

The way hash join works is that it will build a hash table of one of the inputs, the build input, which is usually the smaller of the two inputs, and scan through the other input, the probe input, using the hash table to find matches. In this case, the rows from the orders table that satisfy the WHERE condition will be inserted into the hash table. This means that a larger selectivity factor will require more memory for the hash table. In MySQL, the amount of memory available for the hash table is determined by the session variable join_buffer_size. If the hash table is larger than the join buffer, MySQL will split both inputs into chunks that are stored on disk, and then apply the in-memory hash join algorithm to each pair of chunks. For more on how hash join is implemented in MySQL, I recommend Erik Frøseth's great presentation at this year's FOSDEM.
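To try this yourself, a sketch could look like the following (the 512 MB value and the 0.1 selectivity factor are just examples); with EXPLAIN FORMAT=TREE you can verify that a hash join is actually used:

-- Give the session a 512 MB join buffer:
SET SESSION join_buffer_size = 536870912;

-- Check the plan; the tree output should contain an "Inner hash join" node:
EXPLAIN FORMAT=TREE
SELECT /*+ NO_INDEX(orders) NO_INDEX(lineitem) */ COUNT(*)
FROM orders
JOIN lineitem ON l_orderkey = o_orderkey
WHERE o_custkey <= 1500000 * 0.1;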

In Jim's experiments, the size of the join buffer was 512 MB. This size was chosen in order for Marwho?iaDB to be able to use hash join for the queries. Marwho?iaDB does not support disk-based hash join, so the join buffer needs to be large enough to hold the entire build input. My first experiment was to run Jim's query 2.2 with different sizes for the join buffer. The results are displayed in this graph:

We see that if we use a join buffer of 2 GB, we get linear scalability for this query with MySQL, too. So it seems that MySQL requires more memory than Marwho?iaDB for the same query. I have identified 3 reasons for this:

  1. Only approx. 2/3 of the join buffer will be used for in-memory hash join. This happens because MySQL will switch from in-memory to hybrid hash join when the whole join buffer has been allocated, not when it has been filled up. I have filed Bug#99933 for this issue.
  2. MySQL will add columns to the join buffer that are not needed for the rest of the query evaluation. In our example query, the column o_custkey is only needed for the evaluation of the WHERE condition. This condition will be evaluated before the rows of the orders table are inserted into the hash table. Hence, this column is not needed for the hash join or afterwards and could be dropped. I have filed Bug#99934 for this issue.
  3. The hash table implementation requires more space. MySQL uses std::unordered_multimap to implement the hash table. This requires more space than the "home-made" table of linked lists used by Marwho?iaDB. Using the debugger, I found that, for this query, MySQL allocates 72 bytes per row, of which 48 bytes are for the hash table entry, 16 bytes for the two columns, and 8 bytes for the hash value of the join key. The latter is used as the key of the multimap. (I do not think Marwho?iaDB stores the hash value in the join buffer; it will just be used to identify which of the linked lists to check for a match.) Note that this query is a bit extreme since it actually only needs to store one integer column per row in the hash table. For most queries, one will typically need to store more columns in the join buffer: columns for grouping, ordering, comparisons or joins with other tables, in addition to the columns of the query result. So for a typical query the relative overhead of the hash table will be significantly smaller.

What is the ideal size for the join buffer?

Of course, if you have enough memory, you will get the best performance if you can store the entire build input in the join buffer. However, for large tables with much relevant data per row, this may not be feasible. A query may also need multiple active join buffers at the same time, e.g., building a new hash table while probing the previous one, and multiple queries may run concurrently. What our test shows is that if in-memory hash join is not possible, a medium-sized join buffer is often better than a larger one. Looking at this graph, we see that for hybrid hash join, a 256 MB join buffer performs much worse than many smaller sizes:

I have not looked into the details of why, but my guess is that you will get better performance from hash join if the join buffer can fit in the CPU cache. From the above graph, it seems that for this particular query, on this particular machine, a 16 MB join buffer will give the best general performance for hybrid hash join.

Conclusions

We have seen that MySQL hash join has some inefficiencies with respect to memory usage. Still, hash join provides a speed-up over nested loops join for many queries. There are more examples of this in Jim's presentation. Unlike Marwho?iaDB, MySQL supports using disk to split up the hash join. That way, it is possible to efficiently join large tables without using much extra memory.

Saturday, October 12, 2019

MySQL Parallel Index Range Scans: One is Sometimes Better Than Two

As presented at the Percona Live Conference in Austin in May, at Alibaba we are working on adding support for parallel query execution to POLARDB for MySQL. As discussed in the presentation, we observe that parallel range scans on secondary indexes do not scale very well in MySQL. This is old news. Mark Callaghan reported this several years ago, and this does not seem to have changed. In this blog post, I will investigate what effect using a multi-processor machine has on the scalability of parallel index range scans.

I will run a variant of Query 6 of the TPC-H/DBT-3 benchmark. While the original Query 6 sums up the revenue for a whole year, my version will only compute it over a single month:

SELECT SUM(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= '1997-01-01'
  AND l_shipdate < DATE_ADD('1997-01-01', INTERVAL '1' MONTH)
  AND l_discount BETWEEN 0.05 - 0.01 AND 0.05 + 0.01
  AND l_quantity < 25;

This query was run by multiple parallel threads using sysbench on MySQL 8.0.17. For each query execution, a random month was picked. The machine used has 2 processors with 16 physical cores each, so with hyper-threading there are in total 64 virtual CPUs. I used a DBT-3 scale factor 1 database, and with a 4 GB InnoDB buffer pool, all data was cached in memory. MySQL will use the secondary index on l_shipdate when executing this query. For comparison, I also ran the same query when forcing MySQL to use a table scan instead of an index range scan. The results are presented in this graph:

The throughput when using the secondary index is of course higher than for table scan, since we only need to read data for 1 month instead of all 84 months when using a full table scan. However, we see that while the table scan scales almost linearly up to 32 threads, this is not the case for the index range scan. For 32 threads, the throughput with table scans is more than 30 times higher than with 1 thread. For index range scans, the increase is only about 7.8x and 9.5x for 32 and 64 threads, respectively.

We can see that for table scans, there is no benefit from hyper-threading; the maximum throughput is reached with one thread per physical core. In other words, each thread doing a table scan is able to execute without any significant interruptions. For index range scans, on the other hand, we see that we get higher throughput with 64 threads than with 32 threads. This is an indication that the threads are regularly stalled, and there is a benefit from having other threads that can be scheduled when this happens.

So why are the threads stalled? I think the most likely reason is that there is a synchronization bottleneck related to non-covering secondary index scans. Further investigations are needed to understand where, but my current guess is that it is related to accessing the root page of the B-tree when looking up the row corresponding to the index entry. (I have turned off the adaptive hash index (AHI) when running these tests, so all primary key look-ups will have to go through the root page. When using the AHI, the scaling seems to be even worse, but that is a different story ...)
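For reference, the AHI is controlled by a dynamic system variable, so it can be turned off like this:

SET GLOBAL innodb_adaptive_hash_index = OFF;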

When there is a lot of thread synchronization, running on multiple processors may increase our problems since the state of the mutex, or whatever is used for synchronization, will have to be synchronized between the caches of the CPUs. To investigate what effect this has in our case, I will use Resource Groups to make MySQL use only one processor for this query. First, I will create a resource group, cpu1, that contains the virtual CPUs of processor 1:

CREATE RESOURCE GROUP cpu1 TYPE=user VCPU=16-31,48-63;

To make our query use this resource group, we add a hint to the query:

SELECT /*+ RESOURCE_GROUP(cpu1) */ 
       SUM(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= '1997-01-01'
  AND l_shipdate < DATE_ADD('1997-01-01', INTERVAL '1' MONTH)
  AND l_discount BETWEEN 0.05 - 0.01 AND 0.05 + 0.01
  AND l_quantity < 25;

We repeat the sysbench runs with this query, and compare the results for index range scans:

We see that using only the cores of one processor increases the performance. The maximum throughput is increased by almost 30% by using one processor instead of two! The "scale factors" are now 10.3x and 12.6x for 16 and 32 threads, respectively. Much better, but still a bit away from perfect scaling. Hence, there is still a need for further investigations. Stay tuned!

Tuesday, May 28, 2019

My presentations at Percona Live

Percona Live 2019 is starting in Austin today.  You may already have read that I have a presentation on how to use Optimizer Trace to understand the inner workings of the MySQL Query Optimizer: The MySQL Query Optimizer Explained Through Optimizer Trace on Wednesday at 2:55pm.
I hope to see many of you there!

Together with my colleague Benny Wang, I will also be presenting the current work we are doing at Alibaba to add parallel query processing to PolarDB for MySQL. If you want to learn more about our work, come to the presentation on Thursday at 11:00am.

Many of my other colleagues at Alibaba have presentations on the MySQL-related work we do for Alibaba Cloud. Please check out the AliSQL & POLARDB track on Thursday!

Wednesday, January 30, 2019

Inspecting the Content of a MySQL Histogram

FOSDEM is coming up. I do not have a presentation in the MySQL and Friends devroom this year, but it reminded me that I had planned to post a follow-up to my presentation from last year.

As part of the presentation, I showed how you can inspect the content of a histogram using the information schema table column_statistics. For example, the following query will show the content of the histogram for the column l_linenumber in the table lineitem of the dbt3_sf1 database:

SELECT JSON_PRETTY(histogram)
FROM information_schema.column_statistics
WHERE schema_name = 'dbt3_sf1'
 AND table_name ='lineitem'
 AND column_name = 'l_linenumber';

The histogram is stored as a JSON document:

{
 "buckets": [[1, 0.24994938524948698], [2, 0.46421066400720523],
 [3, 0.6427401784471978], [4, 0.7855470933802572],
 [5, 0.8927398868395817], [6, 0.96423707532558], [7, 1] ],
 "data-type": "int",
 "null-values": 0.0,
 "collation-id": 8,
 "last-updated": "2018-02-03 21:05:21.690872",
 "sampling-rate": 0.20829115437457252,
 "histogram-type": "singleton",
 "number-of-buckets-specified": 1024
}

The distribution of values can be found in the buckets array of the JSON document. In the above case, the histogram type is singleton. That means that each bucket contains the frequency of a single value. For the other type of histogram, equi-height, each bucket will contain the minimum and maximum value for the range covered by the bucket. The frequency value recorded is the cumulative frequency. That is, it gives the frequency of values smaller than or equal to the maximum value of the bucket. In the example above, 64.27% of the values in the l_linenumber column are less than or equal to 3.

In other words, if you have created a histogram for a column, you can query the information schema table to get estimates on column values. This will normally be much quicker than getting an exact result by querying the actual table.
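For reference, the histogram inspected above could have been created with a statement like the following (1024 buckets matches the number-of-buckets-specified field in the JSON document):

ANALYZE TABLE lineitem UPDATE HISTOGRAM ON l_linenumber WITH 1024 BUCKETS;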

As discussed in my FOSDEM presentation, string values are base64 encoded in the histogram. At the time of the presentation, using MySQL 8.0.11, it was a bit complicated to decode these string values. However, from MySQL 8.0.12 on, this has become simpler. As stated in the release notes for MySQL 8.0.12:

The JSON_TABLE() function now automatically decodes base-64 values and prints them using the character set given by the column specification.

JSON_TABLE is a table function that will convert a JSON array to a relational table with one row per element of the array. We can use JSON_TABLE to extract the buckets of the histogram into a relational table:

SELECT v value, c cumulfreq
FROM information_schema.column_statistics,
     JSON_TABLE(histogram->'$.buckets', '$[*]'
                COLUMNS(v VARCHAR(60) PATH '$[0]',
                        c double PATH '$[1]')) hist
WHERE schema_name = 'dbt3_sf1'
  AND table_name ='orders'
  AND column_name = 'o_orderstatus';
Running the above query on my DBT3 database, I get the following result:
+-------+---------------------+
| value | cumulfreq           |
+-------+---------------------+
| F     | 0.48544670343055835 |
| O     |  0.9743427900693199 |
| P     |                   1 |
+-------+---------------------+

The above gives the cumulative frequencies. Normally, I would rather see the actual frequency of each value, and to get that I need to subtract the value of the previous row. We can use a window function to do that:

mysql> SELECT v value, c cumulfreq,  c - LAG(c, 1, 0) OVER () freq
    -> FROM information_schema.column_statistics,
    ->      JSON_TABLE(histogram->'$.buckets', '$[*]'
    ->          COLUMNS(v VARCHAR(60) PATH '$[0]',
    ->                    c double PATH '$[1]')) hist
    -> WHERE schema_name = 'dbt3_sf1'
    ->   AND table_name ='orders'
    ->   AND column_name = 'o_orderstatus';
+-------+---------------------+----------------------+
| value | cumulfreq           | freq                 |
+-------+---------------------+----------------------+
| F     | 0.48544670343055835 |  0.48544670343055835 |
| O     |  0.9743427900693199 |  0.48889608663876155 |
| P     |                   1 | 0.025657209930680103 |
+-------+---------------------+----------------------+
3 rows in set (0.00 sec)

So by combining three new features in MySQL 8.0 (histograms, JSON_TABLE, and window functions), I am able to quickly get an estimate for the frequencies of the possible values for my column.