Monday, October 3, 2011

Batched Key Access Speeds Up Disk-Bound Join Queries

A new feature in the MySQL 5.6.3 Development Milestone Release is Batched Key Access (BKA). BKA can be applied when an index lookup can be used to execute a join query. One example of such a query is:

mysql> SELECT * FROM customer JOIN orders ON c_custkey = o_custkey;

Given that the customer table is significantly smaller than the orders table, and assuming that we have an index on the o_custkey column of orders, MySQL has traditionally executed this join as follows: scan the entire customer table, and for each row in the customer table, do an index look-up into the orders table to find matching rows. This strategy is known as Index Nested Loops Join, or as MySQL EXPLAIN puts it:
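
Conceptually, the strategy can be sketched in a few lines of Python. This is a simplified illustration with made-up in-memory tables, not MySQL's actual implementation; the dict stands in for the i_o_custkey index:

```python
# Simplified sketch of Index Nested Loops Join (illustration only).
# orders_index plays the role of the i_o_custkey index: it maps a
# customer key to the matching order rows.

def index_nested_loops_join(customer_rows, orders_index):
    result = []
    for customer in customer_rows:                    # scan the outer table
        matches = orders_index.get(customer["c_custkey"], [])
        for order in matches:                         # one index look-up per row
            result.append((customer, order))
    return result

customers = [{"c_custkey": 1}, {"c_custkey": 2}, {"c_custkey": 3}]
orders_index = {1: [{"o_orderkey": 10}],
                2: [{"o_orderkey": 20}, {"o_orderkey": 21}]}
print(len(index_nested_loops_join(customers, orders_index)))  # 3 joined rows
```

The point to notice is the inner loop: each outer row triggers its own index look-up, so for a disk-bound query the look-ups arrive at the storage engine one at a time and in customer order.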

| id | select_type | table    | type | possible_keys | key         | key_len | ref                     | rows   | Extra |
|  1 | SIMPLE      | customer | ALL  | PRIMARY       | NULL        | NULL    | NULL                    | 150000 |       |
|  1 | SIMPLE      | orders   | ref  | i_o_custkey   | i_o_custkey | 5       | dbt3.customer.c_custkey |      7 |       |

If BKA is applied, MySQL will instead buffer the keys to be used for index look-ups in the join buffer, and each time the join buffer is full, it will do a batched look-up on the index. That is, the whole set of keys from the join buffer is sent to the storage engine in one batch. For this purpose, MySQL Server uses the Multi-Range Read interface of the Storage Engine API.
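
The buffering scheme can be sketched as follows. This is a toy model of the idea, not MySQL's code; lookup_batch is a hypothetical stand-in for the storage engine's batched interface:

```python
# Illustrative sketch of Batched Key Access (not MySQL's actual code):
# join keys from the outer table are collected in a buffer, and each full
# buffer is handed to the storage engine as one batched look-up.

def bka_join(customer_rows, lookup_batch, buffer_size=2):
    result, buffered = [], []
    for customer in customer_rows:
        buffered.append(customer)
        if len(buffered) == buffer_size:      # join buffer full: flush it
            result.extend(lookup_batch(buffered))
            buffered = []
    if buffered:                              # final, partially filled batch
        result.extend(lookup_batch(buffered))
    return result

# A stand-in for the storage engine's batched (MRR) interface:
orders_index = {1: [10], 2: [20, 21], 3: [30]}
def lookup_batch(batch):
    return [(c["c_custkey"], o)
            for c in batch for o in orders_index.get(c["c_custkey"], [])]

rows = bka_join([{"c_custkey": k} for k in (1, 2, 3)], lookup_batch)
print(rows)  # [(1, 10), (2, 20), (2, 21), (3, 30)]
```

Because the engine sees a whole batch of keys at once, it is free to reorder the underlying row accesses, which is exactly what DS-MRR (below in the original sense of this post) exploits.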

Multi-Range Read (MRR)

The advantage of requesting a batch of rows from the storage engine, is that the storage engine may be able to deliver the rows more efficiently that way. The MRR interface has for some time been used in MySQL Cluster since it reduces communication between nodes by making it possible to request more than one row per round-trip.

A new feature in MySQL 5.6.2 was the Disk-Sweep Multi-Range Read (DS-MRR) optimization for both InnoDB and MyISAM. When DS-MRR is applied, the storage engine will access the data rows in the order given by the base table instead of in the order given by the index. For a disk-bound query, this has two advantages:
  1. All interesting rows on the same disk page will be accessed together. Hence, one does not risk that a page has been removed from the buffer pool between two accesses to the page.
  2. Even if modern disks/file systems do not give any guarantees, accessing the pages in table order will normally give close to sequential access to the disk. As we know, that should give much better performance than random access. (If multiple sessions access the disks simultaneously, the advantage of sequential access within a session will of course be less significant.)
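
The core of the DS-MRR idea fits in a few lines. Again a hedged sketch: dicts stand in for the secondary index and the base table, and "row id" here means the primary key that locates a row in the clustered index:

```python
# Illustrative DS-MRR sketch: gather the row positions for a whole batch
# of keys first, sort them into table order, and only then fetch the rows.

def ds_mrr_fetch(keys, secondary_index, base_table):
    # secondary_index maps a key to the row ids (e.g. primary keys) of the
    # matching rows; base_table maps a row id to the full data row.
    row_ids = [rid for key in keys for rid in secondary_index.get(key, [])]
    row_ids.sort()            # table order: close to sequential disk access
    return [base_table[rid] for rid in row_ids]

index = {"a": [42, 3], "b": [17]}
table = {3: "row3", 17: "row17", 42: "row42"}
print(ds_mrr_fetch(["a", "b"], index, table))  # rows come back in table order
```

Note that the sort means rows are no longer returned in index (key) order, which is also why BKA can make a later GROUP BY more expensive, as discussed further down.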

Turning On Batched Key Access

While BKA is advantageous in a disk-bound system, it may actually reduce performance in a CPU-bound setting. (More on this below.) Since the MySQL Optimizer currently has no information on whether a table resides in memory or on disk, BKA is off by default.

You can turn on BKA by setting an optimizer_switch flag. Since BKA depends on MRR, the MRR flag also needs to be on (this is the default). In addition, since the cost model for MRR is currently far too conservative, cost-based MRR optimization needs to be turned off for MRR to be used. In other words, the following settings are needed for the Optimizer to consider BKA:

mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';

With BKA turned on, EXPLAIN for the query presented earlier will look like this:

| id | select_type | table    | type | possible_keys | key         | key_len | ref                     | rows   | Extra                                  |
|  1 | SIMPLE      | customer | ALL  | PRIMARY       | NULL        | NULL    | NULL                    | 150000 |                                        |
|  1 | SIMPLE      | orders   | ref  | i_o_custkey   | i_o_custkey | 5       | dbt3.customer.c_custkey |      7 | Using join buffer (Batched Key Access) |

Performance Impact of BKA on DBT-3 Queries

The chart below shows query execution times, with and without BKA, for all the queries of the DBT-3 benchmark where BKA can be used. The size of the database is DBT-3 scale 1 (4 GB), and in order to get disk-bound queries, the size of the InnoDB buffer pool is just 50 MB. We see that, for most of the queries, the run time with BKA is around half the run time without BKA.

Note that Query 17 is an example of a query where it is not beneficial to use BKA even when the query is disk-bound. In that particular case, locality of accesses is very good without BKA, and applying BKA actually spreads accesses to related rows out in time, decreasing the buffer cache hit ratio.

Effects of Increasing the Join Buffer Size

As mentioned above, the size of the batches used by BKA is determined by the size of the join buffer. It follows from the discussion above that the larger the batch, the more efficiently the storage engine may arrange its disk accesses. The results presented above were achieved with the default setting of join_buffer_size (128 kB). We will now look closer at what happens if we increase the size of the join buffer. For this purpose, we will use Query 13 (Customer Distribution Query) from the DBT-3 benchmark:

SELECT c_count, COUNT(*) AS custdist
FROM (SELECT c_custkey, COUNT(o_orderkey) AS c_count
      FROM customer LEFT OUTER JOIN orders ON
        c_custkey = o_custkey AND o_comment NOT LIKE '%express%requests%'
      GROUP BY c_custkey
     ) AS c_orders 
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;

The heart of this query is the join between the customer table and the orders table:

| id | select_type | table      | type  | possible_keys | key           | key_len | ref                     | rows   | Extra                                               |
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL          | NULL    | NULL                    | 1050000 | Using temporary; Using filesort                     |
|  2 | DERIVED     | customer   | index | NULL          | i_c_nationkey | 5       | NULL                    |  150000 | Using index; Using temporary; Using filesort        |
|  2 | DERIVED     | orders     | ref   | i_o_custkey   | i_o_custkey   | 5       | dbt3.customer.c_custkey |       7 | Using where; Using join buffer (Batched Key Access) |

Using BKA for this query, MySQL will fill the join buffer with customer keys and send a batch of keys to the storage engine. The storage engine will find the primary keys for the requested rows in the i_o_custkey index, sort the primary keys, and access the orders table in primary key order. In other words, there will be one pass over the orders table for each batch. The graph below shows the effect of increasing the join buffer on Query 13. (Note the logarithmic scale of the y-axis.) The query execution time goes down from 20 minutes with the default join buffer to less than 10 seconds with a 32 MB join buffer!
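
Since there is one pass over the orders table per batch, the number of passes falls as the join buffer grows. The back-of-the-envelope calculation below illustrates this; the 16 bytes of join buffer space per buffered key is a made-up figure for illustration, not MySQL's real buffer accounting:

```python
import math

def number_of_batches(outer_rows, join_buffer_bytes, bytes_per_key=16):
    # Each batch holds as many keys as fit in the join buffer; each batch
    # causes one pass over the inner table. (bytes_per_key is hypothetical.)
    keys_per_batch = join_buffer_bytes // bytes_per_key
    return math.ceil(outer_rows / keys_per_batch)

# 150,000 customer keys, as in the DBT-3 scale 1 customer table:
print(number_of_batches(150_000, 128 * 1024))        # default 128 kB buffer
print(number_of_batches(150_000, 32 * 1024 * 1024))  # 32 MB: a single pass
```

Under these assumptions, the default buffer needs 19 passes over orders, while a 32 MB buffer holds all customer keys at once, which is consistent with the dramatic speed-up seen in the graph.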

BKA Performance When Queries are CPU-bound

As discussed above, BKA is designed to speed up disk-bound queries. Now we will look at what happens if BKA is used in a system where all the data in the database reside in main memory. The chart below presents the results of running the same DBT-3 queries with an InnoDB buffer pool of 5 GB. While the difference is not very significant for most queries, we see there are a few queries where the impact of BKA is quite negative.

The negative impact of BKA is not just the overhead of sorting the keys before accessing the table. If you look at the execution plan for Query 13 as presented above, it shows that the result of the join needs to be sorted in order to perform the GROUP BY operation of the subquery. This would not have been necessary had BKA not been used: without BKA, the result from the join would have been delivered in customer key order, and grouping could be done without sorting. Hence, in this experiment, the execution time for Query 13 increased from 3.6 to 4.8 seconds when using BKA.

Use with Care

As I have shown, join query execution can benefit significantly from the use of Batched Key Access. However, there are also examples of queries where BKA does more harm than good. Hence, it should be used with care. My advice is to study the BKA performance for your typical queries and to only turn on BKA for queries that are known to benefit from it.

Thursday, May 5, 2011

InnoDB Persistent Statistics Save the Day

In my previous blog posting, I explained how I was able to get more stable query execution times by increasing the amount of sampling used by InnoDB to calculate statistics. However, for my example query, Query 8 of the DBT-3 benchmark, the MySQL Optimizer still toggled between three different indexes to use when accessing one of the 8 tables.

I decided to try out InnoDB Persistent Statistics, one of the new features in the recent MySQL 5.6.2 Development Milestone Release. According to the advertisement, this should give both more accurate and more stable statistics. The improved accuracy is achieved by using a more precise sampling algorithm, while the increased stability is achieved by giving the user full control over when statistics are recalculated. That is, the persistent statistics are not recalculated automatically.

In order to activate persistent statistics, you first need to run a script to create the tables to be used to store the statistics. Then, you can enable the feature by setting the system variable innodb_analyze_is_persistent. For the details, see the MySQL 5.6 Manual.

More Accurate Statistics

To investigate whether the new sampling algorithm gives more accurate statistics, I looked at the average and variance in the estimated cardinality of indexed columns over 100 runs of ANALYZE on the DBT-3 tables. The average of the estimates did not change much, as it is actually pretty spot-on with the old sampling algorithm. It is the large variance that causes the observed inaccuracies.

The below chart shows the coefficient of variation for the estimated cardinality. (Columns that are listed more than once are part of multiple indexes. If a bar in the bar chart is not visible, it means that there was no or very little variance between runs for that column.) The bar chart clearly shows that for most indexes, the sampling algorithm for persistent statistics gives less variance than the old sampling algorithm used with transient statistics. (The number of sampled index pages was 100 in both cases.)
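
For reference, the coefficient of variation is simply the standard deviation divided by the mean, which makes the spread of columns with very different cardinalities comparable:

```python
import statistics

def coefficient_of_variation(samples):
    # CV = standard deviation / mean: a scale-free measure of spread.
    return statistics.pstdev(samples) / statistics.mean(samples)

# Made-up cardinality estimates from repeated ANALYZE runs (illustration):
estimates = [47, 134, 5, 135, 90]
print(round(coefficient_of_variation(estimates), 2))  # 0.61
```

A CV near zero means repeated ANALYZE runs agree closely; a CV above 0.5, as in this made-up example, means the estimates swing by more than half their average value.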

More Stable Statistics

As mentioned above, persistent statistics will be more stable because they are not updated automatically like the old transient statistics. This way, for our performance regression test, we can run ANALYZE once for each table, and all later runs on this database will use the same statistics. Another advantage of being able to control when statistics are recalculated is that recalculation can be scheduled at times when one can afford to use a higher sampling rate. The disadvantage is that it becomes the burden of the DBA to regularly initiate a recalculation so that the statistics do not become outdated.

InnoDB Statistics Tables

The InnoDB statistics tables are ordinary tables that are created in a database called innodb. There are two tables, table_stats and index_stats, which contain per-table and per-index statistics, respectively:

mysql> describe innodb.table_stats;
| Field                    | Type                | Null | Key | Default           | Extra                       |
| database_name            | varchar(64)         | NO   | PRI | NULL              |                             |
| table_name               | varchar(64)         | NO   | PRI | NULL              |                             |
| stats_timestamp          | timestamp           | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| n_rows                   | bigint(20) unsigned | NO   |     | NULL              |                             |
| clustered_index_size     | bigint(20) unsigned | NO   |     | NULL              |                             |
| sum_of_other_index_sizes | bigint(20) unsigned | NO   |     | NULL              |                             |
6 rows in set (0.01 sec)

mysql> describe innodb.index_stats;
| Field            | Type                | Null | Key | Default           | Extra                       |
| database_name    | varchar(64)         | NO   | PRI | NULL              |                             |
| table_name       | varchar(64)         | NO   | PRI | NULL              |                             |
| index_name       | varchar(64)         | NO   | PRI | NULL              |                             |
| stat_timestamp   | timestamp           | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| stat_name        | varchar(64)         | NO   | PRI | NULL              |                             |
| stat_value       | bigint(20) unsigned | NO   |     | NULL              |                             |
| sample_size      | bigint(20) unsigned | YES  |     | NULL              |                             |
| stat_description | varchar(1024)       | NO   |     | NULL              |                             |
8 rows in set (0.00 sec)

table_stats contains one row per table:
mysql> select * from innodb.table_stats where table_name='customer';
| database_name | table_name | stats_timestamp     | n_rows | clustered_index_size | sum_of_other_index_sizes |
| dbt3          | customer   | 2011-04-01 20:53:30 | 149911 |                 1764 |                      225 |

index_stats contains several rows per index, and the column stat_name identifies the type of statistics as described by the content of the stat_description column. The n_diff_pfx rows contain the cardinality statistics: for rows with stat_name = 'n_diff_pfx01', stat_value contains the number of distinct values for the first column of the given index. For rows with stat_name = 'n_diff_pfx02', stat_value contains the number of distinct combinations of the first two columns of the given index, and so on. (Note that InnoDB indexes, in addition to the columns that were specified when the index was created, contain all columns of the primary key.)
mysql> select * from innodb.index_stats where table_name='customer';
| database_name | table_name | index_name    | stat_timestamp      | stat_name    | stat_value | sample_size | stat_description                  |
| dbt3          | customer   | i_c_nationkey | 2011-04-01 11:40:29 | n_diff_pfx01 |         25 |         171 | c_nationkey                       |
| dbt3          | customer   | i_c_nationkey | 2011-04-01 11:40:29 | n_diff_pfx02 |     150000 |         171 | c_nationkey,c_custkey             |
| dbt3          | customer   | i_c_nationkey | 2011-04-01 11:40:29 | n_leaf_pages |        171 |        NULL | Number of leaf pages in the index |
| dbt3          | customer   | i_c_nationkey | 2011-04-01 11:40:29 | size         |        225 |        NULL | Number of pages in the index      |
| dbt3          | customer   | PRIMARY       | 2011-04-01 11:40:29 | n_diff_pfx01 |     149911 |         100 | c_custkey                         |
| dbt3          | customer   | PRIMARY       | 2011-04-01 11:40:29 | n_leaf_pages |       1746 |        NULL | Number of leaf pages in the index |
| dbt3          | customer   | PRIMARY       | 2011-04-01 11:40:29 | size         |       1764 |        NULL | Number of pages in the index      |
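
What the n_diff_pfx numbers measure can be shown with a small sketch. The index rows below are hypothetical, and this is of course not InnoDB's implementation, just the definition of the statistic:

```python
# n_diff_pfxNN is the number of distinct values of the first NN columns
# of an index (remember that InnoDB appends the primary key columns).

def n_diff_pfx(index_rows, prefix_len):
    return len({row[:prefix_len] for row in index_rows})

# Hypothetical rows of the i_c_nationkey index: (c_nationkey, c_custkey).
index_rows = [(7, 1), (7, 2), (9, 3), (9, 4), (12, 5)]
print(n_diff_pfx(index_rows, 1))  # distinct c_nationkey values -> 3
print(n_diff_pfx(index_rows, 2))  # distinct (c_nationkey, c_custkey) -> 5
```

Since c_custkey is the primary key, every (c_nationkey, c_custkey) pair is unique, which is why n_diff_pfx02 for i_c_nationkey equals the row count in the real output above.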

Manually Updating InnoDB Statistics Tables

Since the statistics tables are normal tables, they can be updated like any other MySQL table. For our performance regression tests, we have decided to record our own numbers in these tables. That way, we will be able to get the exact same numbers even if we for some reason need to recreate the statistics tables (e.g., in another database instance).

When we manually update the statistics tables for the DBT-3 benchmark, we first turn on persistent statistics and run ANALYZE once for each table. This inserts all the necessary rows in the statistics tables, and we only need to update the relevant rows. Here is an example UPDATE statement for each table:
UPDATE innodb.table_stats SET n_rows=150000 
  WHERE database_name='dbt3' AND table_name='customer' ;
UPDATE innodb.index_stats SET stat_value=25, sample_size=NULL 
  WHERE database_name='dbt3' AND table_name='customer' 
    AND index_name='i_c_nationkey' AND stat_name='n_diff_pfx01';
By convention, I set index_stats.sample_size to NULL to indicate that the value is recorded and not computed by sampling.

A World of New Possibilities

As discussed above, InnoDB Persistent Statistics will be very useful in order to ensure that the same query execution plan is used every time a query is executed. It also gives users the ability to control when the statistics are recalculated. However, persistent statistics open up even more new possibilities.

We will be able to use this feature to extend our testing of the query optimizer. By "faking" the statistics, we will be able to explore and test the execution of different query plans without having to actually modify the database.

It will also be possible for users to change the statistics in order to force a specific query plan. However, one risks introducing side-effects on other queries so this will have to be done with caution.

Thursday, April 14, 2011

More Stable Query Execution Times by Improving InnoDB Statistics

As part of ensuring that changes to the MySQL Optimizer do not introduce performance regressions, we have started a project to create a new performance regression test suite. One component in this test suite will be the DBT-3 test suite. However, we observed that the execution times for DBT-3 varied so much that, in its present form, it was not usable for detecting performance regressions.

In order to get a better understanding of what was going on, I took a closer look at one of the queries that were run, Query 8. For this particular query, which contains an 8-table join, the execution times varied from 1 minute to 5 hours! Looking at 100 runs of this query, I detected 8 different query execution plans. Four of these plans represented differences in the order in which the tables were joined, while the last four differed from the first four with respect to which indexes were used.

Since the MySQL Optimizer is cost-based, when the execution plans vary, it is usually because the underlying statistics reported by the storage engine vary. One way to investigate the statistics that the optimizer bases its decisions on is to use the SHOW INDEX command:

mysql> show index from customer;
| Table    | Non_unique | Key_name      | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
| customer |          0 | PRIMARY       |            1 | c_custkey   | A         |      150000 |     NULL | NULL   |      | BTREE      |         |               |
| customer |          1 | i_c_nationkey |            1 | c_nationkey | A         |          47 |     NULL | NULL   | YES  | BTREE      |         |               |

In the above example, the customer table has two indexes: a primary index on c_custkey and a secondary index on c_nationkey. The most important thing to note from this output is the estimated number of distinct key values, the cardinality. For the two indexed columns of the customer table, the cardinality is 150,000 and 47, respectively.

The InnoDB statistics are calculated on-the-fly the first time a table is used after the server has been started. The statistics may be automatically recalculated at various times, and ANALYZE TABLE can be used to force a recalculation. For a description of how InnoDB calculates its statistics, see the MySQL 5.5 Reference Manual. The important thing to note is that there is a system variable, innodb_stats_sample_pages, that controls the accuracy of the statistics. It determines the number of index pages that are sampled in order to calculate the statistics. The default value is 8.

Continuing with the example above, I ran ANALYZE on the customer table, and here is what I got:

mysql> analyze table customer;
| Table         | Op      | Msg_type | Msg_text |
| dbt3.customer | analyze | status   | OK       |
1 row in set (0.03 sec)

mysql> show index from customer;
| Table    | Non_unique | Key_name      | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
| customer |          0 | PRIMARY       |            1 | c_custkey   | A         |      150000 |     NULL | NULL   |      | BTREE      |         |               |
| customer |          1 | i_c_nationkey |            1 | c_nationkey | A         |         134 |     NULL | NULL   | YES  | BTREE      |         |               |
2 rows in set (0.01 sec)

As the observant reader has already noticed, the estimated cardinality of the column c_nationkey has changed significantly. Running ANALYZE several times, I saw numbers as low as 5 and as high as 135 for this column.

The question is then whether we can get better and more stable statistics by increasing the setting of innodb_stats_sample_pages. I increased the value to 100, and ran ANALYZE 100 times. This time, the estimated cardinality was between 22 and 84, and 80% of the time it was between 30 and 60.
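
This behavior can be mimicked with a toy model of page sampling. The model below is purely illustrative (it is not InnoDB's actual algorithm): index values are sorted into leaf pages, a few pages are sampled, the value boundaries seen on them are counted, and the count is scaled up to the whole index:

```python
import random
import statistics

def estimate_cardinality(pages, n_sample, rng):
    # For each sampled leaf page, count the distinct values beyond the
    # first (i.e. the value boundaries inside the page), then scale up to
    # all pages. A much-simplified model, not InnoDB's exact algorithm.
    sampled = rng.sample(pages, n_sample)
    new_per_page = sum(max(len(set(page)) - 1, 0) for page in sampled)
    return round(new_per_page * len(pages) / n_sample)

rng = random.Random(42)
# A sorted index over 25 distinct "nation" values: 171 leaf pages, 100 keys each.
values = sorted(rng.randint(1, 25) for _ in range(171 * 100))
pages = [values[i:i + 100] for i in range(0, len(values), 100)]

low = [estimate_cardinality(pages, 8, random.Random(i)) for i in range(10)]
high = [estimate_cardinality(pages, 100, random.Random(i)) for i in range(10)]
print(statistics.pstdev(low), statistics.pstdev(high))
```

With only 8 sampled pages, most samples contain few or none of the 24 value boundaries, so the scaled-up estimates jump around in large steps, much like the 5-to-135 swings observed for c_nationkey; sampling 100 pages narrows the spread considerably.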

So how did this affect the execution plan for query 8? Running the query another 100 times with this new setting of innodb_stats_sample_pages, gave only 3 different execution plans, all with the same join ordering. The remaining difference was in which index was used to access one of the tables. I will discuss how to further reduce the number of plans in a future blog post.

Note that increasing innodb_stats_sample_pages will cause the calculation of statistics to take longer. In my case, sampling 100 pages did not seem to take a noticeable amount of time, and this should normally not cause any problems when ANALYZE TABLE is run explicitly. However, if InnoDB decides to do an automatic recalculation, the increased time may have some unwanted impact. Hence, this variable should be used with care, and its value should not be set higher than necessary to get reasonably accurate estimates.