Thursday, May 5, 2011

InnoDB Persistent Statistics Save the Day

In my previous blog posting, I explained how I was able to get more stable query execution times by increasing the amount of sampling used to by InnoDB to calculate statistics. However, for my example query, Query 8 of the DBT-3 benchmark, the MySQL Optimizer still toggled between three different indexes to use when accessing one of the 8 tables.

I decided to try out InnoDB Persistent Statistics that is one of the new features in the recent MySQL 5.6.2 Development Milestone Release. According to the advertisement, this should give both more accurate and more stable statistics. The improved accuracy is achieved by using a more precise sampling algorithm, while the increased stability is achieved by giving the user full control over when statistics are recalculated. That is, the persistent statistics are not recalculated automatically.

In order to activate persistent statistics, you first need to run a script to create the tables to be used to store the statistics. Then, you can enable the feature by setting the system variable innodb_analyze_is_persistent. For the details, see the MySQL 5.6 Manual.

More Accurate Statistics

To investigate whether the new sampling algorithm gives more accurate statistics, I looked at the average and variance in the estimated cardinality of indexed columns over 100 runs of ANALYZE on the DBT-3 tables. The average of the estimates did not change much as it is actually pretty spot-on with the old sampling algorithm. It is the great variance that causes the observed inaccuracies.

The below chart shows the coefficient of variation for the estimated cardinality. (Columns that are listed more than once, are part of multiple indexes. If a bar in the bar chart is not visible, it means that there was no or very little variance between runs for this column.) The bar chart clearly shows that for most indexes, the sampling algorithm for persistent statistics gives less variance than the old sampling algorithm used with transient statistics. (The number of sampled index pages was 100 in both cases.)


More Stable Statistics

As mentioned above, the persistent statistics will be more stable because it is not updated automatically like the old transient statistics. This way, for our performance regression test, we can run ANALYZE once for each table, and all later runs on this database will use the same statistics. Another advantage of being able to control when statistics are recalculated, is that it can be scheduled at times when one can afford to use a higher sampling rate. The disadvantage is that it becomes the burden of the DBA to make sure to regularly initiate a recalculation of statistics to prevent that the statistics become outdated.

InnoDB Statistics Tables

The InnoDB statistics tables are ordinary tables that are created in a database called innodb. There are two tables: table_stats and index_stats that contain per-table and per-index statistics, respectively:

mysql> describe innodb.table_stats;
+--------------------------+---------------------+------+-----+-------------------+-----------------------------+
| Field                    | Type                | Null | Key | Default           | Extra                       |
+--------------------------+---------------------+------+-----+-------------------+-----------------------------+
| database_name            | varchar(64)         | NO   | PRI | NULL              |                             |
| table_name               | varchar(64)         | NO   | PRI | NULL              |                             |
| stats_timestamp          | timestamp           | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| n_rows                   | bigint(20) unsigned | NO   |     | NULL              |                             |
| clustered_index_size     | bigint(20) unsigned | NO   |     | NULL              |                             |
| sum_of_other_index_sizes | bigint(20) unsigned | NO   |     | NULL              |                             |
+--------------------------+---------------------+------+-----+-------------------+-----------------------------+
6 rows in set (0.01 sec)

mysql> describe innodb.index_stats;
+------------------+---------------------+------+-----+-------------------+-----------------------------+
| Field            | Type                | Null | Key | Default           | Extra                       |
+------------------+---------------------+------+-----+-------------------+-----------------------------+
| database_name    | varchar(64)         | NO   | PRI | NULL              |                             |
| table_name       | varchar(64)         | NO   | PRI | NULL              |                             |
| index_name       | varchar(64)         | NO   | PRI | NULL              |                             |
| stat_timestamp   | timestamp           | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| stat_name        | varchar(64)         | NO   | PRI | NULL              |                             |
| stat_value       | bigint(20) unsigned | NO   |     | NULL              |                             |
| sample_size      | bigint(20) unsigned | YES  |     | NULL              |                             |
| stat_description | varchar(1024)       | NO   |     | NULL              |                             |
+------------------+---------------------+------+-----+-------------------+-----------------------------+
8 rows in set (0.00 sec)

table_stats contains one row per table:
mysql> select * from innodb.table_stats where table_name='customer';
+---------------+------------+---------------------+--------+----------------------+--------------------------+
| database_name | table_name | stats_timestamp     | n_rows | clustered_index_size | sum_of_other_index_sizes |
+---------------+------------+---------------------+--------+----------------------+--------------------------+
| dbt3          | customer   | 2011-04-01 20:53:30 | 149911 |                 1764 |                      225 |
+---------------+------------+---------------------+--------+----------------------+--------------------------+

index_stats contains several rows per index, and the column stat_name identifies the type of statistics as described by the content of the stat_description column. The n_diff_pfx_ rows contain the cardinality statistics: For rows with stat_name = 'n_diff_pfx_01', stat_value contains the number of distinct values for the first column of the given index. For rows with stat_name = 'n_diff_pfx_02', stat_value contains the number of unique combinations of the two first columns of the given index, and so on. (Note that InnoDB indexes, in addition to the columns that were specified when the index was created, contain all columns of the primary key.)
mysql> select * from innodb.index_stats where table_name='customer';
+---------------+------------+---------------+---------------------+--------------+------------+-------------+-----------------------------------+
| database_name | table_name | index_name    | stat_timestamp      | stat_name    | stat_value | sample_size | stat_description                  |
+---------------+------------+---------------+---------------------+--------------+------------+-------------+-----------------------------------+
| dbt3          | customer   | i_c_nationkey | 2011-04-01 11:40:29 | n_diff_pfx01 |         25 |         171 | c_nationkey                       |
| dbt3          | customer   | i_c_nationkey | 2011-04-01 11:40:29 | n_diff_pfx02 |     150000 |         171 | c_nationkey,c_custkey             |
| dbt3          | customer   | i_c_nationkey | 2011-04-01 11:40:29 | n_leaf_pages |        171 |        NULL | Number of leaf pages in the index |
| dbt3          | customer   | i_c_nationkey | 2011-04-01 11:40:29 | size         |        225 |        NULL | Number of pages in the index      |
| dbt3          | customer   | PRIMARY       | 2011-04-01 11:40:29 | n_diff_pfx01 |     149911 |         100 | c_custkey                         |
| dbt3          | customer   | PRIMARY       | 2011-04-01 11:40:29 | n_leaf_pages |       1746 |        NULL | Number of leaf pages in the index |
| dbt3          | customer   | PRIMARY       | 2011-04-01 11:40:29 | size         |       1764 |        NULL | Number of pages in the index      |
+---------------+------------+---------------+---------------------+--------------+------------+-------------+-----------------------------------+

Manually Updating InnoDB Statistics Tables

Since the statistics tables are normal tables, they can be updated like any other MySQL table. For our performance regression tests, we have decided to record our own numbers in these tables. That way, we will be able to get the exact same numbers even if we for some reason need to recreate the statistics tables (e.g, in another database instance).

When we manually update the statistics tables for the DBT-3 benchmarks, we first turn on persistent statistics and run ANALYZE once for each table. This inserts all necessary rows in the statistics tables, and we will only need to update the relevant rows. Here are an example UPDATE statement for each table:
UPDATE innodb.table_stats SET n_rows=150000 
  WHERE database_name='dbt3' AND table_name='customer' ;
UPDATE innodb.index_stats SET stat_value=25, sample_size=NULL 
  WHERE database_name='dbt3' AND table_name='customer' 
    AND index_name='i_c_nationkey' AND stat_name='n_diff_pfx01';
By convention, I set index_stats.sample_size to NULL to indicate that the value is recorded and not computed by sampling.

A World of New Possibilities

As discussed above, InnoDB Persistent Statistics will be very useful in order to ensure that the same query execution plan is used every time a query is executed. It also gives the users the ability to control when the statistics are recalculated. However, persistent statistics opens up for even more new possibilities.

We will be able to use this feature to extend our testing of the query optimizer. By "faking" the statistics, we will be able to explore and test the execution of different query plans without having to actually modify the database.

It will also be possible for users to change the statistics in order to force a specific query plan. However, one risks introducing side-effects on other queries so this will have to be done with caution.