I decided to try out InnoDB Persistent Statistics that is one of the new features in the recent MySQL 5.6.2 Development Milestone Release. According to the advertisement, this should give both more accurate and more stable statistics. The improved accuracy is achieved by using a more precise sampling algorithm, while the increased stability is achieved by giving the user full control over when statistics are recalculated. That is, the persistent statistics are not recalculated automatically.

In order to activate persistent statistics, you first need to run a script to create the tables to be used to store the statistics. Then, you can enable the feature by setting the system variable

`innodb_analyze_is_persistent`. For the details, see the MySQL 5.6 Manual.

**More Accurate Statistics**

To investigate whether the new sampling algorithm gives more accurate statistics, I looked at the average and variance in the estimated cardinality of indexed columns over 100 runs of ANALYZE on the DBT-3 tables. The average of the estimates did not change much as it is actually pretty spot-on with the old sampling algorithm. It is the great variance that causes the observed inaccuracies.

The below chart shows the coefficient of variation for the estimated cardinality. (Columns that are listed more than once, are part of multiple indexes. If a bar in the bar chart is not visible, it means that there was no or very little variance between runs for this column.) The bar chart clearly shows that for most indexes, the sampling algorithm for persistent statistics gives less variance than the old sampling algorithm used with transient statistics. (The number of sampled index pages was 100 in both cases.)

**More Stable Statistics**

As mentioned above, the persistent statistics will be more stable because it is not updated automatically like the old transient statistics. This way, for our performance regression test, we can run ANALYZE once for each table, and all later runs on this database will use the same statistics. Another advantage of being able to control when statistics are recalculated, is that it can be scheduled at times when one can afford to use a higher sampling rate. The disadvantage is that it becomes the burden of the DBA to make sure to regularly initiate a recalculation of statistics to prevent that the statistics become outdated.

**InnoDB Statistics Tables**

The InnoDB statistics tables are ordinary tables that are created in a database called

`innodb`. There are two tables:

`table_stats`and

`index_stats`that contain per-table and per-index statistics, respectively:

```
mysql> describe innodb.table_stats;
+--------------------------+---------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------------+---------------------+------+-----+-------------------+-----------------------------+
| database_name | varchar(64) | NO | PRI | NULL | |
| table_name | varchar(64) | NO | PRI | NULL | |
| stats_timestamp | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| n_rows | bigint(20) unsigned | NO | | NULL | |
| clustered_index_size | bigint(20) unsigned | NO | | NULL | |
| sum_of_other_index_sizes | bigint(20) unsigned | NO | | NULL | |
+--------------------------+---------------------+------+-----+-------------------+-----------------------------+
6 rows in set (0.01 sec)
mysql> describe innodb.index_stats;
+------------------+---------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+------------------+---------------------+------+-----+-------------------+-----------------------------+
| database_name | varchar(64) | NO | PRI | NULL | |
| table_name | varchar(64) | NO | PRI | NULL | |
| index_name | varchar(64) | NO | PRI | NULL | |
| stat_timestamp | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| stat_name | varchar(64) | NO | PRI | NULL | |
| stat_value | bigint(20) unsigned | NO | | NULL | |
| sample_size | bigint(20) unsigned | YES | | NULL | |
| stat_description | varchar(1024) | NO | | NULL | |
+------------------+---------------------+------+-----+-------------------+-----------------------------+
8 rows in set (0.00 sec)
```

`table_stats`contains one row per table:

```
mysql> select * from innodb.table_stats where table_name='customer';
+---------------+------------+---------------------+--------+----------------------+--------------------------+
| database_name | table_name | stats_timestamp | n_rows | clustered_index_size | sum_of_other_index_sizes |
+---------------+------------+---------------------+--------+----------------------+--------------------------+
| dbt3 | customer | 2011-04-01 20:53:30 | 149911 | 1764 | 225 |
+---------------+------------+---------------------+--------+----------------------+--------------------------+
```

`index_stats`contains several rows per index, and the column

`stat_name`identifies the type of statistics as described by the content of the

`stat_description`column. The

`n_diff_pfx_`rows contain the cardinality statistics: For rows with

`stat_name = 'n_diff_pfx_01'`,

`stat_value`contains the number of distinct values for the first column of the given index. For rows with

`stat_name = 'n_diff_pfx_02'`,

`stat_value`contains the number of unique combinations of the two first columns of the given index, and so on. (Note that InnoDB indexes, in addition to the columns that were specified when the index was created, contain all columns of the primary key.)

```
mysql> select * from innodb.index_stats where table_name='customer';
+---------------+------------+---------------+---------------------+--------------+------------+-------------+-----------------------------------+
| database_name | table_name | index_name | stat_timestamp | stat_name | stat_value | sample_size | stat_description |
+---------------+------------+---------------+---------------------+--------------+------------+-------------+-----------------------------------+
| dbt3 | customer | i_c_nationkey | 2011-04-01 11:40:29 | n_diff_pfx01 | 25 | 171 | c_nationkey |
| dbt3 | customer | i_c_nationkey | 2011-04-01 11:40:29 | n_diff_pfx02 | 150000 | 171 | c_nationkey,c_custkey |
| dbt3 | customer | i_c_nationkey | 2011-04-01 11:40:29 | n_leaf_pages | 171 | NULL | Number of leaf pages in the index |
| dbt3 | customer | i_c_nationkey | 2011-04-01 11:40:29 | size | 225 | NULL | Number of pages in the index |
| dbt3 | customer | PRIMARY | 2011-04-01 11:40:29 | n_diff_pfx01 | 149911 | 100 | c_custkey |
| dbt3 | customer | PRIMARY | 2011-04-01 11:40:29 | n_leaf_pages | 1746 | NULL | Number of leaf pages in the index |
| dbt3 | customer | PRIMARY | 2011-04-01 11:40:29 | size | 1764 | NULL | Number of pages in the index |
+---------------+------------+---------------+---------------------+--------------+------------+-------------+-----------------------------------+
```

**Manually Updating InnoDB Statistics Tables**

Since the statistics tables are normal tables, they can be updated like any other MySQL table. For our performance regression tests, we have decided to record our own numbers in these tables. That way, we will be able to get the exact same numbers even if we for some reason need to recreate the statistics tables (e.g, in another database instance).

When we manually update the statistics tables for the DBT-3 benchmarks, we first turn on persistent statistics and run ANALYZE once for each table. This inserts all necessary rows in the statistics tables, and we will only need to update the relevant rows. Here are an example UPDATE statement for each table:

```
UPDATE innodb.table_stats SET n_rows=150000
WHERE database_name='dbt3' AND table_name='customer' ;
UPDATE innodb.index_stats SET stat_value=25, sample_size=NULL
WHERE database_name='dbt3' AND table_name='customer'
AND index_name='i_c_nationkey' AND stat_name='n_diff_pfx01';
```

`index_stats.sample_size`to

`NULL`to indicate that the value is recorded and not computed by sampling.

**A World of New Possibilities**

As discussed above, InnoDB Persistent Statistics will be very useful in order to ensure that the same query execution plan is used every time a query is executed. It also gives the users the ability to control when the statistics are recalculated. However, persistent statistics opens up for even more new possibilities.

We will be able to use this feature to extend our testing of the query optimizer. By "faking" the statistics, we will be able to explore and test the execution of different query plans without having to actually modify the database.

It will also be possible for users to change the statistics in order to force a specific query plan. However, one risks introducing side-effects on other queries so this will have to be done with caution.