Tuesday, May 28, 2019

My presentations at Percona Live

Percona Live 2019 is starting in Austin today.  You may already have read that I have a presentation on how to use Optimizer Trace to understand the inner workings of the MySQL Query Optimizer: The MySQL Query Optimizer Explained Through Optimizer Trace on Wednesday at 2:55pm.
I hope to see many of you there!

Together with my colleague Benny Wang, I will also be presenting the current work we do at Alibaba to add parallel query processing to PolarDB for MySQL.  If you want to learn more about our work, come to the presentation on Thursday at 11:00am.

Many other of my colleagues at Alibaba have  presentations on the MySQL related work we do for Alibaba Cloud.  Please, check out the AliSQL & POLARDB track on Thursday!

Wednesday, January 30, 2019

Inspecting the Content of a MySQL Histogram

FOSDEM is coming up. I do not have a presentation in the MySQL and Friends devroom this year, but it reminded me that I had planned to post a follow-up to my presentation from last year.

As part of the presentation, I showed how you can inspect the content of a histogram using the information schema table column_statistics. For example, the following query will show the content of the histogram for the column l_linenumber in the table lineitem of the dbt3_sf1 database:

SELECT JSON_PRETTY(histogram)
FROM information_schema.column_statistics
WHERE schema_name = 'dbt3_sf1'
 AND table_name ='lineitem'
 AND column_name = 'l_linenumber';

The histogram is stored as a JSON document:

{
 "buckets": [[1, 0.24994938524948698], [2, 0.46421066400720523],
 [3, 0.6427401784471978], [4, 0.7855470933802572],
 [5, 0.8927398868395817], [6, 0.96423707532558], [7, 1] ],
 "data-type": "int",
 "null-values": 0.0,
 "collation-id": 8,
 "last-updated": "2018-02-03 21:05:21.690872",
 "sampling-rate": 0.20829115437457252,
 "histogram-type": "singleton",
 "number-of-buckets-specified": 1024
}

The distribution of values can be found in the buckets array of the JSON document. In the above case, the histogram type is singleton. That means that each bucket contains the frequency of a single value. For the other type of histogram, equi-height, each bucket will contain the minimum and maximum value for the range covered by the bucket. The frequency value recorded, is the cumulative frequency. That is, it gives the frequency of values smaller than the maximum value of the bucket. In the example above, 64.27% of the values in the l_linenumber column is less than or equal to 3.

In other words, if you have created a histogram for a column, you can query the information schema table to get estimates on column values. This will normally be much quicker than to get an exact result by querying the actual table.

As discussed in my FOSDEM presentation, string values are base64 encoded in the histogram. At the time of the presentation, using MySQL 8.0.11, it was a bit complicated to decode these string values. However, from MySQl 8.0.12 on, this has become simpler. As stated in the release notes for MySQL 8.0.12:

The JSON_TABLE() function now automatically decodes base-64 values and prints them using the character set given by the column specification.

JSON_TABLE is a table function that will convert a JSON array to a relational table with one row per element of the array. We can use JSON_TABLE to extract the buckets of the histogram into a relational table:

SELECT v value, c cumulfreq
FROM information_schema.column_statistics,
     JSON_TABLE(histogram->'$.buckets', '$[*]'
                COLUMNS(v VARCHAR(60) PATH '$[0]',
                        c double PATH '$[1]')) hist
WHERE schema_name = 'dbt3_sf1'
  AND table_name ='orders'
  AND column_name = 'o_orderstatus';
Running the above query on my DBT3 database, I get the following result:
+-------+---------------------+
| value | cumulfreq           |
+-------+---------------------+
| F     | 0.48544670343055835 |
| O     |  0.9743427900693199 |
| P     |                   1 |
+-------+---------------------+

The above gives the cumulative frequencies. Normally, I would rather want to see the actual frequencies of each value, and to get that I will need to subtract the value of the previous row. We can use a window function to do that:

mysql> SELECT v value, c cumulfreq,  c - LAG(c, 1, 0) OVER () freq
    -> FROM information_schema.column_statistics,
    ->      JSON_TABLE(histogram->'$.buckets', '$[*]'
    ->          COLUMNS(v VARCHAR(60) PATH '$[0]',
    ->                    c double PATH '$[1]')) hist
    -> WHERE schema_name = 'dbt3_sf1'
    ->   AND table_name ='orders'
    ->   AND column_name = 'o_orderstatus';
+-------+---------------------+----------------------+
| value | cumulfreq           | freq                 |
+-------+---------------------+----------------------+
| F     | 0.48544670343055835 |  0.48544670343055835 |
| O     |  0.9743427900693199 |  0.48889608663876155 |
| P     |                   1 | 0.025657209930680103 |
+-------+---------------------+----------------------+
3 rows in set (0.00 sec)

So by combining three new features in MySQL 8.0, histogram, JSON_TABLE, and window functions, I am able to quickly get an estimate for the frequencies of the possible values for my column.

Wednesday, February 28, 2018

FOSDEM: Histogram support in MySQL 8.0


The slides and video from my FOSDEM presentation on histogram support in MySQL can now be downloaded.  Check it out if you want to learn more about how you can create histograms to improve query plans in MySQL 8.0.

The presentation shows you how to create histograms over column values, 
presents the types of histograms you can create, and discusses best practices for using histograms.

Prior to MySQL 8.0, the query optimizer lacked statistics on columns that are not indexed.  With histograms the query optimizer will be able to make more accurate selectivity estimates for conditions on such columns.   With better estimates, the query optimizer can produce better query plans, especially for join queries.  Check out the presentation for a couple of examples of join queries that runs faster due to histograms!


Thursday, February 1, 2018

Speaking at FOSDEM

I will be speaking at FOSDEM the coming Sunday (February 4) on Histogram support in MySQL 8.0. If you are at FOSDEM, and want to learn about how you can use histograms to improve your query execution plans, visit the MySQL and Friends devroom at 11:10.

Also, please, checkout the entire program for the MySQL devroom.  It is full of interesting talks.

Friday, January 26, 2018

Query Optimizer Takes Data Buffering into Account

Last month I published a blog post at mysqlserverteam.com about how the Query Optimizer in MySQL 8.0 takes advantage of that InnoDB provides buffer estimates per table and index. In that blog post I showed how we got different query plans for Query 8 of the DBT-3 benchmark depending on whether the data was cached InnoDB's buffer pool or not.

In MySQL 5.6, we got a query plan that was well suited for a disk-bound scenario, while MySQL 5.7 switched to a query plan that reduces the execution time by 90% when all data is in memory. However, that came at the expense of longer execution times when data had to be read from disk. In MySQL 8.0 you get the best of both worlds. One query plan will be used when all data is in memory, and another plan will be used when most data need to be fetched from disk.

DBT-3 Query 21

I earlier blogged about another DBT-3 query, Query 21. This query showed a performance regression from MySQL 5.6 to MySQL 5.7 because the query optimizer overestimated the filtering effect of the WHERE clause. In that blog post, I promised that this can be avoided in MySQL 8.0 since you can create histograms to improve the filtering estimates. It turns out that the problem is avoided even without histograms.

In MySQL 5.7 we got this query plan for Query 21:

As discussed in the previous blog post, in MySQL 5.7, the optimizer chooses to start with table orders because it overestimates the filtering on this table. In MySQL 8.0 we still get the above query plan when all the data needs to be read from disk. However, if all data is in memory we get the following query plan:

As you can see from the diagram, the join order has been reversed. Using this new query plan reduces the execution time by 85% when all data is in memory. However, since this plan uses secondary indexes for two of the tables, it is not a good plan when most data has to be read from disk. In fact, if the optimizer had picked this plan in a disk-bound scenario, the query would have been 2.5 times slower than with the plan from 5.7. Hence, by taking into account whether data is in memory or needs to be read from disk, the optimizer is able to find a good plan for Query 21 in both scenarios.

Caveat

One thing to be aware of with this new optimizer behavior, is that running EXPLAIN on a query after it was executed, will not necessarily show the query plan that was used during execution. The previous execution of the query may have changed the content of the buffer pool. Hence, the optimizer may pick a different query plan for the next execution of the query.

Friday, November 24, 2017

Going Beyond Tabular EXPLAIN

A while ago, Lukas Eder posted a very interesting article on query optimizations that do not depend on the cost model. We often call such optimizations query transformations since they can be applied by rewriting the query.

In his blog post, Lukas investigated how different database systems handle different opportunities for query transformations. For MySQL, he complained that in some cases, the output from EXPLAIN is not sufficient to tell what is going on. However, as I will show in this blog post, there are other ways to get the information that he was looking for.

The EXPLAIN Warning

What may easily be overlooked when executing EXPLAIN for a query, is that it literally comes with a warning. This warning shows the query after the query transformations have been applied. Let's look at the query Lukas used to explore Transitive Closure:

SELECT first_name, last_name, film_id
FROM actor a
JOIN film_actor fa ON a.actor_id = fa.actor_id
WHERE a.actor_id = 1;

If we run EXPLAIN for this query, and display the associated warning, we see:

mysql> EXPLAIN SELECT first_name, last_name, film_id FROM actor a JOIN film_actor fa ON a.actor_id = fa.actor_id WHERE a.actor_id = 1;
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref   | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
|  1 | SIMPLE      | a     | NULL       | const | PRIMARY       | PRIMARY | 2       | const |    1 |   100.00 | NULL        |
|  1 | SIMPLE      | fa    | NULL       | ref   | PRIMARY       | PRIMARY | 2       | const |   19 |   100.00 | Using index |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
2 rows in set, 1 warning (0,00 sec)

mysql> show warnings;
+-------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                                                                                                                       |
+-------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select 'PENELOPE' AS `first_name`,'GUINESS' AS `last_name`,`sakila`.`fa`.`film_id` AS `film_id` from `sakila`.`actor` `a` join `sakila`.`film_actor` `fa` where (`sakila`.`fa`.`actor_id` = 1) |
+-------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0,00 sec)

If you scroll to the right, you will see that the warning contains: `sakila`.`fa`.`actor_id` = 1. In other words, the predicate actor_id = 1 has been applied to the film_actor table because of transitive closure.

The above warning message also illustrates another aspect of MySQL query optimization. Since the query specifies the value for the primary key of the actor table, the primary key look-up will be done in the optimizer phase. Hence, the warning shows that columns from the actor table have been replaced by the column values for the requested row.

Structured EXPLAIN

MySQL 5.6 introduced Structured EXPLAIN. Its output describes the query plan in JSON format. This output contains additional information compared to traditional EXPLAIN. For example, while the tabular EXPLAIN only says "Using where" when a condition is applied when reading a table, the JSON output will display the actual condition. Let's look at the output for the first example on Removing “Silly” Predicates:

mysql> EXPLAIN FORMAT=JSON SELECT * FROM film WHERE release_year = release_year;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "212.00"
    },
    "table": {
      "table_name": "film",
      "access_type": "ALL",
      "rows_examined_per_scan": 1000,
      "rows_produced_per_join": 100,
      "filtered": "10.00",
      "cost_info": {
        "read_cost": "192.00",
        "eval_cost": "20.00",
        "prefix_cost": "212.00",
        "data_read_per_join": "78K"
      },
      "used_columns": [
        "film_id",
        "title",
        "description",
        "release_year",
        "language_id",
        "original_language_id",
        "rental_duration",
        "rental_rate",
        "length",
        "replacement_cost",
        "rating",
        "special_features",
        "last_update"
      ],
      "attached_condition": "(`sakila`.`film`.`release_year` = `sakila`.`film`.`release_year`)"
    }
  }
} |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The value for "attached_condition" shows that MySQL does not simplify the condition release_year = release_year to release_year IS NOT NULL. So Lukas is right in his educated guess that MySQL does not optimize this. However, if we look at a similar silly predicate on a NOT NULL column, we see that such predicates are eliminated by MySQL:

mysql> EXPLAIN FORMAT=JSON SELECT * FROM film WHERE film_id = film_id;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "212.00"
    },
    "table": {
      "table_name": "film",
      "access_type": "ALL",
      "rows_examined_per_scan": 1000,
      "rows_produced_per_join": 1000,
      "filtered": "100.00",
      "cost_info": {
        "read_cost": "12.00",
        "eval_cost": "200.00",
        "prefix_cost": "212.00",
        "data_read_per_join": "781K"
      },
      "used_columns": [
        "film_id",
        "title",
        "description",
        "release_year",
        "language_id",
        "original_language_id",
        "rental_duration",
        "rental_rate",
        "length",
        "replacement_cost",
        "rating",
        "special_features",
        "last_update"
      ]
    }
  }
} |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

In this case, there is no "attached_condition" in the JSON output. In other words, the silly predicate has been removed.

Optimizer Trace

Predicate Merging investigates whether the optimizer will merge two predicates on the same column. There are actually two different aspects here:

  1. Whether predicates are merged and evaluated as a single predicate
  2. Whether key ranges as specified by the query are merged when setting up index range scans
The latter is the most important since it ensures that not more rows than necessary are accessed. Lukas concludes that MySQL does predicate merging based on the estimated number of rows that tabular EXPLAIN shows:
mysql> EXPLAIN SELECT * 
    -> FROM actor
    -> WHERE actor_id IN (2, 3, 4)
    -> AND actor_id IN (1, 2, 3);
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | actor | NULL       | range | PRIMARY       | PRIMARY | 2       | NULL |    2 |   100.00 | Using where |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0,00 sec)
Here EXPLAIN shows that the estimated number of rows to be read is 2, and this fits with how many rows satisfy the merged predicates.

If we look at the output from structured EXPLAIN, we get a different picture:

mysql> EXPLAIN FORMAT=JSON SELECT *  FROM actor WHERE actor_id IN (2, 3, 4) AND actor_id IN (1, 2, 3);
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "2.81"
    },
    "table": {
      "table_name": "actor",
      "access_type": "range",
      "possible_keys": [
        "PRIMARY"
      ],
      "key": "PRIMARY",
      "used_key_parts": [
        "actor_id"
      ],
      "key_length": "2",
      "rows_examined_per_scan": 2,
      "rows_produced_per_join": 2,
      "filtered": "100.00",
      "cost_info": {
        "read_cost": "2.41",
        "eval_cost": "0.40",
        "prefix_cost": "2.81",
        "data_read_per_join": "560"
      },
      "used_columns": [
        "actor_id",
        "first_name",
        "last_name",
        "last_update"
      ],
      "attached_condition": "((`sakila`.`actor`.`actor_id` in (2,3,4)) and (`sakila`.`actor`.`actor_id` in (1,2,3)))"
    }
  }
} |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
As you can see from "attached_condition", the predicates are not actually merged. It is only the key ranges that are merged. Structured EXPLAIN does not show what key ranges are used for the index range scan. However, we can use Optimizer Trace to find this information. (See instructions on how to obtain the optimizer trace.) The optimizer trace for the above query contains this part:
                  "chosen_range_access_summary": {
                    "range_access_plan": {
                      "type": "range_scan",
                      "index": "PRIMARY",
                      "rows": 2,
                      "ranges": [
                        "2 <= actor_id <= 2",
                        "3 <= actor_id <= 3"
                      ]
                    },
                    "rows_for_plan": 2,
                    "cost_for_plan": 2.41,
                    "chosen": true
                  }
This shows that the range optimizer has actually set up two ranges here; one for actor_id = 2 and one for actor_id = 3. (In other words, the range optimizer does not seem to take into account that actor_id is an integer column.)

It is the same story for Lukas' other query investigating predicate merging. That query specifies two overlapping key ranges:

SELECT *
FROM film
WHERE film_id BETWEEN 1 AND 100
AND film_id BETWEEN 99 AND 200;

Also in this case structured EXPLAIN shows that predicates are not merged, while optimizer trace shows that the key ranges are merged:

                  "chosen_range_access_summary": {
                    "range_access_plan": {
                      "type": "range_scan",
                      "index": "PRIMARY",
                      "rows": 2,
                      "ranges": [
                        "99 <= film_id <= 100"
                      ]
                    },
                    "rows_for_plan": 2,
                    "cost_for_plan": 2.41,
                    "chosen": true
                  }
Since this query has range predicates instead of equality predicates, MySQL will here set up a single range scan from 99 to 100.

Concluding Remarks

I have in this blog post shown some examples of how you can get additional information about a query plan, beyond what you can see in the output of tabular EXPLAIN. There is a lot of other information that you can deduct from looking at the EXPLAIN Warning, Structured EXPLAIN, or Optimizer Trace than what I have discussed here. That can be the topic for later blog posts.

Friday, April 21, 2017

Speaking at Percona Live

Percona Live is next week, and on Monday morning I will give a tutorial on "How to Analyze and Tune MySQL Queries for Better Performance".  I hope to see some of you there!  

For those of you that are not able to come, I recommend my on demand webinar on the same topic.  I will also give talks about query tuning at the upcoming Oracle MySQL Innovation Days to be held in the Bay Area (April 28) and in the Boston Area (May 2).

I would also like to recommend the following MySQL optimizer related talks at Percona Live: