Thursday, June 25, 2020

Memory Usage for MySQL Hash Join

At Percona Live Online my colleague at Alibaba, Jim Tommaney, presented a study of hash join performance in MySQL and Marwho?iaDB. For some of the queries, the peformance characteristics for MySQL seemed a bit strange, and I decided to look closer into what was going on. As you can read below, I found that the crucial point was the memory usage of hash join in MySQL.

Results presented at Percona Live ONLINE

In Jims presentation, you can see the following graphs for the performance of his query 2.2 on MySQL and Marwho?iaDB:

We see that for Marwho?iaDB, the performance scales pretty linear with increasing selectivity factor, while for MySQL there is a "step-wise" increase. The query is run on a scale factor 10 DBT-3 database, and looks like this:

SELECT /*+ NO_INDEX(orders) NO_INDEX(lineitem) */ COUNT(*)
FROM orders
JOIN lineitem ON l_orderkey = o_orderkey
WHERE o_custkey <= 1500000 * @selectivity;

We are joining two tables, orders with 15 million rows and lineitem with 60 million rows. The WHERE condition is on the smallest table, and the query is run with varying selectivity factors. For hash join to be used, we have to disable the indexes on the join keys. Otherwise, the query optimizer would choose to use indexed nested loop join. (I use the new optimizer hints introduced in MySQL 8.0.20 for this.)

The impact of join buffer size

The way hash join works is that it will build a hash-table of one of the inputs, the build input, which is usually the smallest of the two inputs, and scan through the other input, the probe input, using the hash table to find matches. In this case, the rows from the orders tables that satisfies the WHERE condition will be inserted into the hash table. This means that a larger selectiviy factor will require more memory for the hash table. In MySQL, the amount of memory available for the hash table is determined by the session variable join_buffer_size. If the hash table is larger than the join buffer, MySQL will split both inputs into chunks that are stored on disk, and then apply the in-memory hash join algorithm to each pair of chunks. For more on how hash join is implemented in MySQL, I recommend Erik Frøseth's great presentation at this year's FOSDEM.

In Jim's experiments, the size of the join buffer was 512 MB. This size was chosen in order for Marwho?iaDB to be able to use hash join for the queries. Marwho?iaDB does not support disk-based hash join, so the join buffer needs to be large enough to hold the entire build input. My first experiment was to run Jim's query 2.2 with different sizes for the join buffer. The results are displayed in this graph:

We see that if we use a join buffer of 2 GB, we get linear scalability for this query also for MySQL. So it seems that MySQL requires more memory than Marwho?iaDB for the same query. I have identified 3 reasons for this:

  1. Only approx. 2/3 of the join buffer will be used for in-memory hash join. This happens because MySQL will switch from in-memory to hybrid hash join when the whole join buffer has been allocated, not when it has been filled up. I have filed Bug#99933 for this issue.
  2. MySQL will add columns to the join buffer that is not needed for the rest of the query evaluation. In our example query, the column o_custkey is only needed for the evaluation of the WHERE condition. This condition will be evaluated before the rows of the orders are inserted into the hash table. Hence, this column is not needed for the hash join or afterwards and could be dropped. I have filed Bug#99934 for this issue.
  3. The hash table implementation requires more space. MySQL uses std::unordered_multimap to implement the hash table. This requires more space than the "home-made" table of linked lists used by Marwho?iaDB. Using the debugger, I found that, for this query, MySQL allocates 72 bytes per row, of which 48 is for the hash table entry, 16 bytes for the two columns, and 8 bytes for the hash value of the join key. The latter is used for the key of the multimap. (I do not think Marwho?iaDB stores the hash value in the join buffer; it will just be used to identify which of the linked list to check for a match.) Note that this query is a bit extreme since it actually only needs to store one integer column per row in the hash table. For most queries, one will typical need to store more columns in the join buffer; columns for grouping, ordering, comparisons or joins with other tables, in addition to the columns of the query result. So for a typical query the relative overhead of the hash table will be significantly smaller.

What is the ideal size for the join buffer?

Of course, if we have enough memory, you will get the best performance if you can store the entire build input in the join buffer. However, for large tables with much relevant data per row, this may not be feasible. A query may also need multiple active join buffers at the same time, e.g., building a new hash table while probing the previous, and multiple queries may run concurrently. What our test shows, is that if in-memory hash join is not possible, a medium size join buffer is often better than a larger one. Looking at this graph, we see that for hybrid hash join, a 256 MB join buffer performs much worse than many smaller sizes:

I have not looked into the details of why, but my guess is that you will get better performance from hash join if the join buffer can fit in the CPU cache. From the above graph, it seems that for this particular query, on this particular machine, a 16 MB join buffer will give the best general performance for hybrid hash join.


We have seen that MySQL hash join has some inefficiencies with respect to memory usage. Still, hash join provides a speed-up over nested loops join for many queries. There are more examples of this in Jim's presentation. Unlike Marwho?iaDB, MySQL supports using disk to split up the hash join. That way, it is possible to efficiently join large tables without using much extra memory.

Saturday, October 12, 2019

MySQL Parallel Index Range Scans: One is Sometimes Better Than Two

As presented at the Percona Live Conference in Austin in May, at Alibaba we are working on adding support for parallel query execution to POLARDB for MySQL. As discussed in the presentation, we observe that parallel range scans on secondary indexes does not scale very well in MySQL. This is old news. Mark Callaghan reported this several years ago, and this does not seem to have changed. In this blog post, I will investigate what effects using a multi-processor machine have on the scalability of parallel index range scans.

I will run a variant of Query 6 of the TPC-H/DBT-3 benchmark. While the original Query 6 sums up the revenue for a whole year, my version will only compute it over a single month:

SELECT SUM(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= '1997-01-01'
  AND l_shipdate < DATE_ADD('1997-01-01', INTERVAL '1' MONTH)
  AND l_discount BETWEEN 0.05 - 0.01 AND 0.05 + 0.01
  AND l_quantity < 25;

This query was run by multiple parallel threads using sysbench on MySQL 8.0.17. For each query execution, a random month was picked. The machine used have 2 processors with 16 physical cores each, so with hyper-threading there are in total 64 virtual CPUs. I used a DBT-3 scale factor 1 database, and with a 4 GB InnoDB buffer pool, all data was cached in memory. MySQL will use the secondary index on l_shipdate when executing this query. For comparison, I also ran the same query when forcing MySQL to use table scan instead of index range scan. The results are presented in this graph:

The throughput when using the secondary index is of course higher than for table scan, since we only need to read data for 1 month instead of for all 84 months when using a full table scan. However, we see that while the table scan scales almost linearly up to 32 threads, this is not the case for the index range scan. For 32 threads, the throughput with table scans is more than 30 times higher than with 1 thread. For index range scan, the increase is only about 7.8x and 9.5x for 32 and 64 threads, respectively.

We can see that for table scan, there is no benefit from hyper-threading; the maximum throughput is reached when having one thread per physical core. In other words, each thread doing table scan is able to execute without any significant interrupts. For index range scans, on the other hand, we see that we get higher throughput with 64 threads than with 32 threads. This is an indication that the threads are regularly stalled, and there is a benefit from having other threads that can be scheduled when this happens.

So why are the threads stalled? I think the most likely reason is that there is a synchronization bottleneck related to non-covering secondary index scans. Further investigations are needed to understand where, but my current guess is that it is related to accessing the root page of the B-tree when looking up the row corresponding to the index entry. (I have turned off the adaptive hash index (AHI) when running these tests, so all primary key look-ups will have to go through the root page. When using the AHI, the scaling seem to be even worse, but that is different story ...)

When there is a lot of thread synchronization, running on multiple processors may increase our problems since the state of the mutex, or whatever is used for synchronization, will have to be synchronized between the caches of the CPUs. To investigate what effects this have for our case, I will use Resource Groups to make MySQL use only one processor for this query. First, I will create a resource group, cpu1, that contains the virtual CPUs of processor 1:

CREATE RESOURCE GROUP cpu1 TYPE=user VCPU=16-31,48-63;

To make our query use this resource group, we add a hint to the query:

       SUM(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= '1997-01-01'
  AND l_shipdate < DATE_ADD('1997-01-01', INTERVAL '1' MONTH)
  AND l_discount BETWEEN 0.05 - 0.01 AND 0.05 + 0.01
  AND l_quantity < 25;

We repeat the sysbench runs with this query, and compare the results for index range scans:

We see that using only the cores of one processor, increases the performance. The maximum throughput is increased by almost 30% by using one processor instead of two! The "scale factors" are now 10.3x and 12.6x for 16 and 32 threads, respectively. Much better, but still a bit away from perfect scaling. Hence, there is still a need for further investigations. Stay tuned!

Tuesday, May 28, 2019

My presentations at Percona Live

Percona Live 2019 is starting in Austin today.  You may already have read that I have a presentation on how to use Optimizer Trace to understand the inner workings of the MySQL Query Optimizer: The MySQL Query Optimizer Explained Through Optimizer Trace on Wednesday at 2:55pm.
I hope to see many of you there!

Together with my colleague Benny Wang, I will also be presenting the current work we do at Alibaba to add parallel query processing to PolarDB for MySQL.  If you want to learn more about our work, come to the presentation on Thursday at 11:00am.

Many other of my colleagues at Alibaba have  presentations on the MySQL related work we do for Alibaba Cloud.  Please, check out the AliSQL & POLARDB track on Thursday!

Wednesday, January 30, 2019

Inspecting the Content of a MySQL Histogram

FOSDEM is coming up. I do not have a presentation in the MySQL and Friends devroom this year, but it reminded me that I had planned to post a follow-up to my presentation from last year.

As part of the presentation, I showed how you can inspect the content of a histogram using the information schema table column_statistics. For example, the following query will show the content of the histogram for the column l_linenumber in the table lineitem of the dbt3_sf1 database:

FROM information_schema.column_statistics
WHERE schema_name = 'dbt3_sf1'
 AND table_name ='lineitem'
 AND column_name = 'l_linenumber';

The histogram is stored as a JSON document:

 "buckets": [[1, 0.24994938524948698], [2, 0.46421066400720523],
 [3, 0.6427401784471978], [4, 0.7855470933802572],
 [5, 0.8927398868395817], [6, 0.96423707532558], [7, 1] ],
 "data-type": "int",
 "null-values": 0.0,
 "collation-id": 8,
 "last-updated": "2018-02-03 21:05:21.690872",
 "sampling-rate": 0.20829115437457252,
 "histogram-type": "singleton",
 "number-of-buckets-specified": 1024

The distribution of values can be found in the buckets array of the JSON document. In the above case, the histogram type is singleton. That means that each bucket contains the frequency of a single value. For the other type of histogram, equi-height, each bucket will contain the minimum and maximum value for the range covered by the bucket. The frequency value recorded, is the cumulative frequency. That is, it gives the frequency of values smaller than the maximum value of the bucket. In the example above, 64.27% of the values in the l_linenumber column is less than or equal to 3.

In other words, if you have created a histogram for a column, you can query the information schema table to get estimates on column values. This will normally be much quicker than to get an exact result by querying the actual table.

As discussed in my FOSDEM presentation, string values are base64 encoded in the histogram. At the time of the presentation, using MySQL 8.0.11, it was a bit complicated to decode these string values. However, from MySQl 8.0.12 on, this has become simpler. As stated in the release notes for MySQL 8.0.12:

The JSON_TABLE() function now automatically decodes base-64 values and prints them using the character set given by the column specification.

JSON_TABLE is a table function that will convert a JSON array to a relational table with one row per element of the array. We can use JSON_TABLE to extract the buckets of the histogram into a relational table:

SELECT v value, c cumulfreq
FROM information_schema.column_statistics,
     JSON_TABLE(histogram->'$.buckets', '$[*]'
                COLUMNS(v VARCHAR(60) PATH '$[0]',
                        c double PATH '$[1]')) hist
WHERE schema_name = 'dbt3_sf1'
  AND table_name ='orders'
  AND column_name = 'o_orderstatus';
Running the above query on my DBT3 database, I get the following result:
| value | cumulfreq           |
| F     | 0.48544670343055835 |
| O     |  0.9743427900693199 |
| P     |                   1 |

The above gives the cumulative frequencies. Normally, I would rather want to see the actual frequencies of each value, and to get that I will need to subtract the value of the previous row. We can use a window function to do that:

mysql> SELECT v value, c cumulfreq,  c - LAG(c, 1, 0) OVER () freq
    -> FROM information_schema.column_statistics,
    ->      JSON_TABLE(histogram->'$.buckets', '$[*]'
    ->          COLUMNS(v VARCHAR(60) PATH '$[0]',
    ->                    c double PATH '$[1]')) hist
    -> WHERE schema_name = 'dbt3_sf1'
    ->   AND table_name ='orders'
    ->   AND column_name = 'o_orderstatus';
| value | cumulfreq           | freq                 |
| F     | 0.48544670343055835 |  0.48544670343055835 |
| O     |  0.9743427900693199 |  0.48889608663876155 |
| P     |                   1 | 0.025657209930680103 |
3 rows in set (0.00 sec)

So by combining three new features in MySQL 8.0, histogram, JSON_TABLE, and window functions, I am able to quickly get an estimate for the frequencies of the possible values for my column.

Wednesday, February 28, 2018

FOSDEM: Histogram support in MySQL 8.0

The slides and video from my FOSDEM presentation on histogram support in MySQL can now be downloaded.  Check it out if you want to learn more about how you can create histograms to improve query plans in MySQL 8.0.

The presentation shows you how to create histograms over column values, 
presents the types of histograms you can create, and discusses best practices for using histograms.

Prior to MySQL 8.0, the query optimizer lacked statistics on columns that are not indexed.  With histograms the query optimizer will be able to make more accurate selectivity estimates for conditions on such columns.   With better estimates, the query optimizer can produce better query plans, especially for join queries.  Check out the presentation for a couple of examples of join queries that runs faster due to histograms!

Thursday, February 1, 2018

Speaking at FOSDEM

I will be speaking at FOSDEM the coming Sunday (February 4) on Histogram support in MySQL 8.0. If you are at FOSDEM, and want to learn about how you can use histograms to improve your query execution plans, visit the MySQL and Friends devroom at 11:10.

Also, please, checkout the entire program for the MySQL devroom.  It is full of interesting talks.

Friday, January 26, 2018

Query Optimizer Takes Data Buffering into Account

Last month I published a blog post at about how the Query Optimizer in MySQL 8.0 takes advantage of that InnoDB provides buffer estimates per table and index. In that blog post I showed how we got different query plans for Query 8 of the DBT-3 benchmark depending on whether the data was cached InnoDB's buffer pool or not.

In MySQL 5.6, we got a query plan that was well suited for a disk-bound scenario, while MySQL 5.7 switched to a query plan that reduces the execution time by 90% when all data is in memory. However, that came at the expense of longer execution times when data had to be read from disk. In MySQL 8.0 you get the best of both worlds. One query plan will be used when all data is in memory, and another plan will be used when most data need to be fetched from disk.

DBT-3 Query 21

I earlier blogged about another DBT-3 query, Query 21. This query showed a performance regression from MySQL 5.6 to MySQL 5.7 because the query optimizer overestimated the filtering effect of the WHERE clause. In that blog post, I promised that this can be avoided in MySQL 8.0 since you can create histograms to improve the filtering estimates. It turns out that the problem is avoided even without histograms.

In MySQL 5.7 we got this query plan for Query 21:

As discussed in the previous blog post, in MySQL 5.7, the optimizer chooses to start with table orders because it overestimates the filtering on this table. In MySQL 8.0 we still get the above query plan when all the data needs to be read from disk. However, if all data is in memory we get the following query plan:

As you can see from the diagram, the join order has been reversed. Using this new query plan reduces the execution time by 85% when all data is in memory. However, since this plan uses secondary indexes for two of the tables, it is not a good plan when most data has to be read from disk. In fact, if the optimizer had picked this plan in a disk-bound scenario, the query would have been 2.5 times slower than with the plan from 5.7. Hence, by taking into account whether data is in memory or needs to be read from disk, the optimizer is able to find a good plan for Query 21 in both scenarios.


One thing to be aware of with this new optimizer behavior, is that running EXPLAIN on a query after it was executed, will not necessarily show the query plan that was used during execution. The previous execution of the query may have changed the content of the buffer pool. Hence, the optimizer may pick a different query plan for the next execution of the query.