At Percona Live Online my colleague at Alibaba, Jim Tommaney, presented a study of hash join performance in MySQL and Marwho?iaDB. For some of the queries, the peformance characteristics for MySQL seemed a bit strange, and I decided to look closer into what was going on. As you can read below, I found that the crucial point was the memory usage of hash join in MySQL.
Results presented at Percona Live ONLINE
In Jims presentation, you can see the following graphs for the performance of his query 2.2 on MySQL and Marwho?iaDB:
We see that for Marwho?iaDB, the performance scales pretty linear with increasing selectivity factor, while for MySQL there is a "step-wise" increase. The query is run on a scale factor 10 DBT-3 database, and looks like this:
SELECT /*+ NO_INDEX(orders) NO_INDEX(lineitem) */ COUNT(*) FROM orders JOIN lineitem ON l_orderkey = o_orderkey WHERE o_custkey <= 1500000 * @selectivity;
We are joining two tables, orders with 15 million rows and lineitem with 60 million rows. The WHERE condition is on the smallest table, and the query is run with varying selectivity factors. For hash join to be used, we have to disable the indexes on the join keys. Otherwise, the query optimizer would choose to use indexed nested loop join. (I use the new optimizer hints introduced in MySQL 8.0.20 for this.)
The impact of join buffer size
The way hash join works is that it will build a hash-table of one of the inputs, the build input, which is usually the smallest of the two inputs, and scan through the other input, the probe input, using the hash table to find matches. In this case, the rows from the orders tables that satisfies the WHERE condition will be inserted into the hash table. This means that a larger selectiviy factor will require more memory for the hash table. In MySQL, the amount of memory available for the hash table is determined by the session variable join_buffer_size. If the hash table is larger than the join buffer, MySQL will split both inputs into chunks that are stored on disk, and then apply the in-memory hash join algorithm to each pair of chunks. For more on how hash join is implemented in MySQL, I recommend Erik Frøseth's great presentation at this year's FOSDEM.
In Jim's experiments, the size of the join buffer was 512 MB. This size was chosen in order for Marwho?iaDB to be able to use hash join for the queries. Marwho?iaDB does not support disk-based hash join, so the join buffer needs to be large enough to hold the entire build input. My first experiment was to run Jim's query 2.2 with different sizes for the join buffer. The results are displayed in this graph:
We see that if we use a join buffer of 2 GB, we get linear scalability for this query also for MySQL. So it seems that MySQL requires more memory than Marwho?iaDB for the same query. I have identified 3 reasons for this:
- Only approx. 2/3 of the join buffer will be used for in-memory hash join. This happens because MySQL will switch from in-memory to hybrid hash join when the whole join buffer has been allocated, not when it has been filled up. I have filed Bug#99933 for this issue.
- MySQL will add columns to the join buffer that is not needed for the rest of the query evaluation. In our example query, the column o_custkey is only needed for the evaluation of the WHERE condition. This condition will be evaluated before the rows of the orders are inserted into the hash table. Hence, this column is not needed for the hash join or afterwards and could be dropped. I have filed Bug#99934 for this issue.
- The hash table implementation requires more space. MySQL uses std::unordered_multimap to implement the hash table. This requires more space than the "home-made" table of linked lists used by Marwho?iaDB. Using the debugger, I found that, for this query, MySQL allocates 72 bytes per row, of which 48 is for the hash table entry, 16 bytes for the two columns, and 8 bytes for the hash value of the join key. The latter is used for the key of the multimap. (I do not think Marwho?iaDB stores the hash value in the join buffer; it will just be used to identify which of the linked list to check for a match.) Note that this query is a bit extreme since it actually only needs to store one integer column per row in the hash table. For most queries, one will typical need to store more columns in the join buffer; columns for grouping, ordering, comparisons or joins with other tables, in addition to the columns of the query result. So for a typical query the relative overhead of the hash table will be significantly smaller.
What is the ideal size for the join buffer?
Of course, if we have enough memory, you will get the best performance if you can store the entire build input in the join buffer. However, for large tables with much relevant data per row, this may not be feasible. A query may also need multiple active join buffers at the same time, e.g., building a new hash table while probing the previous, and multiple queries may run concurrently. What our test shows, is that if in-memory hash join is not possible, a medium size join buffer is often better than a larger one. Looking at this graph, we see that for hybrid hash join, a 256 MB join buffer performs much worse than many smaller sizes:
I have not looked into the details of why, but my guess is that you will get better performance from hash join if the join buffer can fit in the CPU cache. From the above graph, it seems that for this particular query, on this particular machine, a 16 MB join buffer will give the best general performance for hybrid hash join.
We have seen that MySQL hash join has some inefficiencies with respect to memory usage. Still, hash join provides a speed-up over nested loops join for many queries. There are more examples of this in Jim's presentation. Unlike Marwho?iaDB, MySQL supports using disk to split up the hash join. That way, it is possible to efficiently join large tables without using much extra memory.