Monday, January 16, 2017

Improving the Stability of MySQL Single-Threaded Benchmarks

I have for some years been running the queries of the DBT-3 benchmark, both to verify the effect of new query optimization features and to detect any performance regressions that may have been introduced. However, I have had issues with getting stable results. While repeated runs of a query within one server session are very stable, I can get quite different results if I restart the server. As an example, I got a coefficient of variation (CoV) of 0.1% for 10 repeated executions of a query on the same server, while the CoV for the average runtime of 10 such experiments was over 6%!
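
For reference, the CoV is simply the standard deviation divided by the mean. A small sketch of how it can be computed for a file with one query runtime per line (the file name is only an example, not part of my actual harness):

    # Sketch: coefficient of variation (stddev/mean) for a file with one
    # query runtime (in seconds) per line.
    awk '{ sum += $1; sumsq += $1 * $1; n++ }
         END { mean = sum / n;
               stddev = sqrt(sumsq / n - mean * mean);
               printf "CoV = %.2f%%\n", 100 * stddev / mean }' runtimes.txt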

With such large variation in results, significant performance regressions may not be noticed. I have tried a lot of stuff to get more stable runs, and in this blog post I will write about the things that I have found to have positive effects on stability. At the end, I will also list the things I have tried that did not show any positive effects.

Test Environment

First, I will describe the setup for the tests I run. All tests are run on a 2-socket box running Oracle Linux 7. The CPUs are Intel Xeon E5-2690 (Sandy Bridge) with 8 physical cores @ 2.90GHz.

I always bind the MySQL server to a single CPU, using taskset or numactl, and Turbo Boost is disabled. The computer has 128 GB of RAM, and the InnoDB buffer pool is big enough to contain the entire database. (4 GB buffer pool for scale factor 1 and 32 GB buffer pool for scale factor 10.)
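
For reference, the pinning looks roughly like this; the node number, paths, and buffer pool size below are only an illustration of the setup, not an exact recipe:

    # Disable Turbo Boost (when the intel_pstate driver is active) and start
    # mysqld bound to the CPUs and memory of one socket (node 0).
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
    numactl --cpunodebind=0 --membind=0 \
        bin/mysqld --defaults-file=my.cnf --innodb-buffer-pool-size=4G &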

Each test run is as follows:

  1. Start the server
  2. Run a query 20 times in sequence
  3. Repeat step 2 for all DBT-3 queries
  4. Stop the server

The result for each query will be the average execution time of the last 10 runs. The reason for the long warm-up period is that, from experience, when InnoDB's Adaptive Hash Index is on, you will need 8 runs or so before execution times are stable.
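
Put together, a test run boils down to something like the sketch below; the file names, connection details, and the way queries are timed are simplified and not my actual scripts:

    # One test run: start the server, time each DBT-3 query 20 times in
    # sequence, then stop the server.
    numactl --cpunodebind=0 --membind=0 bin/mysqld --defaults-file=my.cnf &
    sleep 30                                   # give the server time to start
    for q in queries/q*.sql; do
        for i in $(seq 1 20); do
            /usr/bin/time -f "$q run $i: %e s" -a -o runtimes.txt \
                mysql dbt3 < "$q" > /dev/null
        done
    done
    bin/mysqladmin shutdown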

As I wrote above, the variance within each test run is very small, but the difference between test runs can be large. The variance improves somewhat if I pick the best result out of three test runs, but it is still not satisfactory. Also, a full test run on a scale factor 10 database takes 9 hours, so I would like to avoid having to repeat the tests multiple times.

Address Space Layout Randomization

A MySQL colleague mentioned that he had heard about some randomization that was possible to disable. After some googling, I learned about Address Space Layout Randomization (ASLR). This is a security technique that is used to prevent an attacker from being able to determine where code and data are located in the address space of a process. I also found some instructions on Stack Overflow for how to disable it on Linux.
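
For those who want to try the same, ASLR on Linux is controlled by a kernel parameter that can be changed at runtime (remember to restore the default afterwards):

    # Disable ASLR system-wide; 2 is the default (full randomization), 0 is off.
    sudo sysctl -w kernel.randomize_va_space=0
    # ... run the benchmark ...
    sudo sysctl -w kernel.randomize_va_space=2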

Turning off ASLR sure made a difference! Take a look at this graph that shows the average execution time for Query 12 in ten different test runs with and without ASLR (All runs are with a scale factor 1 DBT-3 database on MySQL 8.0.0 DMR):

I will definitely make sure ASLR is disabled in future tests!

Adaptive Hash Index

InnoDB maintains an Adaptive Hash Index (AHI) for frequently accessed pages. The hash index speeds up look-ups on the primary key, and is also useful for secondary index accesses since a primary key look-up is needed to get from the index entry to the corresponding row. Some DBT-3 queries run twice as slow if I turn off AHI, so it definitely has some value. However, experience shows that I will have to repeat a query several times before the AHI is actually built for all pages accessed by the query. I plan to write another blog post where I discuss AHI in more detail.

Until I stumbled across ASLR, turning off AHI was my best bet at stabilizing the results. After disabling ASLR, also turning off AHI only shows a slight improvement in stability. However, there are other reasons for turning off AHI.

I have sometimes observed that, with AHI on, a change of query plan for one query may affect the execution time of subsequent queries. I suspect the reason is that the content of the AHI after a query has been run may differ depending on the query plan. Hence, the next query may be affected if it accesses the same data pages.

Turning off AHI also means that I no longer need the long warm-up period for the timing to stabilize. I can then repeat each query 10 times instead of 20 times. This means that even if many of the queries take longer to execute, the total time to execute a test run will be lower.

Because of the above, I have decided to turn off AHI in most of my test runs. However, I will run with AHI on once in a while just to make sure that there are no major regressions in AHI performance.
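
For reference, AHI is controlled by the innodb_adaptive_hash_index system variable, which is dynamic, so it can be switched without restarting the server; for example:

    # Turn the Adaptive Hash Index off (or back on) at runtime; it can also be
    # set in the configuration file.
    mysql -e "SET GLOBAL innodb_adaptive_hash_index = OFF"
    mysql -e "SET GLOBAL innodb_adaptive_hash_index = ON"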

Preload Data and Indexes

I also tried to start each test run with a set of queries that would sequentially scan all tables and indexes. My thinking was that this could give a more deterministic placement of data in memory. Before I turned off ASLR, preloading had very good effects on the stability when AHI was disabled. With ASLR off, the difference is less significant, but there is still a slight improvement.
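
The preload queries are simply full scans of each table and each secondary index; the sketch below shows the idea (the index name is illustrative and not necessarily the exact DBT-3 schema name):

    # Preload by scanning the clustered index and a secondary index of a table;
    # repeat for all tables and indexes in the schema.
    mysql dbt3 -e "SELECT COUNT(*) FROM lineitem FORCE INDEX (PRIMARY)"
    mysql dbt3 -e "SELECT COUNT(l_partkey) FROM lineitem FORCE INDEX (i_l_partkey)"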

Below is a table that shows some results for all combinations of the settings discussed so far. Ten test runs were performed for each combination on a scale factor 1 database. The numbers shown are, for each combination, the relative difference between the best and the worst run, averaged over all queries, and the largest such difference for any single query.

ASLR  AHI  Preload  Avg(MAX-MIN)  Max(MAX-MIN)
on    on   off          6.18%        14.75%
on    on   on           4.65%        14.79%
on    off  off          5.56%        14.65%
on    off  on           2.18%         5.05%
off   on   off          1.66%         3.94%
off   on   on           1.58%         3.58%
off   off  off          1.66%         3.78%
off   off  on           1.09%         3.27%

From the above table it is clear that the most stable runs are achieved by using preloading in combination with disabling both ASLR and AHI.

For one of the DBT-3 queries, using preloading on a scale factor 10 database leads to higher variance within a test run. While the CoV within a test run is below 0.2% for all queries without preloading, query 21 has a CoV of 3% with preloading. I am currently investigating this, and I have indications that the variance can be reduced by setting the memory placement policy to interleave. I guess the reason is that with a 32 GB InnoDB buffer pool, one will not be able to allocate all memory locally to the CPU where the server is running.
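
The interleaved placement can be requested when the server is started, along these lines (node numbers are just an example):

    # Keep the server threads on node 0, but spread the 32 GB buffer pool
    # across the memory of both nodes.
    numactl --cpunodebind=0 --interleave=all bin/mysqld --defaults-file=my.cnf &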

What Did Not Have an Effect?

Here is a list of things I have tried that did not seem to have a positive effect on the stability of my results:

  • Different governors for CPU frequency scaling. I have chosen the performance governor because it "sounds good", but I did not see any difference when using the powersave governor instead. I also tried turning off the Intel pstate driver, but did not notice any difference in that case either. (A sketch of the commands involved follows the list.)
  • Binding the MySQL server to a single core or hardware thread (instead of a CPU).
  • Binding the MySQL client to a single CPU.
  • Different settings for the NUMA memory placement policy using numactl. (With the possible exception of using interleave for scale factor 10, as mentioned above.)
  • Different memory allocation libraries (jemalloc, tcmalloc). jemalloc actually seemed to increase the instability.
  • Disabling InnoDB buffer pool preloading at startup: innodb_buffer_pool_load_at_startup = off
  • Setting innodb_old_blocks_time = 0
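
For completeness, this is the kind of commands involved in the governor and NUMA policy experiments mentioned in the list (a sketch only, not my exact invocations):

    # Select the CPU frequency scaling governor (performance vs. powersave).
    sudo cpupower frequency-set -g performance
    # Example of an alternative NUMA placement policy: local allocation on the
    # node where mysqld runs, instead of a strict --membind.
    numactl --cpunodebind=0 --localalloc bin/mysqld --defaults-file=my.cnf &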

Conclusion

My tests have shown that I get better stability of test results if I disable both ASLR and AHI, and that combining this with preloading of all tables and indexes will, in many cases, further improve the stability of my test setup. (Note that I do not recommend disabling ASLR in a production environment. That could make you more vulnerable to attacks.)

I welcome any comments and suggestions on how to further increase the stability for MySQL benchmarks. I do not claim to be an expert in this field, and any input will be highly appreciated.

5 comments:

  1. Hi, I spent almost one year analyzing all the factors impacting Python benchmarks, especially "microbenchmarks" (benchmarks where a single loop takes less than 1 millisecond, or even less than 1 microsecond). I found a long list of factors impacting benchmarks:
    https://haypo-notes.readthedocs.io/microbenchmark.html#microbenchmarks

    I took a different decision on ASLR: I make sure that it's enabled! I wrote a Python perf module which sequentially spawns worker processes to run benchmarks. By default, a benchmark runs 20 processes; that is the minimum to get a reliable average. With enough samples, ASLR has no impact on the average.

    If you disable ASLR, your benchmark gives you one possible timing. Since ASLR is enabled by default, users are likely to get different timings: better or worse. Anyway, ASLR is a single factor impacting performance; there are many others: command-line arguments and environment variables, for example. It's more *reliable* to spawn the benchmark in multiple processes (sequentially) and compute the average *and* standard deviation.

    perf module: http://perf.readthedocs.io/

    Good luck with benchmarks, it's a nightmare :-D

    1. Hi, and thanks for sharing your experience and insights. I will definitely experiment with some of the factors on your list.

      Concerning ASLR, I agree that if you repeat the benchmark enough times, ASLR should have little impact on the average. This is a good approach for microbenchmarks. However, in my setting a test run may take hours, so a large number of repetitions is not an option. The reason I want to increase the stability is to be able to get a representative result with just a few repetitions.

  2. Thanks for providing this informative information. You may also refer to:
    http://www.s4techno.com/blog/2016/02/05/setup-mysql-database-for-bugzilla/

  3. I know you mentioned that different governors for CPU frequency scaling and tuning of the Intel pstate driver did not have an effect.

    If you have root, a quick way to dynamically control C-states is the Power Management Quality of Service (PM QoS) interface, /dev/cpu_dma_latency, which can be used to prevent the processor from entering deep sleep states and causing unexpected latencies when exiting them. Opening /dev/cpu_dma_latency and writing a zero to it will prevent transitions to deep sleep states while the file descriptor is held open. Additionally, writing a zero to it emulates the idle=poll behavior.
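
    For example, from a root shell a zero-latency request can be held like this while the benchmark runs (just a sketch; the 10-character hex form is the one the kernel documentation describes):

      exec 3> /dev/cpu_dma_latency       # open the PM QoS device and keep it open
      printf '0x00000000' >&3            # request 0 microseconds latency
      # ... run the benchmark while file descriptor 3 is held open ...
      exec 3>&-                          # closing it restores the default policy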

    https://access.redhat.com/articles/65410
    https://www.kernel.org/doc/Documentation/power/pm_qos_interface.txt

    Page 7 of "Controlling Processor C-State Usage in Linux":
    http://en.community.dell.com/cfs-file/__key/telligent-evolution-components-attachments/13-4491-00-00-20-22-77-64/Controlling_5F00_Processor_5F00_C_2D00_State_5F00_Usage_5F00_in_5F00_Linux_5F00_v1.1_5F00_Nov2013.pdf

    contains setcpulatency.c which can be used to dynamically control C-states. Also here:

    https://github.com/gtcasl/hpc-benchmarks/blob/master/setcpulatency.c

    It can be very convenient as it is dynamic.
    But you would need consistent cooling for stable benchmarks.

    To prevent bugs in BIOS/CPU drivers (and temperature effects) from impacting stability, you can go the opposite way and set the CPU frequency to the lowest value (also suggested in https://haypo.github.io/intel-cpus-part2.html ).

    1. Hi Peter! Thanks for this very useful input. I was not aware of this interface, and I will check if it has any effect on my tests. However, I am not sure that power saving is one of my biggest problems. I would think the warm-up period is more than long enough to wake up the cores that I am running on. Once the warm-up period is over, there will be no disk I/O, so the core should be fully occupied with executing the queries. Hence, I doubt that it will enter deep C-states during the test. However, I will need to verify this.
