Theory vs. Practice
Diagnosis is not the end, but the beginning of practice.
Critical wrk2 bug: all wrk2 benchmarks since 2012 are bogus
As you will see, benchmarking is not a walk in the park.
In 2023-2024, I first used wrk, which takes forever to complete benchmarks against a fast server because wrk attempts to count all the server replies (if the server takes 10 seconds to complete the test, and wrk is 500 times slower than the server, then wrk will need 500 * 10 seconds = 5000 seconds = 1 hour 23 minutes to complete the test).
In late 2024, an engineer suggested that I use wrk2 because it is slower but more reliable... and it stops at the specified time (instead of taking forever).
In April 2025, I published new [1k-40k users] benchmarks (G-WAN reaching 242m RPS at 10k users). But a few months later, I discovered that wrk2, freshly installed on new machines, was crashing at... 10k users.
This was odd because 10k users at 242m RPS was the concurrency where G-WAN was vaporizing NGINX and others (which top out at less than 1m RPS at 1k users). But I did not have time to fix wrk2, and I thought that writing a G-WAN-based benchmark would be a much better value proposition than fixing the slow wrk2.
Around September 2025, I noticed that an OS update had slowed G-WAN down from 242m RPS to 8m RPS (so I wrote the G-WAN cache to bypass the 'faulty' Linux kernel syscall – restoring G-WAN performance to 281m RPS at 10k users).
I thought I was safe from this point on. But in April 2026 I was told that creating many threads could take so much time that wrk2 could leave no time for the actual benchmark. The person suggested this patch, where stop_at is computed after start (wrk2 was computing stop_at before start and before the creation of threads!):
--- a/src/wrk.c
+++ b/src/wrk.c
@@ -122,7 +122,8 @@
     uint64_t connections = cfg.connections / cfg.threads;
     double throughput = (double)cfg.rate / cfg.threads;
-    uint64_t stop_at = time_us() + (cfg.duration * 1000000);
+    uint64_t start = time_us();
+    uint64_t stop_at = start + (cfg.duration * 1000000);
     for (uint64_t i = 0; i < cfg.threads; i++) {
         thread *t = &threads[i];
@@ -163,7 +164,6 @@
     printf(" %"PRIu64" threads and %"PRIu64" connections\n",
             cfg.threads, cfg.connections);
-    uint64_t start = time_us();
     uint64_t complete = 0;
     uint64_t bytes = 0;
     errors errors = { 0 };
I promised to investigate further, and discovered that the situation was much worse than presented, as the proposed patch would not fix the main issue:
wrk2 sets up a stop_at time before creating the threads, and a start time after creating and calibrating them. When thread calibration takes too long (default: 10 seconds, but this duration is extended by the number of connections!), the test time (default: 10 seconds) is reduced by this bug to less than 1 second. The RPS calculation req_per_s = complete / runtime_s then divides by a value below 1, which turns the division into a multiplication (leading to bogus values).
The obvious fix was to do this in wrk.c, not in main() but in the threads' function:
thread->start = time_us();
thread->stop_at = thread->start + (cfg.duration * 1000000); // <= THE FIX
aeMain(loop);
With this single line, we guarantee that every single thread will execute for (at least) the user-specified time. wrk2 benchmarks will last longer than before because the thread calibration time will no longer be subtracted from the thread benchmarking time (the two will add up). And, probably, not all threads will end at the same time, making benchmarks last even longer.
SO, MOST BENCHMARKS DONE WITH WRK2 SINCE 2012 ARE... BOGUS (AND NOBODY HAS EVER NOTICED).
wrk2 was first published in 2012 by Gil Tene. In 2026, this is a 14-year-old major bug for a tool described as "A constant throughput, correct latency recording variant of wrk" (wrk, created by Will Glozer, does not have this bug, but as we have seen, it has other problems).
After fixing wrk2's latest available source code and recompiling it, I quickly tested it and... it crashed at 10k users.
I re-downloaded wrk2 from several sources to compare it to the version I downloaded in October 2024. In the 2024 source code, the fatal bug was already there... but this 2024 version of wrk2 (published before the April 2025 G-WAN benchmarks) had no problem testing up to 40k users without crashing (at 50k users wrk2 is "Terminated" by the kernel OOM kill-switch, for using 190+ GB on my 192 GB RAM machine... while G-WAN is consuming less than 700 MB of RAM).
In the newest versions of wrk2 available on GitHub, Ubuntu repositories, etc., the Makefile has also changed and the resulting executable file is now 10 times smaller than before – but it crashes at 10k users.
If someone wanted to sabotage the tool that allowed G-WAN to shine, they would not have done anything differently. I hardly see why and how wrk2 crashing at 10k+ users counts as progress for a benchmark tool widely considered the "best of its class".
So, I now have a version of wrk2 (without the bogus-RPS bug) that doesn't crash at 10k users... and is much easier to compile since (1) it comes with all its dependencies and (2) has a Makefile using them.
And guess what, my version of wrk2 is much, much faster than the unpatched version: G-WAN now tops out at 469m RPS at 10k users on the same machine where the same (relatively old) version of G-WAN topped out at 281m RPS.
The latest G-WAN is now much, much faster, but that will be for another blog post.
I am sharing my patched version of wrk2 with the world, to let people test both their own work and G-WAN (HTTP(S) server, Web applications, and caching reverse proxy).