EDIT: Issue fixed with Puma version 4.3.1.
A security vulnerability and performance issue exists in the Ruby application server Puma. Puma may delay processing requests from some clients.
This issue may result in degraded performance (at best) or a temporary DoS for some/all clients.
The solution is to update to a patched version of Puma (when it becomes available) or to switch servers (e.g., iodine).
Discovering the Issue (thanks @ioquatix)
During September 2019, I was asked about adding the Plezi framework to a Web Framework benchmark repo on GitHub.
This led to a comment by Samuel Williams (@ioquatix on GitHub) pointing out that wrk doesn’t report blocked clients where the Puma server is concerned. Samuel linked to a YouTube video demonstrating this issue using a patched fork of wrk.
In essence, Samuel showed that the Puma server serves a client at a time, not a request at a time, allowing some clients to dominate the server (I suspect this is due to this line of code).
Now, while Samuel was demonstrating a benchmarking bias issue with wrk, my (slightly paranoid) mind was thinking DoS attack vector.
I mean, sure, it sounds like such a small thing, but it means that some clients get blocked when running a Puma server.
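To make the difference concrete, here is a tiny, single-worker simulation I wrote for illustration (it is not Puma’s or iodine’s actual code, just a sketch of the two scheduling strategies): a “client at a time” server keeps the same keep-alive connection at the front of the line, while a “request at a time” server sends the connection to the back of the line after every response.

# Illustrative sketch only (not real server code): one worker, ten keep-alive
# clients, each with more requests pending than the server has time for.
BUDGET  = 20            # total requests the server manages to handle during the test
CLIENTS = (0..9).to_a   # ten client connections

def simulate(strategy)
  served = Hash.new(0)
  queue  = CLIENTS.dup
  BUDGET.times do
    client = queue.shift          # take the connection at the front of the line
    served[client] += 1           # handle one request for it
    if strategy == :client_at_a_time
      queue.unshift(client)       # the same connection stays at the front
    else
      queue.push(client)          # the connection re-queues behind everyone else
    end
  end
  served
end

p simulate(:client_at_a_time)   # => {0=>20} (one client monopolizes the worker)
p simulate(:request_at_a_time)  # => {0=>2, 1=>2, ..., 9=>2} (everyone gets a turn)

With Puma’s pool of 2 workers and 2 threads, the same logic means only as many clients as there are busy threads get served, which is exactly what the numbers below show.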
In Numbers (short requests)
For example, when 10 clients are sending concurrent requests to Puma and Puma is limited to 2 workers and 2 threads (per worker), only 4 clients get their requests handled (2×2):
# server cmd: $ puma -w 2 -t 2 -p 3000
# benchmark cmd: $ wrk -c10 -d4 -t1 http://localhost:3000/
# puma version: 4.1.1

Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
  connection 0: 30248 requests completed
  connection 1: 30432 requests completed
  connection 2: 30247 requests completed
  connection 3: 1 requests completed
  connection 4: 30438 requests completed
  connection 5: 1 requests completed
  connection 6: 1 requests completed
  connection 7: 1 requests completed
  connection 8: 1 requests completed
  connection 9: 1 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   131.57us   20.95us   1.63ms   92.95%
    Req/Sec    29.73k     4.73k    31.30k   97.56%
  121371 requests in 4.10s, 21.53MB read
  Socket errors: connect 0, read 0, write 0, timeout 6
Requests/sec:  29590.77
Transfer/sec:      5.25MB
In this example, connections 0-2 and 4 had their requests handled. They were the “clients” served by Puma.
Connections 3 and 5-9 never got served during the benchmark. Their requests were completed only during the shutdown phase (when the other clients stopped sending requests).
More importantly, using the original version of wrk, we would never have known that these connections had to wait for the other (misbehaving?) connections. In addition, their huge latency doesn’t show in the benchmark, and (in essence) the benchmark behaves as if it’s testing only 4 concurrent clients.
This shouldn’t happen and it doesn’t happen with every server.
For example, with iodine (2 workers, 2 threads per worker), the server serves a request at a time, meaning all clients get served:
# server cmd: $ iodine -w 2 -t 2 -p 3000
# benchmark cmd: $ wrk -c10 -d4 -t1 http://localhost:3000/
# iodine version: 0.7.33

Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
  connection 0: 49729 requests completed
  connection 1: 49721 requests completed
  connection 2: 47599 requests completed
  connection 3: 49722 requests completed
  connection 4: 49723 requests completed
  connection 5: 47598 requests completed
  connection 6: 47597 requests completed
  connection 7: 47602 requests completed
  connection 8: 47600 requests completed
  connection 9: 47595 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    83.66us   37.29us   2.06ms   95.08%
    Req/Sec   118.91k    19.49k  128.13k    95.12%
  484486 requests in 4.11s, 133.07MB read
Requests/sec: 117992.86
Transfer/sec:     32.41MB
Falcon, a fiber-based server, shows a balanced client-handling distribution, with speeds similar to Puma’s:
# server cmd: $ falcon --hybrid --threads 2 --forks 2 -b http://localhost:3000

Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
  connection 0: 12063 requests completed
  connection 1: 12049 requests completed
  connection 2: 12063 requests completed
  connection 3: 12049 requests completed
  connection 4: 12063 requests completed
  connection 5: 12049 requests completed
  connection 6: 12062 requests completed
  connection 7: 12049 requests completed
  connection 8: 12062 requests completed
  connection 9: 12049 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   355.84us  377.37us  12.23ms   94.27%
    Req/Sec    29.58k     4.84k   31.29k    95.12%
  120558 requests in 4.10s, 21.38MB read
Requests/sec:  29385.91
Transfer/sec:      5.21MB
In Numbers (long requests)
Short requests are nice. They finish in a heartbeat… especially when I’m running both the server and wrk on my local machine (no network delays / overhead).
Longer requests might also run super fast on my local machine, but they require the data to be buffered, and “fast” becomes a more relative term.
I was wondering: when the responses are buffered, does Puma rotate clients, or does the security issue persist?
This is an important question, since some clients might be slower than others.
So I wrote a small Rack application (listed in the Testing Code section below) that serves the wrk-master.zip file I downloaded from Samuel’s forked repo. It’s the version that reports all the unserved clients, and it’s about 9MB in size.
The issue persisted for long responses. Puma would wait for a client to finish downloading, then process that same client’s new requests before serving other clients (only connections 0-3 are served):
# server cmd: $ puma -w 2 -t 2 -p 3000

Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
  connection 0: 7 requests completed
  connection 1: 7 requests completed
  connection 2: 7 requests completed
  connection 3: 7 requests completed
  connection 4: 1 requests completed
  connection 5: 1 requests completed
  connection 6: 1 requests completed
  connection 7: 1 requests completed
  connection 8: 1 requests completed
  connection 9: 1 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   589.81ms   21.89ms  635.66ms   85.71%
    Req/Sec     5.89       1.05     8.00     77.78%
  34 requests in 5.20s, 301.29MB read
  Socket errors: connect 0, read 0, write 0, timeout 6
Requests/sec:      6.54
Transfer/sec:     57.92MB
This shouldn’t happen and it doesn’t happen with every server.
With iodine (even without its X-Sendfile support*), all clients are served concurrently, improving performance significantly, albeit with an uneven distribution:
# server cmd: $ iodine -w 2 -t 2 -p 3000

Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
  connection 0: 44 requests completed
  connection 1: 21 requests completed
  connection 2: 47 requests completed
  connection 3: 22 requests completed
  connection 4: 15 requests completed
  connection 5: 19 requests completed
  connection 6: 18 requests completed
  connection 7: 20 requests completed
  connection 8: 43 requests completed
  connection 9: 21 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   151.48ms   75.48ms  343.21ms   57.45%
    Req/Sec    63.98      21.09   101.00     61.90%
  270 requests in 4.23s, 2.34GB read
Requests/sec:     63.84
Transfer/sec:    565.72MB
Falcon performed even better when sending large responses (ignoring iodine’s X-Sendfile support), and it doesn’t have any issues with blocked clients:
# server cmd: $ falcon --hybrid --threads 2 --forks 2 -b http://localhost:3000

Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
  connection 0: 55 requests completed
  connection 1: 55 requests completed
  connection 2: 39 requests completed
  connection 3: 39 requests completed
  connection 4: 39 requests completed
  connection 5: 37 requests completed
  connection 6: 55 requests completed
  connection 7: 40 requests completed
  connection 8: 39 requests completed
  connection 9: 55 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    89.46ms   17.15ms  142.12ms   67.33%
    Req/Sec   111.72      20.31   141.00     62.50%
  453 requests in 4.11s, 3.92GB read
Requests/sec:    110.32
Transfer/sec:      0.95GB
When I tested iodine’s X-Sendfile support, client distribution was more even, and iodine clocked in at 355.05 req/sec, significantly faster than Puma’s 6.54 req/sec:
# server cmd: $ iodine -w 2 -t 2 -p 3000 -www ./

Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
  connection 0: 146 requests completed
  connection 1: 146 requests completed
  connection 2: 145 requests completed
  connection 3: 146 requests completed
  connection 4: 146 requests completed
  connection 5: 145 requests completed
  connection 6: 146 requests completed
  connection 7: 146 requests completed
  connection 8: 146 requests completed
  connection 9: 146 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    27.55ms    8.44ms   50.66ms   65.23%
    Req/Sec   355.42      47.25   383.00     97.37%
  1458 requests in 4.11s, 12.62GB read
Requests/sec:    355.05
Transfer/sec:      3.07GB
* Note: to run iodine with X-Sendfile support enabled, the static file server option must be enabled by pointing it at a public folder. This changes the command line for iodine, adding -www ./ to serve static files from the current folder.
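For reference, the X-Sendfile variant of the Rack response is the same line that appears (commented out) in the Testing Code section at the end of this post. Instead of streaming the file body through Ruby, the response only carries a header that iodine intercepts and serves directly:

# X-Sendfile variant of the benchmark response (see the Testing Code section):
# iodine picks up the header and sends the file itself, bypassing the Ruby layer.
[200, { 'X-Sendfile' => './wrk-master.zip' }, []]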
Is this a Security Vulnerability?
In general, yes, this is a security vulnerability and it should be fixed. EDIT: Issue fixed with Puma version 4.3.1.
However, the issue is probably masked and (mostly) protected against by the load-balancer / proxy.
Assuming you’re running behind a load-balancer / proxy, such as Nginx, with rate limiting enabled, you’re probably (or mostly) covered.
Even without a rate limiter (but assuming a connection limiter, such as Nginx with limit_conn), clients wishing to abuse this attack vector might need to use a DDoS (Distributed DoS) approach to open enough connections when Puma is running behind a proxy. This could make the attack vector less attractive.
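For illustration only (the zone names and limits below are hypothetical placeholders, not recommendations), connection and request limiting in Nginx might look something like this:

# hypothetical Nginx snippet - limits and names are placeholders
http {
    limit_conn_zone $binary_remote_addr zone=per_ip_conn:10m;
    limit_req_zone  $binary_remote_addr zone=per_ip_req:10m rate=10r/s;

    server {
        listen 80;
        location / {
            limit_conn per_ip_conn 10;              # cap open connections per IP
            limit_req  zone=per_ip_req burst=20;    # throttle request rate per IP
            proxy_pass http://127.0.0.1:3000;       # the Puma / iodine backend
        }
    }
}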
Please note, attacks may still be possible even with these protections enabled – so it’s important to address the issue as well as make sure your proxy is set up correctly.
Also, if you’re exposing Puma directly to the internet, this is definitely a security threat (and it might not be the only one you should be worried about).
Does this Affect Performance?
Yes, this issue would negatively impact performance, at the very least where latency is concerned.
I do not know if, after solving the issue, Puma would serve more requests per second. However, even if the speed does not improve, solving this issue would improve overall (average) latency.
I assume that we will see improved performance once a fix is introduced to Puma, at the very least where latency is concerned.
Note that not all benchmarking tools would show the improved performance, since some of them don’t measure latency for blocked clients (only gods know why).
Does this Issue Cost Money?
Probably.
If your application has a high load of users (imagine GitHub running on Puma), this issue would introduce request timeouts because some clients are being blocked.
Without knowledge of this issue, one might believe that the server is too slow / too busy to serve all clients and horizontal scaling would ensue.
This, often automated, response to the issue would increase operating costs where changing the application server would have been a cheaper solution.
In fact, this isn’t a question of the number of users, but rather a question of user concurrency. Even a small number of users that “clock in” at the same time will cause spikes or even result in DoS (I’m thinking of school systems such as OL).
This is aggravated by the fact that most clients may start out behaving as busy clients. Because most web pages include more than a single resource that needs to be loaded (in addition to AJAX requests, etc.), even non-malicious clients will send multiple requests.
Are Heroku and Rails Wrong?
Developers often quote to me the fact that Heroku recommends Puma.
However, this recommendation is outdated (it’s from 2015). Also, as we can see, benchmarking and testing tools don’t always show the true picture: wrk definitely failed to show this issue.
Sometimes developers point out that Rails uses Puma as its default server.
I doubt the Rails team was aware of the issue. Besides, Rails needs a default server that works on as many systems as possible (JRuby, Ruby MRI, etc.), making Puma a better candidate than platform-specific servers (iodine is aimed only at serving Ruby MRI).
It is my humble opinion that the question of choosing a server should be revisited.
What to Do?
I ran the tests with Puma 4.1.1 on MacOS and Puma 4.2.0 on Ubuntu.
I assume that by the time you read this, the issue will have been fixed and the solution will be to upgrade your Puma server to the latest version.
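If Puma is managed through Bundler, the upgrade is a one-line change (the version constraint below assumes the patched release mentioned in the edit at the top of this post):

# Gemfile - require the patched Puma release (or later)
gem 'puma', '>= 4.3.1'

Then run bundle update puma and redeploy.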
Even though the data is already out there (on YouTube), I contacted Evan Phoenix (@evanphx on GitHub) about this issue before publishing this information. He informed me that a fix / patch should be coming soon.
If you read this before Puma published a fix, I recommend you switch to iodine. The change should be easy enough and painless (in fact, you might see performance gains).
Just remember that, if you do switch to iodine: while Puma prefers configuration files, iodine prefers command line arguments and environment variables.
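For example (a minimal sketch matching the 2-worker / 2-thread setup used in the benchmarks above; the file name is the usual convention, not something from this post), a Puma configuration file and its iodine command-line equivalent:

# config/puma.rb - Puma reads its settings from a configuration file
workers 2
threads 2, 2
port 3000

The iodine equivalent is passed on the command line (or through environment variables):

bundle exec iodine -w 2 -t 2 -p 3000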
Testing Code
The short-request Rack application used for the benchmarks (placed in config.ru):
module ShortExample
  # This is the HTTP response object according to the Rack specification.
  HTTP_RESPONSE = [200, { 'Content-Type' => 'text/html',
                          'Content-Length' => '121' },
                   ['Hello my dear world, this is a simple testing applications that returns the same string 100% of the time. Somewhat boring']]

  # This function will be called by the Rack server for every request.
  def self.call env
    # Simply return the HTTP_RESPONSE object, no matter what request was received.
    HTTP_RESPONSE
  end
end

run ShortExample
The long-request Rack application used for the benchmarks (placed in config.ru):
module LongerExample
  # This function will be called by the Rack server for every request.
  def self.call env
    # Open the `wrk-master.zip` file (downloaded from: https://github.com/ioquatix/wrk)
    # and serve the file.
    # When using X-Sendfile (supported by iodine), uncomment the following line:
    # return [200, {'X-Sendfile' => './wrk-master.zip'}, []]
    file = File.open('./wrk-master.zip')
    # Rack expects header values to be Strings, hence the `to_s`.
    [200, { 'Content-Length' => file.size.to_s }, file]
  end
end

run LongerExample
The terminal command used to run Puma:
puma -w 2 -t 2 -p 3000
The terminal command used to run iodine:
iodine -w 2 -t 2 -p 3000
The terminal command used to run Falcon (avoiding bias based on setting):
falcon --hybrid --threads 2 --forks 2 -b http://localhost:3000
To get X-Sendfile support, iodine must run with a public folder specified:
iodine -w 2 -t 2 -p 3000 -www ./
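As a quick sanity check before benchmarking (my own addition, not part of the original test runs), a plain request from another terminal should return the sample response:

curl -i http://localhost:3000/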
(1. EDITED to add falcon tests and fix spelling / names)
(2. EDITED to add risk clarifications)