Puma Server: a Client at a Time (security issue)

EDIT: Issue fixed with Puma version 4.3.1.

A security vulnerability and performance issue exists in the Ruby application server Puma. Puma may delay processing requests from some clients.

This issue may result in degraded performance (at best) or a temporary DoS for some/all clients.

The solution would be to update to a patched version of Puma (when it becomes available) or to switch servers (e.g., iodine).

Discovering the Issue (thanks @ioquatix)

During September 2019, I was asked about adding the Plezi framework to a Web Framework benchmark repo on GitHub.

This led to a comment by Samuel Williams (@ioquatix on GitHub) pointing out that wrk doesn’t report blocked clients where the Puma server is concerned. Samuel linked to a YouTube video demonstrating this issue using a patched fork of wrk.

In essence, Samuel showed that the Puma server serves a client at a time, not a request at a time, allowing some clients to dominate the server (I suspect this is due to this line of code).

Now, while Samuel was demonstrating a benchmarking bias issue with wrk, my (slightly paranoid) mind was thinking DoS attack vector.

I mean, sure, it sounds like such a small thing – but it means that some clients get blocked when running a Puma server.
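The difference between the two scheduling models can be sketched with a toy simulation. This is illustrative Ruby only, not Puma's actual code: each client is modeled as a counter of pending requests, and the server has a fixed number of processing slots per round.

```ruby
# Toy model: "client at a time" pins each slot to one client until that
# client's queue empties -- extra clients starve, as in the wrk output below.
def serve_client_at_a_time(clients, capacity:, rounds:)
  served = Hash.new(0)
  active = clients.keys.first(capacity) # slots grab the first clients and keep them
  rounds.times do
    active.each do |c|
      next if clients[c].zero?
      clients[c] -= 1
      served[c] += 1
    end
  end
  served
end

# Toy model: "request at a time" puts the connection back in the queue after
# every request, so all clients share the available slots.
def serve_request_at_a_time(clients, capacity:, rounds:)
  served = Hash.new(0)
  queue = clients.keys.cycle
  rounds.times do
    capacity.times do
      c = queue.next
      next if clients[c].zero?
      clients[c] -= 1
      served[c] += 1
    end
  end
  served
end

greedy = serve_client_at_a_time({ a: 100, b: 100, c: 100 }, capacity: 2, rounds: 10)
fair   = serve_request_at_a_time({ a: 100, b: 100, c: 100 }, capacity: 2, rounds: 10)
# `greedy` only ever touches the first two clients; `fair` spreads the work.
```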

In Numbers (short requests)

For example, when 10 clients are sending concurrent requests to Puma and Puma is limited to 2 workers and 2 threads (per worker), only 4 clients get their requests handled (2×2):

# server cmd:    $ puma -w 2 -t 2 -p 3000
# benchmark cmd: $ wrk -c10 -d4 -t1 http://localhost:3000/
# puma version:  4.1.1
Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
connection 0: 30248 requests completed
connection 1: 30432 requests completed
connection 2: 30247 requests completed
connection 3: 1 requests completed
connection 4: 30438 requests completed
connection 5: 1 requests completed
connection 6: 1 requests completed
connection 7: 1 requests completed
connection 8: 1 requests completed
connection 9: 1 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   131.57us   20.95us   1.63ms   92.95%
    Req/Sec    29.73k     4.73k   31.30k    97.56%
  121371 requests in 4.10s, 21.53MB read
  Socket errors: connect 0, read 0, write 0, timeout 6
Requests/sec:  29590.77
Transfer/sec:      5.25MB

In this example, connections 0-2 and 4 had their requests handled. They were the “clients” served by Puma.

Connections 3 and 5-9 never got served during the benchmark. Their requests were completed only during the shutdown phase (when the other clients stopped sending requests).

More importantly, using the original version of wrk, we would never have known that these connections had to wait for the other (misbehaving?) connections. In addition, their huge latency doesn’t show in the benchmark and (in essence) the benchmark behaves as if it’s testing only 4 concurrent clients.

This shouldn’t happen and it doesn’t happen with every server.

For example, with iodine (2 workers, 2 threads per worker), the server serves a request at a time, meaning all clients get served:

# server cmd:    $ iodine -w 2 -t 2 -p 3000
# benchmark cmd: $ wrk -c10 -d4 -t1 http://localhost:3000/
# iodine version: 0.7.33
Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
connection 0: 49729 requests completed
connection 1: 49721 requests completed
connection 2: 47599 requests completed
connection 3: 49722 requests completed
connection 4: 49723 requests completed
connection 5: 47598 requests completed
connection 6: 47597 requests completed
connection 7: 47602 requests completed
connection 8: 47600 requests completed
connection 9: 47595 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    83.66us   37.29us   2.06ms   95.08%
    Req/Sec   118.91k    19.49k  128.13k    95.12%
  484486 requests in 4.11s, 133.07MB read
Requests/sec: 117992.86
Transfer/sec:     32.41MB

Falcon, a fiber-based server, shows a balanced client handling distribution, with speeds similar to Puma's:

# server cmd:    $ falcon --hybrid --threads 2 --forks 2 -b http://localhost:3000
Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
connection 0: 12063 requests completed
connection 1: 12049 requests completed
connection 2: 12063 requests completed
connection 3: 12049 requests completed
connection 4: 12063 requests completed
connection 5: 12049 requests completed
connection 6: 12062 requests completed
connection 7: 12049 requests completed
connection 8: 12062 requests completed
connection 9: 12049 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   355.84us  377.37us  12.23ms   94.27%
    Req/Sec    29.58k     4.84k   31.29k    95.12%
  120558 requests in 4.10s, 21.38MB read
Requests/sec:  29385.91
Transfer/sec:      5.21MB

In Numbers (long requests)

Short requests are nice. They finish in a heartbeat… especially when I’m running both the server and wrk on my local machine (no network delays / overhead).

Longer requests might also run super fast on my local machine, but they require the response data to be buffered, making "fast" a more relative term.

I was wondering – when the responses are buffered, does Puma rotate clients or does the security issue persist?

This is an important question, since some clients might be slower than others.

So I wrote a small Rack application that serves the wrk-master.zip file I downloaded from Samuel’s forked repo. It’s the version that reports all the unserved clients, and it’s about 9MB in size.

The issue persisted for long responses – Puma would wait for a client to finish downloading and process that same client’s new requests before serving other clients (only connections 0-3 are served):

# server cmd:    $ puma -w 2 -t 2 -p 3000
Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
connection 0: 7 requests completed
connection 1: 7 requests completed
connection 2: 7 requests completed
connection 3: 7 requests completed
connection 4: 1 requests completed
connection 5: 1 requests completed
connection 6: 1 requests completed
connection 7: 1 requests completed
connection 8: 1 requests completed
connection 9: 1 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   589.81ms   21.89ms 635.66ms   85.71%
    Req/Sec     5.89      1.05     8.00     77.78%
  34 requests in 5.20s, 301.29MB read
  Socket errors: connect 0, read 0, write 0, timeout 6
Requests/sec:      6.54
Transfer/sec:     57.92MB

This shouldn’t happen and it doesn’t happen with every server.

With iodine (even without its X-Sendfile support*), all clients are served concurrently, improving performance significantly, albeit with an uneven distribution:

# server cmd:    $ iodine -w 2 -t 2 -p 3000
Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
connection 0: 44 requests completed
connection 1: 21 requests completed
connection 2: 47 requests completed
connection 3: 22 requests completed
connection 4: 15 requests completed
connection 5: 19 requests completed
connection 6: 18 requests completed
connection 7: 20 requests completed
connection 8: 43 requests completed
connection 9: 21 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   151.48ms   75.48ms 343.21ms   57.45%
    Req/Sec    63.98     21.09   101.00     61.90%
  270 requests in 4.23s, 2.34GB read
Requests/sec:     63.84
Transfer/sec:    565.72MB

Falcon performed even better when sending large responses (ignoring iodine’s X-Sendfile support), and it doesn’t have any issues with blocked clients:

# server cmd:    $ falcon --hybrid --threads 2 --forks 2 -b http://localhost:3000
Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
connection 0: 55 requests completed
connection 1: 55 requests completed
connection 2: 39 requests completed
connection 3: 39 requests completed
connection 4: 39 requests completed
connection 5: 37 requests completed
connection 6: 55 requests completed
connection 7: 40 requests completed
connection 8: 39 requests completed
connection 9: 55 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    89.46ms   17.15ms 142.12ms   67.33%
    Req/Sec   111.72     20.31   141.00     62.50%
  453 requests in 4.11s, 3.92GB read
Requests/sec:    110.32
Transfer/sec:      0.95GB

When I tested iodine’s X-Sendfile support, client distribution was more even, and iodine clocked in at 355.05 req/sec, significantly faster than Puma’s 6.54 req/sec:

# server cmd:    $ iodine -w 2 -t 2 -p 3000 -www ./
Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
connection 0: 146 requests completed
connection 1: 146 requests completed
connection 2: 145 requests completed
connection 3: 146 requests completed
connection 4: 146 requests completed
connection 5: 145 requests completed
connection 6: 146 requests completed
connection 7: 146 requests completed
connection 8: 146 requests completed
connection 9: 146 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    27.55ms    8.44ms  50.66ms   65.23%
    Req/Sec   355.42     47.25   383.00     97.37%
  1458 requests in 4.11s, 12.62GB read
Requests/sec:    355.05
Transfer/sec:      3.07GB

* Note: to run iodine with X-Sendfile support enabled, the static file server option must be enabled by pointing at a public folder. This changes the command line for iodine, adding -www ./ to serve static files from the current folder.

Is this a Security Vulnerability?

In general, yes, this is a security vulnerability and it should be fixed. EDIT: Issue fixed with Puma version 4.3.1.

However, the issue is probably masked and (mostly) protected against by the load-balancer / proxy.

Assuming you’re running behind a load-balancer / proxy, such as Nginx, with rate limiting enabled, you’re probably (or mostly) covered.

Even without a rate limiter (but assuming a connection limiter, such as Nginx with limit_conn), clients wishing to abuse this attack vector might need a DDoS (Distributed DoS) approach to open enough connections when Puma is running behind a proxy. This could make the attack vector less attractive.
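The idea behind proxy-level connection limiting can be sketched in application code as a Rack middleware. This is an illustrative sketch only (the limit of 2 per IP and the 429 response are arbitrary choices; in production this belongs at the proxy, not in Ruby):

```ruby
# Rack middleware sketch: cap the number of requests being processed
# concurrently for any single client IP, rejecting the rest with 429.
class PerIPLimiter
  def initialize(app, limit: 2)
    @app = app
    @limit = limit
    @mutex = Mutex.new
    @active = Hash.new(0) # in-flight request count per IP
  end

  def call(env)
    ip = env['REMOTE_ADDR']
    allowed = @mutex.synchronize do
      next false if @active[ip] >= @limit
      @active[ip] += 1
      true
    end
    return [429, { 'Content-Type' => 'text/plain' }, ['Too Many Requests']] unless allowed
    begin
      @app.call(env)
    ensure
      @mutex.synchronize { @active[ip] -= 1 }
    end
  end
end
```

In a config.ru this would be mounted with `use PerIPLimiter, limit: 2` above the `run` line.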

Please note, attacks may still be possible even with these protections enabled – so it’s important to address the issue as well as make sure your proxy is set up correctly.

Also, if you’re exposing Puma directly to the internet, this is definitely a security threat (and it might not be the only one you should be worried about).

Does this Affect Performance?

Yes, this issue negatively impacts performance, at the very least where latency is concerned.

I do not know whether, after solving the issue, Puma will serve more requests per second. However, even if throughput does not improve, fixing the issue will improve the overall (average) latency.

Note that not all benchmarking tools would show the improved performance, since some of them don’t measure latency for blocked clients (only gods know why).

Does this Issue Cost Money?

Probably.

If your application has a high load of users (imagine GitHub running on Puma), this issue would introduce request timeouts because some clients are being blocked.

Without knowledge of this issue, one might believe that the server is too slow / too busy to serve all clients and horizontal scaling would ensue.

This, often automated, response to the issue would increase operating costs where changing the application server would have been a cheaper solution.

In fact, this isn’t a question of the number of users, but rather a question of user concurrency. Even a small number of users that “clock in” at the same time will cause spikes or even result in DoS (I’m thinking of school systems such as OL).

This is aggravated by the fact that most clients may start out behaving like busy clients. Because most web pages include more than a single resource that needs to be loaded (in addition to AJAX requests, etc.), even non-malicious clients will send multiple requests.

Are Heroku and Rails Wrong?

Developers often quote to me the fact that Heroku recommends Puma.

However, this recommendation is outdated. It’s from 2015. Also, as we can see, benchmarking and testing tools don’t always show the true picture – wrk definitely failed to show this issue.

Sometimes developers point out that Rails uses Puma as their default server.

I doubt that the Rails team was aware of the issue. Besides, Rails needs a default server that works on as many systems as possible (JRuby, Ruby MRI, etc.), making Puma a better candidate than platform-specific servers (iodine only targets Ruby MRI).

It is my humble opinion that the question of choosing a server should be revisited.

What to Do?

I ran the tests with Puma 4.1.1 on MacOS and Puma 4.2.0 on Ubuntu.

I assume that by the time you read this, the issue would be fixed and the solution would be to upgrade your Puma server to the latest version.

Even though the data is already out there (on YouTube), I contacted Evan Phoenix (@evanphx on GitHub) about this issue before publishing this information. He informed me that a fix / patch should be coming soon.

If you read this before Puma published a fix, I recommend you switch to iodine. The change should be easy enough and painless (in fact, you might see performance gains).

Just remember that, if you do switch to iodine: while Puma prefers configuration files, iodine prefers command line arguments and environment variables.
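If you do stay with Puma, the benchmark settings used above can live in a config/puma.rb file instead of the command line. This is a sketch using Puma's standard config DSL (workers, threads, and port are real Puma directives), roughly equivalent to `puma -w 2 -t 2 -p 3000`:

```ruby
# config/puma.rb
workers 2       # number of forked worker processes (-w 2)
threads 2, 2    # min, max threads per worker (-t 2)
port 3000       # listening port (-p 3000)
```

iodine, by contrast, reads the same settings from its command line flags (as shown in the benchmarks) or from environment variables.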


Testing Code

The short request Rack application used for the benchmarks was (placed in config.ru):

module ShortExample
  # This is the HTTP response object according to the Rack specification.
  HTTP_RESPONSE = [200, { 'Content-Type' => 'text/html',
                          'Content-Length' => '121' },
                   ['Hello my dear world, this is a simple testing applications that returns the same string 100% of the time. Somewhat boring']]

  # This function will be called by the Rack server for every request.
  def self.call(env)
    # Simply return the response object, no matter what request was received.
    HTTP_RESPONSE
  end
end
run ShortExample

The long request Rack application used for the benchmarks was (placed in config.ru):

module LongerExample
  # This function will be called by the Rack server for every request.
  def self.call(env)
    # Open the `wrk-master.zip` file (downloaded from: https://github.com/ioquatix/wrk)
    # and serve it as the response body.
    # When using X-Sendfile (supported by iodine), uncomment the following line:
    #   return [200, { 'X-Sendfile' => './wrk-master.zip' }, []]
    file = File.open('./wrk-master.zip', 'rb')
    # Rack requires header values to be Strings, hence the to_s.
    [200, { 'Content-Length' => file.size.to_s }, file]
  end
end
run LongerExample

The terminal command used to run Puma:

puma -w 2 -t 2 -p 3000

The terminal command used to run iodine:

iodine -w 2 -t 2 -p 3000

The terminal command used to run Falcon (matching the worker and thread settings to avoid bias):

falcon --hybrid --threads 2 --forks 2 -b http://localhost:3000

To get X-Sendfile support, iodine must run with a public folder specified:

iodine -w 2 -t 2 -p 3000 -www ./

(1. EDITED to add falcon tests and fix spelling / names)

(2. EDITED to add risk clarifications)
