Puma Server: a Client at a Time (security issue)

EDIT: Issue fixed with Puma version 4.3.1.

A security vulnerability and performance issue exist in the Ruby application server Puma: Puma may delay processing requests from some clients.

This issue may result in degraded performance (at best) or a temporary DoS for some/all clients.

The solution would be to update to a patched version of Puma (when it becomes available) or switch servers (e.g., to iodine).

Discovering the Issue (thanks @ioquatix)

During September 2019, I was asked about adding the Plezi framework to a Web Framework benchmark repo on GitHub.

This led to a comment by Samuel Williams (@ioquatix on GitHub) pointing out that wrk doesn’t report blocked clients where the Puma server is concerned. Samuel linked to a YouTube video demonstrating this issue using a patched fork of wrk.

In essence, Samuel showed that the Puma server serves a client at a time, not a request at a time, allowing some clients to dominate the server (I suspect this is due to this line of code).

Now, while Samuel was demonstrating a benchmarking bias issue with wrk, my (slightly paranoid) mind was thinking DoS attack vector.

I mean, sure, it sounds like such a small thing – but it means that some clients get blocked when running a Puma server.
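To make the difference concrete, here is a schematic sketch of the two scheduling models. This is NOT Puma's actual code – the helper names (read_request, handle, reactor) are invented for illustration:

# Schematic sketch only – helper names are invented, not Puma internals.
# "Client at a time": a worker thread keeps serving the same socket for
# as long as that client has another keep-alive request ready:
while (request = read_request(socket))
  write_response(socket, handle(request))
end
# a busy client never releases the thread, starving the other sockets

# "Request at a time": after every response the socket goes back into
# the shared queue, so the next request may come from a different client:
request = read_request(socket)
write_response(socket, handle(request))
reactor.requeue(socket)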

In Numbers (short request)

For example, when 10 clients are sending concurrent requests to Puma and Puma is limited to 2 workers and 2 threads (per worker), only 4 clients get their requests handled (2×2):

# server cmd:    $ puma -w 2 -t 2 -p 3000
# benchmark cmd: $ wrk -c10 -d4 -t1 http://localhost:3000/
# puma version:  4.1.1
Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
connection 0: 30248 requests completed
connection 1: 30432 requests completed
connection 2: 30247 requests completed
connection 3: 1 requests completed
connection 4: 30438 requests completed
connection 5: 1 requests completed
connection 6: 1 requests completed
connection 7: 1 requests completed
connection 8: 1 requests completed
connection 9: 1 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   131.57us   20.95us   1.63ms   92.95%
    Req/Sec    29.73k     4.73k   31.30k    97.56%
  121371 requests in 4.10s, 21.53MB read
  Socket errors: connect 0, read 0, write 0, timeout 6
Requests/sec:  29590.77
Transfer/sec:      5.25MB

In this example, connections 0-2 and 4 had their requests handled. They were the “clients” served by Puma.

Connections 3 and 5-9 never got served during the benchmark. Their requests were completed only during the shutdown phase (when the other clients stopped sending requests).

More importantly, using the original version of wrk, we would never have known that these connections had to wait for the other (misbehaving?) connections. In addition, their huge latency doesn’t show in the benchmark and (in essence) the benchmark behaves as if it’s testing only 4 concurrent clients.
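As a back-of-the-envelope illustration using the numbers above: the six starved connections effectively waited the entire 4 second run, so a per-connection average latency would be roughly (4 × 0.13ms + 6 × 4,000ms) ÷ 10 ≈ 2.4 seconds – four orders of magnitude worse than the 131.57us average that wrk reports.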

This shouldn’t happen and it doesn’t happen with every server.

For example, with iodine (2 workers, 2 threads per worker), the server serves a request at a time, meaning all clients get served:

# server cmd:    $ iodine -w 2 -t 2 -p 3000
# benchmark cmd: $ wrk -c10 -d4 -t1 http://localhost:3000/
# iodine version: 0.7.33
Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
connection 0: 49729 requests completed
connection 1: 49721 requests completed
connection 2: 47599 requests completed
connection 3: 49722 requests completed
connection 4: 49723 requests completed
connection 5: 47598 requests completed
connection 6: 47597 requests completed
connection 7: 47602 requests completed
connection 8: 47600 requests completed
connection 9: 47595 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    83.66us   37.29us   2.06ms   95.08%
    Req/Sec   118.91k    19.49k  128.13k    95.12%
  484486 requests in 4.11s, 133.07MB read
Requests/sec: 117992.86
Transfer/sec:     32.41MB

Falcon, a fiber based server, shows a balanced client handling distribution, with speeds similar to Puma:

# server cmd:    $ falcon --hybrid --threads 2 --forks 2 -b http://localhost:3000
Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
connection 0: 12063 requests completed
connection 1: 12049 requests completed
connection 2: 12063 requests completed
connection 3: 12049 requests completed
connection 4: 12063 requests completed
connection 5: 12049 requests completed
connection 6: 12062 requests completed
connection 7: 12049 requests completed
connection 8: 12062 requests completed
connection 9: 12049 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   355.84us  377.37us  12.23ms   94.27%
    Req/Sec    29.58k     4.84k   31.29k    95.12%
  120558 requests in 4.10s, 21.38MB read
Requests/sec:  29385.91
Transfer/sec:      5.21MB

In Numbers (long requests)

Short requests are nice. They finish in a heartbeat… especially when I’m running both the server and wrk on my local machine (no network delays / overhead).

Longer requests might also run super fast on my local machine, but they require the data to be buffered, and “fast” becomes a relative term.

I was wondering – when the responses are buffered, does Puma rotate clients or does the security issue persist?

This is an important question, since some clients might be slower than others.

So I wrote a small Rack application that serves the wrk-master.zip file I downloaded from Samuel’s forked repo. It’s the version that reports all the unserved clients, and it’s about 9MB in size.

The issue persisted for long responses – Puma would wait for a client to finish downloading and would process that same client’s new requests before serving other clients (only connections 0-3 are served):

# server cmd:    $ puma -w 2 -t 2 -p 3000
Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
connection 0: 7 requests completed
connection 1: 7 requests completed
connection 2: 7 requests completed
connection 3: 7 requests completed
connection 4: 1 requests completed
connection 5: 1 requests completed
connection 6: 1 requests completed
connection 7: 1 requests completed
connection 8: 1 requests completed
connection 9: 1 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   589.81ms   21.89ms 635.66ms   85.71%
    Req/Sec     5.89      1.05     8.00     77.78%
  34 requests in 5.20s, 301.29MB read
  Socket errors: connect 0, read 0, write 0, timeout 6
Requests/sec:      6.54
Transfer/sec:     57.92MB

This shouldn’t happen and it doesn’t happen with every server.

With iodine (even without its X-Sendfile support*), all clients are served concurrently, improving performance significantly, albeit with an uneven distribution:

# server cmd:    $ iodine -w 2 -t 2 -p 3000
Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
connection 0: 44 requests completed
connection 1: 21 requests completed
connection 2: 47 requests completed
connection 3: 22 requests completed
connection 4: 15 requests completed
connection 5: 19 requests completed
connection 6: 18 requests completed
connection 7: 20 requests completed
connection 8: 43 requests completed
connection 9: 21 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   151.48ms   75.48ms 343.21ms   57.45%
    Req/Sec    63.98     21.09   101.00     61.90%
  270 requests in 4.23s, 2.34GB read
Requests/sec:     63.84
Transfer/sec:    565.72MB

Falcon performed even better when sending large responses (ignoring iodine’s X-Sendfile support), and it doesn’t have any issues with blocked clients:

# server cmd:    $ falcon --hybrid --threads 2 --forks 2 -b http://localhost:3000
Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
connection 0: 55 requests completed
connection 1: 55 requests completed
connection 2: 39 requests completed
connection 3: 39 requests completed
connection 4: 39 requests completed
connection 5: 37 requests completed
connection 6: 55 requests completed
connection 7: 40 requests completed
connection 8: 39 requests completed
connection 9: 55 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    89.46ms   17.15ms 142.12ms   67.33%
    Req/Sec   111.72     20.31   141.00     62.50%
  453 requests in 4.11s, 3.92GB read
Requests/sec:    110.32
Transfer/sec:      0.95GB

When I tested iodine’s X-Sendfile, client distribution was more even, and iodine clocked in at 355.05 req/sec, significantly faster than Puma’s 6.54 req/sec:

# server cmd:    $ iodine -w 2 -t 2 -p 3000 -www ./
Running 4s test @ http://localhost:3000/
  1 threads and 10 connections
connection 0: 146 requests completed
connection 1: 146 requests completed
connection 2: 145 requests completed
connection 3: 146 requests completed
connection 4: 146 requests completed
connection 5: 145 requests completed
connection 6: 146 requests completed
connection 7: 146 requests completed
connection 8: 146 requests completed
connection 9: 146 requests completed
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    27.55ms    8.44ms  50.66ms   65.23%
    Req/Sec   355.42     47.25   383.00     97.37%
  1458 requests in 4.11s, 12.62GB read
Requests/sec:    355.05
Transfer/sec:      3.07GB

* Note: to run iodine with X-Sendfile support enabled, the static file server option must be enabled by pointing at a public folder. This changes the command line for iodine, adding -www ./ to serve static files from the current folder.

Is this a Security Vulnerability?

In general, yes, this is a security vulnerability and it should be fixed. EDIT: Issue fixed with Puma version 4.3.1.

However, the issue is probably masked and (mostly) protected against by the load-balancer / proxy.

Assuming you’re running behind a load-balancer / proxy, such as Nginx, with rate limiting enabled, you’re probably (or mostly) covered.

Even without a rate limiter (but assuming a connection limiter, such as Nginx with max_conns), clients wishing to abuse this attack vector might need a DDoS (Distributed DoS) approach to open enough connections when Puma is running behind a proxy. This could make the attack vector less attractive.

Please note, attacks may still be possible even with these protections enabled – so it’s important to address the issue as well as make sure your proxy is set up correctly.

Also, if you’re exposing Puma directly to the internet, this is definitely a security threat (and it might not be the only one you should be worried about).

Does this Affect Performance?

Yes, this issue would negatively impact performance, at the very least where latency is concerned.

I do not know if, after solving the issue, Puma would serve more requests per second. However, even if throughput does not improve, solving this issue would improve overall (average) latency.

I assume that we will see improved performance once a fix is introduced to Puma.

Note that not all benchmarking tools would show the improved performance, since some of them don’t measure latency for blocked clients (only gods know why).

Does this Issue Cost Money?

Probably.

If your application has a high load of users (imagine GitHub running on Puma), this issue would introduce request timeouts because some clients are being blocked.

Without knowledge of this issue, one might believe that the server is too slow / too busy to serve all clients and horizontal scaling would ensue.

This, often automated, response to the issue would increase operating costs where changing the application server would have been a cheaper solution.

In fact, this isn’t a question of the number of users, but rather a question of user concurrency. Even a small number of users that “clock in” at the same time will cause spikes or even result in DoS (I’m thinking of school systems such as OL).

This is aggravated by the fact that most clients start out behaving like busy clients. Because most web pages include more than a single resource that needs to be loaded (in addition to AJAX requests, etc.), even non-malicious clients will send multiple requests.

Are Heroku and Rails Wrong?

Developers often point out to me that Heroku recommends Puma.

However, this recommendation is outdated. It’s from 2015. Also, as we can see, benchmarking and testing tools don’t always show the true picture – wrk definitely failed to show this issue.

Sometimes developers point out that Rails uses Puma as their default server.

I doubt the Rails team was aware of the issue. Besides, Rails needs a default server that works on as many systems as possible (JRuby, Ruby MRI, etc.), making Puma a better candidate than platform-specific servers (iodine only targets Ruby MRI).

It is my humble opinion that the question of choosing a server should be revisited.

What to Do?

I ran the tests with Puma 4.1.1 on MacOS and Puma 4.2.0 on Ubuntu.

I assume that by the time you read this, the issue would be fixed and the solution would be to upgrade your Puma server to the latest version.

Even though the data is already out there (on YouTube), I contacted Evan Phoenix (@evanphx on GitHub) about this issue before publishing this information. He informed me that a fix / patch should be coming soon.

If you read this before Puma published a fix, I recommend you switch to iodine. The change should be easy enough and painless (in fact, you might see performance gains).

Just remember that, if you do switch to iodine: while Puma prefers configuration files, iodine prefers command line arguments and environment variables.
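For example, the Puma setup used in this post would normally live in a configuration file, while iodine takes the same settings as flags. The config/puma.rb below is a sketch using Puma's standard configuration DSL:

# config/puma.rb – Puma's Ruby configuration DSL
workers 2
threads 2, 2
port 3000

# the equivalent iodine invocation is all command line:
#   $ iodine -w 2 -t 2 -p 3000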


Testing Code

The short request Rack application used for the benchmarks was (placed in config.ru):

module ShortExample
  # This is the HTTP response object according to the Rack specification.
  HTTP_RESPONSE = [200, { 'Content-Type' => 'text/html',
            'Content-Length' => '121' },
   ['Hello my dear world, this is a simple testing applications that returns the same string 100% of the time. Somewhat boring']]

  # This function will be called by the Rack server for every request.
  def self.call env
   # Simply return the HTTP_RESPONSE object, no matter what request was received.
   HTTP_RESPONSE
  end
end
run ShortExample

The long request Rack application used for the benchmarks was (placed in config.ru):

module LongerExample
  # This function will be called by the Rack server for every request.
  def self.call env
   # Open the `wrk-master.zip` file (downloaded from: https://github.com/ioquatix/wrk)
   # and serve it.
   # When using X-Sendfile (supported by iodine), uncomment the following line:
   #   return [200, {'X-Sendfile' => './wrk-master.zip'}, []]
   file = File.open('./wrk-master.zip')
   # Rack header values must be strings, hence the `to_s`.
   [200, {'Content-Length' => file.size.to_s}, file]
  end
end
run LongerExample
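As an aside, the example above leans on File#each (which yields “lines” – an odd fit for binary data) to satisfy Rack’s body interface. A fixed-size chunked body could be sketched like this (ChunkedFile is my own helper name, not part of Rack):

# A Rack body object that streams fixed-size chunks instead of "lines".
class ChunkedFile
  CHUNK_SIZE = 65_536
  def initialize(path)
    @path = path
  end
  # Rack only requires the body to respond to `each`.
  def each
    File.open(@path, 'rb') do |io|
      while (chunk = io.read(CHUNK_SIZE))
        yield chunk
      end
    end
  end
end

# usage inside LongerExample.call:
#   [200, { 'Content-Length' => File.size('./wrk-master.zip').to_s },
#    ChunkedFile.new('./wrk-master.zip')]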

The terminal command used to run Puma:

puma -w 2 -t 2 -p 3000

The terminal command used to run iodine:

iodine -w 2 -t 2 -p 3000

The terminal command used to run Falcon (avoiding bias based on setting):

falcon --hybrid --threads 2 --forks 2 -b http://localhost:3000

To get X-Sendfile support, iodine must run with a public folder specified:

iodine -w 2 -t 2 -p 3000 -www ./

(1. EDITED to add falcon tests and fix spelling / names)

(2. EDITED to add risk clarifications)

Ruby’s Rack Push: Decoupling the real-time web application from the web

(UPDATED: updates to the PR are now reflected in the post)

Something exciting is coming.

Everyone is talking about WebSockets and their older cousin EventSource / Server Sent Events (SSE). Faye and ActionCable are all the rage and real-time updates are becoming easier than ever.

But it’s all a mess. It’s hard to set up, it’s hard to maintain. The performance is meh. In short, the existing design is expensive – it’s expensive in developer hours and it’s expensive in hardware costs.

However, a new PR in the Rack repository promises to change all that in the near future.

This PR is a huge step towards simplifying our code base, improving real-time performance and lowering the overall cost of real-time web applications.

In a sentence, it’s an important step towards decoupling the web application from the web.

Remember, Rack is the interface Ruby frameworks (such as Rails and Sinatra) and web applications use to communicate with Ruby application servers. It’s everywhere. So this is a big deal.

The Problem in a Nutshell

The problem with the current standard approach, in a nutshell, is that each real-time application process has to run two servers in order to support real-time functionality.

The two servers might be listening on the same port, they might be hidden away in some gem, but at the end of the day, two different IO event handling units have to run side by side.

“Why?” you might ask. Well, since you asked, I’ll tell you (if you didn’t ask, skip to the solution).

The story of the temporary hijack

This is the story of a quick temporary solution coming up on its 5th year as the only “standard” Rack solution available.

At some point in our history, the Rack specification needed a way to support long polling and other HTTP techniques. Specifically, Rails 4.0 needed something for their “live stream” feature.

For this purpose, the Rack team came up with the hijack API approach.

This approach allowed for a quick fix to a pressing need. It was meant to be temporary, something quick until Rack 2.0 was released (5 years later, the Rack protocol is still at version 1.3).

The hijack API offers applications complete control of the socket. Just hijack the socket away from the server and voilà, instant long polling / SSE support… sort of.
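For reference, a minimal full hijack looks something like this – a sketch of the spec’d API, not production code:

# config.ru – full hijack per the Rack specification (sketch only).
app = proc do |env|
  if env['rack.hijack?']
    io = env['rack.hijack'].call # grab the raw socket; the server steps aside
    Thread.new do
      # from here on the APPLICATION owns framing, buffering and timeouts
      io.write "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi"
      io.close
    end
    [-1, {}, []] # the returned response is ignored after a full hijack
  else
    [200, { 'Content-Type' => 'text/plain' }, ['no hijack support']]
  end
end
run app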

That’s where things started to get messy.

To handle the (now “free”) socket, a lot of network logic had to be copied from the server layer to the application layer (buffering write calls, handling incoming data, protocol management, timeout handling, etc.).

This is an obvious violation of the “S” in S.O.L.I.D (single responsibility), as it adds IO handling responsibilities to the application / framework.

It also violates the DRY principle, since the IO handling logic is now duplicated (once within the server and once within the application / framework).

Additionally, this approach has issues with HTTP/2 connections, since the network protocol and the application are now entangled.

The obvious hijack price

The hijack approach has many costs, some hidden, some more obvious.

The most easily observed prices are memory, performance and developer hours.

Due to code duplication and extra work, the memory consumption for hijack based solutions is higher and their performance is slower (more system calls, more context switches, etc’).

Using require 'faye' will add WebSockets to your application, but it takes almost 9MB of memory just to load the gem (before any actual work is performed).

On the other hand, using the agoo or iodine HTTP servers will add both WebSockets and SSE to your application without any extra memory consumption.

To be more specific, using iodine will consume about 2MB of memory, marginally less than Puma, while providing both HTTP and real-time capabilities.

The hidden hijack price

A more subtle price is higher hardware costs and a lower clients-per-machine ratio when using hijack.

Why?

Besides the degraded performance, the hijack approach allows some HTTP servers to lean on the select system call (Puma used select last time I took a look).

This system call breaks down at around the 1024 open file limit, possibly limiting each process to 1024 open connections.

When a connection is hijacked, the sockets don’t close as fast as the web server expects, eventually leading to breakage and possible crashes if the 1024 open file limit is exceeded.
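You can check the limits your own Ruby process is working with, using Ruby’s standard Process API:

# Inspect the process file-descriptor limits from Ruby:
soft, hard = Process.getrlimit(:NOFILE)
puts "fd soft limit: #{soft}, hard limit: #{hard}"
# Note: select() breaks near FD_SETSIZE (commonly 1024) no matter how
# high these limits are raised – epoll / kqueue based polling does not
# share this cap.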

The Solution – Callbacks and Events

The new proposed Rack Push PR offers a wonderfully effective way to implement WebSockets and SSE while allowing an application to remain totally server agnostic.

This new proposal leaves the responsibility for the network / IO handling with the server, simplifying the application’s code base and decoupling it from the network protocol.

By using a callback object, the application is notified of events, leaving it free to focus on the data rather than the network layer.

The callback object doesn’t even need to know anything about the server running the application or the underlying protocol.

~~The callback object is automatically linked to the correct API using Ruby’s extend approach, allowing the application to remain totally server agnostic.~~ EDIT: the PR was updated, replacing the extend approach with an extra client object.

How it works

Every Rack server uses a Hash type object to communicate with a Rack application.

This is how Rails is built, this is how Sinatra is built and this is how every Rack application / framework is built. It’s in the current Rack specification.

A simple Hello world using Rack would look like this (placed in a file called config.ru):

# normal HTTP response
RESPONSE = [200, { 'Content-Type' => 'text/html',
          'Content-Length' => '12' }, [ 'Hello World!' ] ]
# note the `env` variable
APP = Proc.new {|env| RESPONSE }
# The Rack DSL used to run the application
run APP

This new proposal introduces the env['rack.upgrade?'] variable.

Normally, this variable is set to nil (or missing from the env Hash).

However, for WebSocket connections, the env['rack.upgrade?'] variable is set to :websocket, and for EventSource (SSE) connections the variable is set to :sse.

To set a callback object, the env['rack.upgrade'] variable is introduced (notice the missing question mark).

Now the design might look like this:

# Place in config.ru
RESPONSE = [200, { 'Content-Type' => 'text/html',
          'Content-Length' => '12' }, [ 'Hello World!' ] ]
# an example Callback class
class MyCallbacks
  def on_open client
    puts "* Push connection opened."
  end
  def on_message client, data
    puts "* Incoming data: #{data}"
    client.write "Roger that, \"#{data}\""
  end
  def on_close client
    puts "* Push connection closed."
  end
end
# note the `env` variable
APP = Proc.new do |env|
  if(env['rack.upgrade?'])
    env['rack.upgrade'] = MyCallbacks.new
    [200, {}, []]
  else
    RESPONSE
  end
end
# The Rack DSL used to run the application
run APP

Run this application with the Agoo or Iodine servers and let the magic sparkle.

For example, using Iodine:

# install iodine, version 0.6.0 and up
gem install iodine
# start in single threaded mode
iodine -t 1

Now open the browser, visit localhost:3000 and open the browser console to test some JavaScript.

First try an EventSource (SSE) connection (run in browser console):

// An SSE example 
var source = new EventSource("/");
source.onmessage = function(msg) {
  console.log(msg.id);
  console.log(msg.data);
};

Sweet! Nothing happened just yet (we aren't sending notifications), but we have an open SSE connection!
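If you want to see data actually arrive over that SSE connection, the callback object can push from on_open. Here is a small variation on the MyCallbacks class above (as I understand the proposal, client.write on an SSE connection should be delivered as a data: event):

# re-opening the MyCallbacks class from the example above
class MyCallbacks
  def on_open client
    puts "* Push connection opened."
    client.write "Welcome!" # should arrive in the browser's source.onmessage handler
  end
end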

What about WebSockets (run in browser console):

// A WebSocket example 
ws = new WebSocket("ws://localhost:3000/");
ws.onmessage = function(e) { console.log(e.data); };
ws.onclose = function(e) { console.log("closed"); };
ws.onopen = function(e) { e.target.send("Hi!"); };

Wow! Did you look at the Ruby console? We have working WebSockets – it's that easy.

And this same example will run perfectly using the Agoo server as well (both Agoo and Iodine already support the Rack Push proposal).

Try it:

# install the agoo server, version 2.1.0 and up
gem install agoo
# start it up
rackup -s agoo -p 3000

Notice, no gems, no extra code, no huge memory consumption, just the Ruby server and raw Rack (I didn't even use a framework just yet).

The amazing push

So far, it's so simple, it's hard to notice how powerful this is.

Consider implementing a stock ticker, or in this case, a timer:

# Place in config.ru
RESPONSE = [200, { 'Content-Type' => 'text/html',
          'Content-Length' => '12' }, [ 'Hello World!' ] ]

# A global live connection storage
module LiveList
  @list = []
  @lock = Mutex.new
  def <<(connection)
    @lock.synchronize { @list << connection }
  end
  def >>(connection)
    @lock.synchronize { @list.delete connection }
  end
  def any?
    # check whether any connections are left on the "live list"
    @lock.synchronize { @list.any? }
  end
  # this will send a message to all the connections that share the same process.
  # (in cluster mode we get partial broadcasting only and this doesn't scale)
  def broadcast(data)
    # copy the list so we don't perform long operations in the critical section
    tmp = nil # place tmp in this part of the scope
    @lock.synchronize do
      tmp = @list.dup # copy list into tmp
    end
    # iterate list outside of critical section
    tmp.each {|c| c.write data }
  end
  extend self
end

# Broadcast the time every second... but...
# Threads will BREAK in cluster mode.
@thread = Thread.new do
  loop do
    sleep(1)
    # without this check-after-sleep, the loop would exit immediately at
    # boot, before any connection had a chance to register
    next unless LiveList.any?
    LiveList.broadcast "The time is: #{Time.now}"
  end
end

# an example static Callback module
module MyCallbacks
  def on_open client
    # add connection to the "live list"
    LiveList << client
  end
  def on_message(client, data)
    # Just an example broadcast
    LiveList.broadcast "Special Announcement: #{data}"
  end
  def on_close client
    # remove connection from the "live list"
    LiveList >> client
  end
  end
  extend self
end

# The Rack application
APP = Proc.new do |env|
  if(env['rack.upgrade?'])
    env['rack.upgrade'] = MyCallbacks
    [200, {}, []]
  else
    RESPONSE
  end
end
# The Rack DSL used to run the application
run APP

Run the iodine server in single process mode: iodine -w 1 and the little timer is ticking.

Honestly, I don’t love the code I just wrote for the previous example. It’s a little long, it’s slightly iffy and we can’t use iodine’s cluster mode.
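(As an aside, the timer itself could be made cluster-friendly with the same pub/sub idea. This is only a sketch, assuming iodine's Iodine.run_every and Iodine.publish methods:)

# Cluster-friendly timer sketch – assumes iodine's timer and pub/sub API.
# Published messages reach subscribers in every worker process, so no
# shared LiveList is required. Note that in cluster mode each worker runs
# its own timer, so you would want to guard this to publish only once.
Iodine.run_every(1000) do
  Iodine.publish "time", "The time is: #{Time.now}"
end
# ...and in the callback object:
#   def on_open(client)
#     client.subscribe "time"
#   end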

For my next example, I’ll author a chat room in 32 lines (including comments).

I will use Iodine’s pub/sub extension API to avoid the LiveList module and the timer thread. I don’t want a timer, so I’ll skip the Iodine.run_every method.

Also, I’ll limit the interaction to WebSocket clients. Why? to show I can.

This will better demonstrate the power offered by the new env['rack.upgrade'] approach and it will also work in cluster mode.

Sadly, this means that the example won’t run on Agoo for now.

# Place in config.ru
RESPONSE = [200, { 'Content-Type' => 'text/html',
          'Content-Length' => '12' }, [ 'Hello World!' ] ]
CHAT = "chat".freeze
# a Callback class
class MyCallbacks
  def initialize env
     @name = env["PATH_INFO"][1..-1]
     @name = "unknown" if(@name.length == 0)
  end
  def on_open client
    client.subscribe CHAT
    client.publish CHAT, "#{@name} joined the chat."
  end
  def on_message client, data
    client.publish CHAT, "#{@name}: #{data}"
  end
  def on_close client
    client.publish CHAT, "#{@name} left the chat."
  end
end
# The actual Rack application
APP = Proc.new do |env|
  if(env['rack.upgrade?'] == :websocket)
    env['rack.upgrade'] = MyCallbacks.new(env)
    [200, {}, []]
  else
    RESPONSE
  end
end
# The Rack DSL used to run the application
run APP

Start the application from the command line (in terminal):

iodine

Now try (in the browser console):

ws = new WebSocket("ws://localhost:3000/Mitchel");
ws.onmessage = function(e) { console.log(e.data); };
ws.onclose = function(e) { console.log("Closed"); };
ws.onopen = function(e) { e.target.send("Yo!"); };

EDIT: Agoo 2.1.0 now implements pub/sub extensions, albeit, using slightly different semantics. I did my best so the same code would work on both servers.

Why didn’t anyone think of this sooner?

Actually, this isn’t a completely new idea.

Even as the hijack API itself was being suggested, an alternative approach was proposed.

Another proposal was attempted a few years ago.

But it seems things are finally going to change, as two high performance servers, agoo and iodine, already support this new approach.

Things look promising.

UPDATE: code examples were updated to reflect changes in the Rack specification’s PR.

The dark side of the Rack and Websockets dreams

There’s a dark side to Ruby web applications.

One thing is clear – real time web applications in Ruby are hurting.

We can argue a lot about who it is that allowed themselves to succumb to the dark side. Some argue that Rails is derailing us all and others that socket.io, Faye and node.js are eating away at our sanity, design, performance and feet… but arguing won’t help.

As I pointed out in an issue I submitted in Rack’s GitHub repo, Rack’s design forces a performance penalty on realtime connections and I think I found a solution to work around the penalty…

If you read through to the end (or skip to it), I’ll show you the secret of The Priceless Websocket.

Wait, what’s Rack?

If you don’t know what Rack is, you might want to read Gaurav Chande’s blog post about it, or not.

Basically, Rack is the way Ruby applications (Rails / Sinatra, etc.) connect to web servers such as Puma, Thin or (my very own) Iodine. HTTP requests are forwarded to Ruby applications through the Rack interface, and the web applications use the same interface to send back their responses.

How is Rack hurting Websockets connections?

Well, not intentionally… but the road to… well, you know.

A bit of background

When Rack started out, it was a wonderful blessing. It was back when smartphones were stupid, Ruby was young and the year was (I think) 2007 or 2006.

At the time, the cloud was a floating thing in the sky and no one imagined Facebook might one day use our phone’s camera and microphone to collect information about us while we talk to our neighbors (does Facebook spy on us? their privacy statement reads like Chinese to me).

Rack adopted a lot of its models and requirements from what was a thriving technology at the time (may it rest in peace), CGI.

If you don’t know what CGI is, I’m happy for you.

Basically it means that Rack decided that they would provide the Ruby application with information about the request, but it would not provide a response object, rather, the Ruby application’s return value is the response.

I think you can see the problem from here…

This was probably the biggest mistake Rack ever made, but it took a number of years before we started to realize it, and voices like José Valim’s began warning us about Why our web framework should not adopt Rack API.

The design meant that Rack forced the server to wait for Ruby to return a response; only then was any data sent back through the connection.

No streaming, no persistence – a request goes into Ruby and a response is whatever comes out.
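The entire contract fits in a few lines – the return value is the response, and once it’s returned, the conversation is over:

# config.ru – the whole CGI-style request/response contract:
app = proc { |env| [200, { 'Content-Type' => 'text/plain' }, ['done']] }
run app
# once that array is returned, there is no way to keep writing to the client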

Fast Forward: Back To The Now

The CGI based Rack API seems to be here to stay.

Whatever warnings or rebellions we had died in the hard light of practicality and the DRY Ruby culture.

As big a mistake as it might have been (or maybe it wasn’t a mistake, who cares at this point?), we have a decade of code, gems and community support that requires and implements Rack’s design.

But we can still have Websockets, can’t we?

When the Rack team realized the issue was big, very big, an interim (now permanent?) solution came up – connection hijacking.

This was a small piece of Rack API that said “we know Rack enforces limitations, so why don’t we get out of your way? Here, take the raw IO (the TCP/IP socket) and do whatever you want”.

Now we have Websockets (and SSE and HTTP streaming), but at a price – a high price.

Also, this API (hijacking) is starting to break apart. HTTP/2 compliant servers can’t (or shouldn’t) really support hijacking without running HTTP/2 connections into the ground. This means no streaming / SSE on HTTP/2 connections, and HTTP/2 push seems far away…

The Price We Pay

Here is how Websockets are implemented today and a good hint at the price we pay:

  • Applications hijack the TCP/IP socket from the web server.

    Read:

    We tell the professionals to f#@k 0ff and we take control of the low(er)-level IO.

  • Applications run their own, separate, IO handling for the hijacked connections:

    Read:

    We run duplicate code and duplicate concern management. The server’s IO handling & our new beautiful idea of how networking should work both run at the same time…

    …I really hope you know what you’re doing with that IO thingy. Is it supposed to turn purple when you code it this way?

  • Applications roll out a Websocket protocol parser… often inside the Ruby GIL/GVL (if you don’t know this one, ignore it, just think “bad”).

    Read:

    We ditch concurrency in favor of a “lock and wait” design, choosing a slow tool when we have an existing fast tool that was giving us both concurrency and speed just a moment ago.

In practice solutions vary between manageable, terrible and tragic.

No matter how good a solution you find, there is always the price of code duplication and running two different IO polling concerns. This is an immediate cause for memory bloats, excessive context switches and other unwieldy slowdowns.

However, manageable solutions implement a low level IO layer written by network programming savvy developers. Such solutions often use the fresh nio4r gem or the famous (infamous?) EventMachine.

On the other side of the spectrum, you’ll find solutions that pay the ultimate price, using any or all of the following: a thread per Websocket design (god forbid), the select system call (why does everyone ignore the fact that it’s limited to 1024 connections?), blocking write calls until all the data was sent…

… My first implementation was this kind of monstrosity that paid the ultimate price in performance. I copied the design from well meaning tutorials on the internet…

You wouldn’t believe the amount of misguided tutorials on network programming.

It’s worse than I’m really telling you. If you think I’m just ranting, go read through the ActionCable code and see for yourself.

The Priceless Websocket

Rack’s API, which forces us to pay such a big price for real time applications, can be easily adjusted to support “priceless” (read: native) websocket connections.

The idea is quite simple – since the Rack response can’t be adjusted without breaking existing code and throwing away a decade’s worth of middleware… why not use Rack’s request object (the famous env) to implement a solution?

In other words, what if everything we had to do to upgrade from HTTP to Websockets was something like this: env['upgrade.websocket'] = MyCallbacks.new(env)…?

What can we gain? Well, just for a start:

  • We can unify the IO polling for all IO objects (HTTP, Websocket, no need for hijack). This is bigger than you think.

  • We can parse websocket data before entering the GIL, winning some of our concurrency back. This also means we can better utilize multi-core CPUs.

  • We don’t need to know anything about network programming – let the people programming servers do what they do best and let the people who write web applications focus on their application logic.

I’ve got a POC in me pocket

I have a Proof Of Concept in my POCket for you, if you’re running Ruby MRI 2.2+ with OS X (or maybe Linux).

Yes, I know, I tested this proof of concept on one machine (mine) and it requires a Unix OS and Ruby C extension support (so it’s limited to Ruby MRI, I guess), but it works, it’s sweet and it provides a Rack bridge between a C Websocket server and a Ruby application.

The QAD (quick and dirty) rundown

  1. The server pushes the HTTP request to the Ruby application.

  2. The Ruby application can’t use the Rack response to ask for a Websocket upgrade (remember, we have middleware, years of code and protective layers that prevent us from doing anything that isn’t a valid HTTP response)… So…

…The Ruby application updates the request (the Rack env Hash) with a Websocket Callback Object – it’s so simple you will cry when you see it.

  3. We’re done. The server simply uses the wonderful qualities of Ruby meta-programming to add Websocket functionality to the Websocket Callback Object and we’re off to the real-time arena.

We Want Code, We Want Code…

Here’s my Proof Of Concept, written out as a simple Rack application. If you’re like me – too lazy to copy and paste the code – you can download it from here.

The proof of concept is the ugliest chatroom application I could imagine writing. It uses nothing except Rack and the Iodine web server.

Iodine is a Rack server written in C and it implements the suggested solution using upgrade.websocket and a feature check using upgrade.websocket? (the feature check is only available for upgrade requests).

I’m hopeful that after reading and running the code, you will help me push upgrade.websocket into the Rack standard.

The gemfile

Every Ruby application seems to start with a gemfile these days. They are usually full with goodies…

…but I’ll just put in the server we’re using (iodine has since been released as a gem, so a plain gem line is enough). I think that’s all we need really, since it references Rack as a dependency.

Save this as Gemfile:

source 'https://rubygems.org'

# The Iodine Server
gem 'iodine', '~> 0.2.0'

The Rack Application

Now you’ll see the simple magic of a native websocket implementation – no Ruby parser, no Ruby IO management, it’s all provided by the server, no hijacking necessary.

The code is fairly simple, so I’ll add comments as we go.

Rack applications, as a convention, reside in files named config.ru. Save this to config.ru:

# The Rack Application container
module MyRackApplication
  # Rack applications use the `call` callback to handle HTTP requests.
  def self.call(env)
    # if upgrading...
    if env['HTTP_UPGRADE'.freeze] =~ /websocket/i
      # We can assign a class or an instance that implements callbacks.
      # We will assign an object, passing it the request information (`env`)
      env['upgrade.websocket'.freeze] = MyWebsocket.new(env)
      # Rack responses must be a 3 item array
      # [status, {http: :headers}, ["response body"]]
      return [0, {}, []]
    end
    # a semi-regular HTTP response
    out = File.open File.expand_path('../index.html', __FILE__)
    [200, { 'X-Sendfile' => File.expand_path('../index.html', __FILE__),
            'Content-Length' => out.size.to_s }, out]
  end
end

# The Websocket Callback Object
class MyWebsocket
  # this is optional, but I wanted the object to have the nickname provided in
  # the HTTP request
  def initialize(env)
    # we need to change the ASCII Rack encoding to UTF-8,
    # otherwise everything with the nickname will be a binary "blob" in the
    # JavaScript layer
    @nickname = env['PATH_INFO'][1..-1].force_encoding 'UTF-8'
  end

  # A classic websocket callback, called when the connection is opened and
  # linked to this object
  def on_open
    puts 'We have a websocket connection'
  end

  # A classic websocket callback, called when the connection is closed
  # (after disconnection).
  def on_close
    puts "Bye Bye... #{count} connections left..."
  end

  # A server-side niceness, called when the server is shutting down,
  # to gracefully disconnect (before disconnection).
  def on_shutdown
    write 'The server is shutting down, goodbye.'
  end

  def on_message(data)
    puts "got message: #{data} encoded as #{data.encoding}"
    # data is a temporary string, its buffer is cleared as soon as we return.
    # So we make a copy with the desired format.
    tmp = "#{@nickname}: #{data}"
    # The `write` method was added by the server and writes to the current
    # connection
    write tmp
    puts "broadcasting #{tmp.bytesize} bytes with encoding #{tmp.encoding}"
    # `each` was added by the server and excludes this connection
    # (each except self).
    each { |h| h.write tmp }
  end
end

# `run` is a Rack API command, telling Rack where the `call(env)` callback is located.
run MyRackApplication

What does the application do?

  • The application checks if the request is a websocket upgrade request.

  • If the request is a websocket upgrade request, the application sets the Websocket Callback Object in the env hash (technically the request data Hash) and sends back an empty response (we could have set cookies or headers if we wanted).

  • If the request is not a websocket upgrade request, it sends the browser side client file index.html (we’ll get to it in a bit).

That’s it.

The Websocket Callback Object is quite easy to decipher, it basically answers the on_open, on_message(data), on_shutdown and on_close callbacks. The on_shutdown is the least common, it’s a server side callback for graceful disconnections and it’s called before a disconnection.

I like the snake-case Ruby convention and I thought it would serve the names of the callbacks well (instead of the JavaScript way of onclose and onmessage, which always annoys me).

There were a number of design decisions I won’t go into here, but most oddities – such as each == “each except self” – have good reasons behind them, such as performance and common use patterns.

The Html Client

Every web application needs a browser side client. I know you can write one yourself, but here, you can copy my ugly a$$ version – save it as index.html:

<!DOCTYPE html>
<html>
<head>
    <a href="https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js">https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js</a>

        ws = NaN
        handle = ''
        function onsubmit(e) {
            e.preventDefault();
            if($('#text')[0].value == '') {return false}
            if(ws && ws.readyState == 1) {
                ws.send($('#text')[0].value);
                $('#text')[0].value = '';
            } else {
                handle = $('#text')[0].value
                var url = (window.location.protocol.match(/https/) ? 'wss' : 'ws') +
                            '://' + window.document.location.host +
                            '/' + $('#text')[0].value
                ws = new WebSocket(url)
                ws.onopen = function(e) {
                    output("<b>Connected :-)</b>");
                    $('#text')[0].value = '';
                    $('#text')[0].placeholder = 'your message';
                }
                ws.onclose = function(e) {
                    output("<b>Disonnected :-/</b>")
                    $('#text')[0].value = '';
                    $('#text')[0].placeholder = 'nickname';
                    $('#text')[0].value = handle
                }
                ws.onmessage = function(e) {
                    output(e.data);
                }
            }
            return false;
        }
        function output(data) {
            $('#output').append("<li>" + data + "</li>")
            $('#output').animate({ scrollTop:
                        $('#output')[0].scrollHeight }, "slow");
        }

    <style>
    html, body {width:100%; height: 100%; background-color: #ddd; color: #111;}
    h3, form {text-align: center;}
    input {background-color: #fff; color: #111; padding: 0.3em;}
    </style>
</head><body>
  <h3>The Ugly Chatroom POC</h3>
    <form id='form'>
        <input type='text' id='text' name='text' placeholder='nickname' />
        <input type='submit' value='send' />
    </form>
    <script> $('#form')[0].onsubmit = onsubmit </script>
    <ul id='output'></ul>
</body>
</html>

(yes, I’m an expert at CSS and I couldn’t care less about the design for this one)

Running the server

To run the application we just wrote, we need to run two commands from the terminal (in the folder where we put the application files).

Install Iodine and any required gems using*:

bundler install

Run the application (single threaded mode) using:

bundler exec iodine -www . -p 3000

As well as running our dynamic web application, this will start a static HTTP file server in the current folder (the -www option pointing at .), so make sure the folder only has the application files in it – or cat pictures, the internet loves cat pictures.

Now visit localhost:3000 and see what we’ve got.

A nice experiment is to run the server using multi threads (-t #) and multi processes (-w #). You’ll notice that memory barriers for forked processes prevent websocket broadcasting from reaching websockets connected through a different process. Maybe try using a number of open browser windows to grab a few different processes.

bundler exec iodine -www ./ -t 16 -w 4

You can benchmark against Puma (or whatever) to make sure the HTTP service isn’t affected by this added server-side requirement. It’s true that the server now reviews each request (in addition to reviewing each response), but this doesn’t seem to induce a performance hit.

* If you got funny errors while trying to compile Iodine, it might not be supported on your system. Make sure you’re running a Linux / OS X / BSD operating system with the latest clang or gcc compilers. Neither Solaris nor Windows are supported at the moment (they have very different IO routines).

A few pointers (not C pointers)

  • You may have noticed that the on_message(data) provides a very temporary data string. The C layer will immediately recycle the memory for the next incoming message (not waiting for the garbage collector).

This is perfect for the common JSON messages that are often parsed and discarded, and it enhances Websocket performance, but it’s less comfortable for this chatroom demo where we need to create a new string before broadcasting.

  • Try benchmarking Iodine against other servers that don’t provide native websockets and meta-programming facilities (i.e. against Puma or Thin). I think you’ll be surprised, especially when using more than a single worker process and a few extra threads.

To run the application with Puma: bundler exec puma -w 4 -p 3000

To run the application with Iodine: bundler exec iodine -t 16 -w 4 -p 3000

Benchmark using ab: ab -n 100000 -c 20 -k http://127.0.0.1:3000/

Benchmark using wrk: wrk -c20 -d4 -t4 http://localhost:3000/

How many concurrent connections can they handle? How much memory do they use?
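One quick way to eyeball per-process memory from inside the application – a rough sketch using the ps utility available on Linux / macOS:

# Print the resident set size (RSS) of the current Ruby process, in MB:
rss_kb = `ps -o rss= -p #{Process.pid}`.strip.to_i
puts "Process #{Process.pid} is using ~#{rss_kb / 1024}MB"

Run it from irb, or drop it into the app temporarily.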

You may say I’m a dreamer…

…but I really hope I’m not the only one.

Ruby is a beautiful, beautiful language. I’m sad to see so many people complain about how hard and non-performant it is to write real-time web applications in Ruby.

I’ve opened an issue at Rack’s GitHub repo trying to explain the upside of this type of addition to the Rack Specification, but it’s been a few weeks and no one from the team seems to have looked at it.

If you like the idea, please visit the issue and add your voice.


Update

I’m thankful for all the positive feedback and attention this blog post received. It sadly brought into sharp relief the slowness with which things are (not) moving with regards to the Rack issue… Perhaps the right thing would be to open issues with the other servers (Puma, Thin, Unicorn, etc.) and ask them to implement the upgrade.websocket workflow.

I edited the post to change the proposed rack.websocket name I used before to upgrade.websocket, since this allows future protocols to be placed under the same “namespace”… upgrade.tcp anyone?


Update2

When I first published this article, Iodine was still in development.

I have tested Iodine quite a lot since, worked out most of the issues that might have affected it, and I think it can now be used in production.

Iodine has since been released as a gem.

Also, I fixed some changes that WordPress introduced independently (their new editor is terrible).