Scaling socket.io across multiple nodes

Real-time web interactions are an interesting challenge that stays relevant as developers gravitate between single-page applications, server-rendered pages and everything in between. Websockets are a possible solution for interacting live with your users, especially if server-sent events are not enough and polling at some sustainable interval is too slow. One popular client-server library built on top of NodeJS and Express is Socket.io, which not only implements websockets but also falls back neatly to polling. This matters in many production cases since, for example, a large European ISP like Vodafone blocks websockets on its networks.

In my particular case, if you follow this blog you may already know that I work with event apps, both on mobile platforms and for the web. This means that websockets are a best-case scenario that can be affected by multiple factors like corporate firewalls, bad hotel wi-fi or an area with weak mobile signal. On the other hand, real-time interactions can add a lot to an event. And while taking questions from an audience may not stress these connections too much, things are very different with live polling, for example, where everyone can be sending their vote within a fifteen-second window. That's why a library that focuses on resilience, and not only on websocket performance, is very helpful for real-time interactions that can happen from just about anywhere.

For benchmarking how these interactions perform under load, I've been using Artillery.io, which comes with a Socket.io engine built in and can spin up AWS resources to scale your tests reliably. I try to set up scenarios that are as close as possible to the real use cases and lean towards the worst that could happen. Let's settle on one that is easy to understand: a livestream chat where, within a five-minute window, everyone shows up and says hi. This involves authenticating each connection to assign it to the correct chat room and, as each person types hello, everyone receives their message. Repeated testing with a single NodeJS process helped me identify several bottlenecks, and not all of them were related to Socket.io as it wrestles with polling and websockets. The services that I have working together to respond to this scenario involve NodeJS, Redis for pub/sub messaging and MySQL as the database.
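As a rough sketch of what that scenario looks like in Artillery, the phase below matches the five-minute arrival window; the `join` and `message` event names, the target URL and the hard-coded token are hypothetical stand-ins for the real authentication flow:

```yaml
config:
  target: "https://chat.example.com"   # assumed endpoint
  phases:
    - duration: 300                    # the five-minute window
      arrivalCount: 5000               # everyone shows up within it
scenarios:
  - engine: "socketio"
    flow:
      - emit:
          channel: "join"              # hypothetical auth/room-assignment event
          data: "some-user-token"
      - emit:
          channel: "message"
          data: "hi"
      - think: 60                      # stay connected to receive broadcasts
```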

Even without running any load tests, you can imagine not wanting to wait for database writes in this scenario, so it's natural to reach for something like Redis. I have a main NodeJS process that handles all the connections and publishes any inserts through a messaging channel that a secondary process subscribes to and lazily writes to the database. Running tests over this setup eventually gets to a point where application state is correctly cached or queued to be saved, and the relevant delays are all about Socket.io handling connections to get everyone communicating with everyone else (like I said, a worst-case scenario). In practice this means that, once you get above 5k people arriving within five minutes, clients successively time out as Socket.io can't keep up with the barrage of broadcasting requests. Even just above 3k you start to see this problem, as upgrading each connection from polling to websockets already represents load that is made worse by having every user greet all the others. This is when I double-checked that, yes indeed, we can't offer websockets as the only way to connect.
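The split looks roughly like this; a minimal sketch assuming ioredis and mysql2, with the channel, table and column names as placeholders rather than the exact production code:

```js
// main.js: handle connections, broadcast immediately, persist lazily.
const io = require("socket.io")(3000);
const Redis = require("ioredis");
const pub = new Redis();

io.on("connection", (socket) => {
  socket.on("message", (msg) => {
    io.to(msg.room).emit("message", msg);              // everyone sees it right away
    pub.publish("chat:persist", JSON.stringify(msg));  // the insert can wait
  });
});
```

```js
// writer.js: the secondary process drains the channel into MySQL at its own pace.
const Redis = require("ioredis");
const mysql = require("mysql2/promise");

const sub = new Redis();
const db = mysql.createPool({ host: "localhost", user: "app", database: "chat" });

sub.subscribe("chat:persist");
sub.on("message", async (_channel, payload) => {
  const { room, userId, body } = JSON.parse(payload);
  await db.query(
    "INSERT INTO messages (room, user_id, body) VALUES (?, ?, ?)",
    [room, userId, body]
  );
});
```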

For someone with still limited experience of NodeJS, I was happy to see some performance gains while still having only one process handle all these users arriving in a short period of time. But of course the next step was to see what I could take from Socket.io's documentation on scaling across multiple nodes. Again, because connections start with polling before upgrading to websockets, the proposed solution involves sticky sessions, with NginX distributing them and Redis coordinating between processes. However, there are not a lot of options for NginX to identify what goes where. In the case of event apps, where everyone at a venue can connect from the same IP address, it's not viable to use IPs to distinguish between sessions in NginX. Furthermore, websockets don't allow for custom headers that could alternatively fulfil that purpose.
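For context, the Redis half of the documented approach is essentially a one-line adapter; it's the sticky-session half that puts all the weight on NginX. A minimal sketch using the socket.io-redis package, assuming Redis on its default local port:

```js
// Attach the Redis adapter so broadcasts from one process also reach
// sockets connected to the other processes.
const io = require("socket.io")(3000);
const redisAdapter = require("socket.io-redis");

io.adapter(redisAdapter({ host: "localhost", port: 6379 }));
```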

Also not having a lot of experience with NginX, I took a step back and considered what actually had to happen to distribute the load across different nodes. Even if I somehow managed to implement what the documentation suggests, I would still need to develop the logic for processes to communicate with each other as messages get posted in chat, deleted, pinned, etc. That rabbit hole seemed to be the most relevant one, while the rest of the issue looked approachable without relying so much on NginX. Clients can reach directly for different endpoints, distributed by some derivation of the user's identifier, which gets validated when they connect. And each node can be maintained by identical processes, the only difference being the subdomain and respective port that each one listens on. Through Redis, nodes can pub/sub to each other's messages to maintain a common state of the chat for each livestream, as sketched below.
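A minimal sketch of both halves of that idea; the chat-N subdomains, the modulo derivation, the NODE_ID environment variable and the event names are all illustrative assumptions, not the exact production code:

```js
// client.js: derive the endpoint from the user's id, so each user
// consistently lands on the same node without sticky sessions.
const io = require("socket.io-client");
const user = { id: 42, token: "some-user-token" }; // from the app's auth flow

const NODE_COUNT = 4; // assumption: four identical Socket.io processes
const shard = user.id % NODE_COUNT;
const socket = io(`https://chat-${shard}.example.com`, {
  query: { token: user.token }, // validated by the node on connection
});
```

```js
// node.js: each process broadcasts locally and relays the event to its
// peers through Redis, skipping events it published itself.
const io = require("socket.io")(Number(process.env.PORT));
const Redis = require("ioredis");
const pub = new Redis();
const sub = new Redis();
const NODE_ID = process.env.NODE_ID; // assumed per-process identifier

sub.subscribe("chat:events");
sub.on("message", (_channel, raw) => {
  const evt = JSON.parse(raw);
  if (evt.origin !== NODE_ID) {
    io.to(evt.room).emit(evt.type, evt.payload); // replay peers' events locally
  }
});

function broadcast(room, type, payload) {
  io.to(room).emit(type, payload); // local sockets first
  pub.publish("chat:events", JSON.stringify({ origin: NODE_ID, room, type, payload }));
}
```

The same broadcast function then serves posting, deleting and pinning messages, since every state change travels through the shared channel.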

That was what I ended up implementing and testing for up to 12k users saying hello in chat within five minutes. The added overhead of publishing and subscribing to all state changes means that you probably want at least three nodes running to justify scaling this way, but once you do, further nodes perform as well as you might expect. In this scenario, that means about one node for every 3k users to guarantee you don't get any timeouts, making sure that attendees coming into a virtual event get a nice, reliable first experience. For other scenarios like massive live polling, not having to broadcast a firehose of messages to everyone may provide some breathing room that allows for that much shorter fifteen-second window, but that also needs to be tested.

Meanwhile, it's even possible to call Socket.io itself into question, given how other technologies outperform NodeJS in benchmarks of simultaneous websocket connections. But the batteries that come included with this well-established library are risky to dispense with. Besides the polling fallback, it includes automatic reconnections and packet buffering, all things that you may end up having to implement and battle-test yourself if you focus only on websocket performance. Also, there's the YAGNI factor of investing in another technology when the scale that NodeJS easily allows for already matches the needs of many events. And beyond that, there is also the option of going for a third-party solution like Firebase, which promises to get you to numbers like 50k with a not-so-complex rewrite. All things have trade-offs, so I guess it's a matter of relying on the right solution at the right time.

If you liked this article, you might want to subscribe to the RSS feed, maybe follow my Twitter or learn more about me.
