Creating a poll that doesn't crash under load
I was hired to create a poll that would run during an audience show at the Nerdland Festival this weekend. The poll ran during one of the headlining shows in a tent with a 3.5K capacity. During the show the hosts showed a QR code on the big screens, which got scanned by loads of people in the audience, and they all voted at exactly the same time.
The obvious solution
My first reaction after getting the briefing was: simple, some serverless hosting (Vercel or Cloudflare), backed by some light database in the cloud (like Turso or D1). I settled on the Cloudflare route and went to work. Everything looked fine, and I was nearly ready to deliver the project when I ran a final load test using this free tool and quickly discovered that we had a problem.
250 requests per second, 26% success rate
Under sustained load of 250 simultaneous requests per second the app would start returning 500 errors, and the requests that got through started taking a very long time.
The reason was simple: the database would start queuing up the writes, and the queued requests would eventually time out.
While there probably wouldn't be 250 people all pressing submit at exactly the same moment in the audience, I didn't want to take the gamble on this one. So I went back to the drawing board.
If only I could do this all in memory, I thought, without filesystem writes. But then a Worker crash or restart would mean lost votes. The same problem arises if Cloudflare spins up multiple Workers to handle the load, which would probably happen here: each Worker would hold its own partial tally in memory.
Enter Durable Objects
Luckily the people at Cloudflare came up with Durable Objects. A Durable Object is essentially a single-threaded actor: one instance per object, globally, processing requests in order. No two workers ever touch the same state at the same time, so the write contention that was killing the database simply doesn't exist. Storage is attached but optional, and you decide when to flush. For the vote I kept the tally in memory and persisted every minute, or every 10 votes, whichever came first. If the worker hibernated or restarted, at most 10 votes would be lost. Negligible.
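To make the flush policy concrete, here is a minimal sketch of a tally that lives in memory and persists after 10 unflushed votes or 60 seconds, whichever comes first. It is not the code from the project: the class name VoteTally, the persist callback and the injectable clock are all assumptions made for illustration. In a real Durable Object the callback would wrap a write to the object's storage, and the time-based flush would need its own trigger (a timer or alarm) rather than relying on the next vote coming in.

```typescript
// Hypothetical sketch: in-memory tally with batched persistence.
// `persist` stands in for whatever writes a snapshot to durable storage.
type Snapshot = Record<string, number>;

class VoteTally {
  private counts: Snapshot = {};
  private unflushed = 0;
  private lastFlush: number;
  private persist: (snapshot: Snapshot) => void;
  private now: () => number;
  private maxUnflushed: number;
  private maxAgeMs: number;

  constructor(
    persist: (snapshot: Snapshot) => void,
    now: () => number = Date.now,
    maxUnflushed = 10,
    maxAgeMs = 60_000,
  ) {
    this.persist = persist;
    this.now = now;
    this.maxUnflushed = maxUnflushed;
    this.maxAgeMs = maxAgeMs;
    this.lastFlush = now();
  }

  // Count the vote in memory, then persist only if a threshold is reached.
  vote(option: string): void {
    this.counts[option] = (this.counts[option] ?? 0) + 1;
    this.unflushed += 1;
    this.flushIfDue();
  }

  // Returns true when a snapshot was written. Can also be called from a timer.
  flushIfDue(): boolean {
    const tooMany = this.unflushed >= this.maxUnflushed;
    const tooOld =
      this.unflushed > 0 && this.now() - this.lastFlush >= this.maxAgeMs;
    if (!tooMany && !tooOld) return false;
    this.persist({ ...this.counts });
    this.unflushed = 0;
    this.lastFlush = this.now();
    return true;
  }

  results(): Snapshot {
    return { ...this.counts };
  }
}
```

With these thresholds, a crash can only lose the votes received since the last snapshot, which is the trade-off described above.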
I ran the load test again: the average response time went down from 4 seconds to 150 ms, and the success rate went up from 26% to 100%.
I also ran a throughput benchmark, and it landed on 1500 requests per second for the Durable Objects approach vs 50 per second for D1, roughly 30 times more. (I guess the 1500 req/s was also limited by my ISP speed.)
Part of the project was also providing video output for the screens in the theatre, so the hosts could show the results on screen. To rule out any chance of writes to the votes Durable Object blocking updates to the screen state (switching between pages of the results, ...), I stored the screen state in a separate Durable Object.
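As a rough illustration of that split, the sketch below maps a request path to one of two Durable Object bindings. The binding names VOTES and SCREEN, the object names and the /screen path prefix are hypothetical, not taken from the project. A Worker could then look up the object with the namespace's idFromName and get methods and forward the request to it.

```typescript
// Hypothetical routing sketch: votes and screen state go to separate objects,
// so a burst of vote writes never queues in front of a screen update.
type Target = { binding: "VOTES" | "SCREEN"; name: string };

function targetFor(pathname: string): Target {
  if (pathname.startsWith("/screen")) {
    // Screen control: page switches, chart mode, and so on.
    return { binding: "SCREEN", name: "screen-state" };
  }
  // Everything else is treated as vote traffic in this sketch.
  return { binding: "VOTES", name: "tally" };
}
```

Since each Durable Object handles its own requests in order, keeping the two kinds of state in different objects means they never wait on each other.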
Extra safeguards, and a 9-year-old
The show ran three times. To strengthen it even further I added the option to restore backups from the previous days. Should the vote not work during the second or third run, we could've easily restored a prior vote (since results would've been similar). And to rule out the chance of the web UI crashing for whatever reason I also built a small TUI to control everything without a browser.
But even with all those safeguards in place you cannot rule out everything. During the second show I had the control panel open on my Mac so I could intervene quickly should something go wrong. I almost had a heart attack when my eldest daughter came down and said "I saw the Nerdland thing open on your computer. I clicked around and changed the charts to percentages". Fortunately, out of all the buttons she could've clicked, she picked the only one that didn't affect anything on the screens in the theatre.
Next time I'm also building in a kiddie lock.