The Day 300 Firefighters Watched My Architecture Burn

In 2012, at EVALS, I had a major problem on day one of our launch.

"The wheels are falling off the go-cart!!"

We were on site at a major fire academy. 300 users on 300 iPads, recording videos, photos, and notes in real time. All of it flowing to the server for processing and upload into S3.

Except it wasn't flowing anywhere... It was disappearing.

I had built a stateful architecture and completely misconfigured our AWS load balancer. Every user session spun up its own EC2 instance, and when the user logged out, the instance spun down and the unprocessed data went with it - into a black hole.

We lost thousands of files. Client media. Gone. Not corrupted, not misplaced. Just evaporated into infrastructure I had configured to forget.

Within a day, I refactored. Our iPad app started asking the backend API for presigned S3 URLs, so that media could be transferred directly from the device to S3. I built an offline queue for connectivity drops and also fixed the load balancer before it turned a bad day into another $10,000 invoice.
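For anyone who wants to picture that flow, here is a minimal sketch in Python. It is not the code we shipped at EVALS. The names `OfflineUploadQueue`, `request_presigned_url`, and `put_to_url` are hypothetical, and the two callables stand in for "ask the backend API for a presigned URL" and "PUT the bytes straight to S3". The assumption is that both raise `ConnectionError` when the device is offline.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque


@dataclass
class PendingUpload:
    filename: str
    data: bytes
    attempts: int = 0


@dataclass
class OfflineUploadQueue:
    # Hypothetical: asks the backend API for a presigned URL for this file.
    request_presigned_url: Callable[[str], str]
    # Hypothetical: uploads the bytes directly to that URL.
    put_to_url: Callable[[str, bytes], None]
    pending: Deque[PendingUpload] = field(default_factory=deque)

    def enqueue(self, filename: str, data: bytes) -> None:
        """Record the file locally first; nothing leaves the queue until it is uploaded."""
        self.pending.append(PendingUpload(filename, data))

    def flush(self) -> int:
        """Upload queued files in order and return how many succeeded.

        Stops at the first connectivity failure and leaves the remaining
        files queued for the next call.
        """
        uploaded = 0
        while self.pending:
            item = self.pending[0]
            try:
                url = self.request_presigned_url(item.filename)
                self.put_to_url(url, item.data)
            except ConnectionError:
                item.attempts += 1
                break
            # Only drop the file once the upload call has returned without error.
            self.pending.popleft()
            uploaded += 1
        return uploaded
```

The point of the design is that the server never holds the media: the device keeps each file until the upload call comes back clean. This sketch keeps the queue in memory to stay short. A real device app would presumably persist it to disk so a crash or restart does not lose anything.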

Within a week, we had completely replaced a core system while clients were actively using the product. Surgery on a moving patient. Not recommended, but sometimes necessary.

Stateless architecture exists because servers should not be trusted with memory. Neither should founders who think they'll fix it after launch.

❤️
Jake
