I’ve never felt that zero downtime is possible for a system over any length of time. There is no shortage of companies that seek to prove me wrong, and some are doing very well. For example, when was the last time the Google search engine was down? It happens, though I haven’t seen it in a long time. Many of the highly available applications out there are built as distributed applications across many machines, so that even when failures occur, they don’t cascade and interrupt users.
I was thinking about this when I saw a post asking database engineers to architect for zero downtime in 2016. That’s a good goal, and if you work at a high-profile retailer or service company, you should certainly look for ways to improve availability.
In fact, I would guess that anyone struggling with the load from specific events, like Black Friday, works on this problem constantly. I remember Michelle Ufford speaking years ago about the challenges GoDaddy faced during the Super Bowl because of the advertisements the company ran. Their log files couldn’t catch up to the load for hours, and the team spent an entire year building a better database system.
Ultimately, I think the best way to handle large spikes of activity is to build an application that keeps those spikes from reaching your database. Use messaging and queues to buffer traffic. Use read-only copies of your database for traffic that doesn’t need to write to the main database. Anything that limits the load on your system helps keep the database from becoming a bottleneck.
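To illustrate the queue idea, here is a minimal Python simulation, not a production design. It assumes a hypothetical database that can absorb a fixed number of writes per tick (`db_capacity`) and a burst of arriving requests (`arrivals`). The application enqueues every request, and a worker drains the queue at no more than the database’s capacity, so the spike turns into a temporary backlog instead of a flood of writes. The function name `level_load` and the tick-based model are illustrative assumptions.

```python
from collections import deque
from typing import List, Tuple


def level_load(arrivals: List[int], db_capacity: int) -> Tuple[List[int], List[int]]:
    """Simulate a queue that buffers bursty writes in front of a database.

    arrivals[t] is the number of write requests arriving at tick t (hypothetical units).
    db_capacity is the most writes the (hypothetical) database handles per tick.
    Returns (writes sent to the database per tick, queue backlog after each tick).
    Keeps ticking after the last arrival until the backlog is fully drained.
    """
    if db_capacity <= 0:
        raise ValueError("db_capacity must be positive")

    queue: deque = deque()  # pending requests, oldest first
    writes: List[int] = []
    backlog: List[int] = []
    next_id = 0
    tick = 0

    while tick < len(arrivals) or queue:
        # Enqueue whatever arrives this tick (zero once the input is exhausted).
        incoming = arrivals[tick] if tick < len(arrivals) else 0
        for _ in range(incoming):
            queue.append(next_id)
            next_id += 1

        # The worker never sends more than db_capacity writes per tick.
        done = 0
        while queue and done < db_capacity:
            queue.popleft()
            done += 1

        writes.append(done)
        backlog.append(len(queue))
        tick += 1

    return writes, backlog


if __name__ == "__main__":
    # A spike of 10 requests at tick 1 against a database that handles 4 per tick.
    print(level_load([2, 10, 3, 0], 4))
```

With that input the database sees at most 4 writes per tick, and the backlog peaks and then drains to zero, so every request is eventually written. The trade-off is latency: requests that arrive during the spike wait in the queue, which is acceptable only for work that doesn’t need an immediate answer.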
Perhaps more importantly, if you can spread the load, you don’t need to keep purchasing more hardware. If nothing else, I think this is a good argument for better database architectures in applications. However, if you’re like me, most of your systems will have to make do with the hardware that is available. In that case, the best you can do is write better T-SQL and ensure you have given SQL Server enough indexes, but not too many.