Fix Failures Fast
What makes PCs fail? Microsoft Research recently published a paper that looks at PC failures. The research spanned a million consumer PCs and examined failure rates and the suspected causes of those failures. The data came from the Windows Error Reporting system, which runs after a crash in Windows, so it doesn’t include other operating systems. It also covers only fatal errors, those resulting in system crashes, and doesn’t include non-fatal errors.
It’s an interesting read, though a little dry since it’s an academic-style paper. It does have a few highlights that caught my eye. The first is that when a failure occurs, the chance of another failure occurring soon is high. That means a disk failure predicts that another failure is likely in the next 30 days. Server machines might be different, but those of you with consumer-grade hardware might think about an immediate backup and the purchase of a replacement hard disk if you have a fatal disk error.
Another highlight is that over-clocking can dramatically increase CPU failure rates. It isn’t completely clear whether this means a shorter lifetime, but if you’re like me, errors that crash systems are incredibly annoying and time consuming, so I’m not sure I’d ever want an over-clocked system. Another interesting point was that CPUs and RAM in white box (non-OEM) machines were less reliable than those in brand name systems. I struggle to see how a CPU would be less reliable, but that was part of the analysis. I’m not sure that would stop me from purchasing my own parts in the future, but it is something to keep in mind. Perhaps you should burn in the machine as soon as you can, so that you can return parts if you have issues.
Laptops were more reliable than desktops, which was surprising, and also comforting since most IT pros I know use laptops. People tend to keep old memory around, and it can wear out over time, so be careful about moving memory from machine to machine. There weren’t a tremendous number of surprises in the study, but it did make me appreciate my long-running desktop that rarely reboots, and even more rarely crashes. I was pleased to see such a detailed analysis being undertaken by Microsoft Research using data from actual Windows systems. Hopefully there are engineers looking at the data and improving the OS as well.
The research paper for this editorial was highlighted in the Brent Ozar, PLF newsletter.