Monitoring your systems is important. It’s not just me that thinks so, as plenty of experienced DBAs and developers know the value of monitoring. Heck, most people have learned to build some sort of metric collection into their software. Azure makes it easy to instrument your application and gather lots of data on how well things are working. Perhaps too easy to gather too much data and then you pay for it, or can’t find time to analyze it. High performing software development shops use monitoring in their Continuous Integration (CI) and Continuous Delivery (CD) pipelines to better understand the health of their code and speed of their workflow, in addition to instrumenting the actual application.
For those of us that need to ensure our database servers are running well, we not only need monitoring, but also alerting. I ran across a couple articles that have thoughts about monitoring and the difference between monitoring and alerting. While I don’t completely agree with all the items in the second piece, I do think that it’s important that you get alerting working well.
I’ve had more than my share of un-actionable alerts, or even unnecessary alerts in my career. These days I’ve learned to better classify those items that matter to me. Most of the time what I find myself doing is downgrading most alerts because very few are actually mission critical. Far too often I’ve worried about 100% CPU or slow log writes or even zero sales in an hour or some other metric that “seems” critical. However, since few of these alerts stop business from flowing, I’ve learned to lower their priority or just remove them as alerts and allowing monitoring to track the values. I do need to watch the monitoring and fix issues, but I don’t need to get up at 3am.
The other thing I’ve worked to do is automate responses to problems. If I know there are ways a computer can respond, let it. Don’t get a human involved if the system can manage itself. Certainly the automated solutions don’t always work, but have some escalation built in that only alerts a human after the system has exhausted its own responses. After all, we don’t want to exhaust humans if we don’t need to do so.