Monitoring and Alerting

Monitoring your systems is important. It’s not just me that thinks so, as plenty of experienced DBAs and developers know the value of monitoring. Heck, most people have learned to build some sort of metric collection into their software. Azure makes it easy to instrument your application and gather lots of data on how well things are working. Perhaps too easy to gather too much data and then you pay for it, or can’t find time to analyze it. High performing software development shops use monitoring in their Continuous Integration (CI) and Continuous Delivery (CD) pipelines to better understand the health of their code and speed of their workflow, in addition to instrumenting the actual application.

For those of us that need to ensure our database servers are running well, we not only need monitoring, but also alerting. I ran across a couple articles that have thoughts about monitoring and the difference between monitoring and alerting. While I don’t completely agree with all the items in the second piece, I do think that it’s important that you get alerting working well.

I’ve had more than my share of un-actionable alerts, or even unnecessary alerts in my career. These days I’ve learned to better classify those items that matter to me. Most of the time what I find myself doing is downgrading most alerts because very few are actually mission critical. Far too often I’ve worried about 100% CPU or slow log writes or even zero sales in an hour or some other metric that “seems” critical. However, since few of these alerts stop business from flowing, I’ve learned to lower their priority or just remove them as alerts and allowing monitoring to track the values. I do need to watch the monitoring and fix issues, but I don’t need to get up at 3am.

The other thing I’ve worked to do is automate responses to problems. If I know there are ways a computer can respond, let it. Don’t get a human involved if the system can manage itself. Certainly the automated solutions don’t always work, but have some escalation built in that only alerts a human after the system has exhausted its own responses. After all, we don’t want to exhaust humans if we don’t need to do so.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.4MB) podcast or subscribe to the feed at iTunes and LibSyn.

Gauges and Alarms

I was looking through the news this week and found a piece by Michael J. Swart that I really enjoyed. It talks about different types of metrics that he has used when monitoring and alerting on his database servers. Michaal talks about the difference between alarms and gauges, which are usually base on the time scale in which a DBA or administrator needs to react to a situation. Alarms are created when something grows in a short term and causes an issue. The gauges tend to be longer term items that eventually cause a problem.

I’ve had a similar view in the past. However I look at these types of metrics from the standpoint of how I tackle my job as a DBA. I also try to think about these issues as a developer, ensuring that telemetry helps track issues that occur over time and acute issues cause a graceful degredation (or exit) and not an unexpected one.

When I’ve needed to manage systems, I ensure that I have alarms set up. I need these for the situations where something goes wrong in a short period of time. However I never want alarms to be based on a lack of visibility into the system over time. I want to try and eliminate as many alarms as possible with better long term monitoring that looks for indicators that the system might have issues. I’ve found that many (not all) of the alarms that I receive could have been prevented if I’d been paying close attention.

In other words, I’m a proactive DBA, looking at gauges (using MIchael’s terms), not a reactive one counting on alarms.

I can’t predict everything, but when I find that looking back at data is helpful. If I find a trend that could have alerted me, I’ll evolve my monitoring system to try and track those potential issues in the future. I’ve written my own custom metrics in the past, which track data, either PerfMon type counters or business data, and then let me know when the growth, decline, or change is unexpected. In this way, I tend to find I can often proactively prevent issues. If I can use my data to “predict” an issue my boss doesn’t believe in, I’ve also found my budget for resources may increase a bit, allowing me to manage systems even better in the future.

Steve Jones

Finding Production Drift

A few years ago when I started delivering presentations about automating database delivery with Grant Fritchey, I was surprised by something. We were rehearsing a talk and Grant mentioned that there were really only two places where we knew the state of our code: in a VCS and in production. I hadn’t really thought about that before, but realized it was true.

While the development environments are always in flux, by using a VCS, you can determine the state of your code at any point in time. However since changes can occur in production (DBAs adding indexes, changing security, etc), you can’t count on a release branch of some other designation as being a guarantee of the state of your live environment. Production is always its own version of the truth.

Ideally all of the changes being made on your production system are tested in development, QA, and pre-production environments before being deployed, but very few companies have a mature enough deployment pipeline to allow them to push hot fixes and patches rapidly enough, especially when they need to be made in an emergency.

More than likely some changes will always be made in production directly, especially small ones that affect data. You can’t prevent this, but you should be able to deal with the aftermath and ensure that your system doesn’t destabilize from the enhancements. Without a doubt DBAs should log any scripts they run, but it’s easy to forget things, especially when changes are made in firefighting mode or at 3am.

That’s why having some sort of monitoring system to watch production for any changes becomes an important logging mechanism for DBAs. Whether you build one yourself, or buy some software, you use the native tools or bolt something onto SQL Server, make sure that you put something in place. Work to push your ad hoc changes back to development, and ensure they get into a VCS at some point.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.5MB) podcast or subscribe to the feed at iTunes and LibSyn.


How often have you been caught off guard by some issue with one of your databases? I would guess most people have received a phone call, email, or trouble ticket that alerts them to the fact that something has gone wrong. A few of you might have even had an irate customer or manager wander over to your desk and loudly inform you of an issue. Those are not great times, and most of us would prefer to find out about problems before others bring them to our attention.

This past week was the 66th T-SQL Tuesday blog party. If you haven’t heard of it, I would encourage you to look each month for the announcement on Twitter and participate. It’s a great chance to share some knowledge, and showcase your own experiences or viewpoint on a particular topic. I’d actually encourage all of you to go back and write posts about the past topics at some point. It’s a great way to challenge yourself to document your own experiences in a concrete way.

The theme this month was monitoring (thanks to Catherine Wilhelmsen b | t), and this is exactly the way that you build awareness in your environment. Monitoring allows you to ensure that you, as the DBA, learn about issues, or potential issues, first. Monitoring doesn’t solve your problems; that’s up to you to do, but it can help you proactively detect issues early and perhaps mitigage them before users notice them.

The blog entries this month cover all sorts of issues that you might think about watching for. The 825 errors in your log, metadata about your backups, jobs, and more. There’s a variety of posts that cover all sorts of home grown systems, as well as noting that third party products work well. You have a huge variety of choices, from building your own tools and processes or purchasing something like SQL Monitor from Red Gate. However no matter what your budget, resources, time, desire, and more, I’d highly recommend that you really think about ensuring that you are monitoring your systems in a way that lets you know how things are working.

What to do when they’re working poorly, that’s up to you. Hopefully the information we publish at SQLServerCentral, and the links we send out every week from Database Weekly, can help you learn to be better at your job.

T-SQL Tuesday #66 – Monitoring SQL Server

tsqltuesdayThis is a T-SQL Tuesday that not only makes me smile, but makes my boss happy. In fact, I’m sure that quite a few people at Redgate Software, my employer, are thrilled with this topic since we have a product that helps you here: SQL Monitor.

T-SQL Tuesday #66 has the theme of monitoring, hosted by Catherine Wilhelmsen. Anything having to do with monitoring is fair game. Custom scripts, home grown applications, even third party software.

If you’re a #SQLNewBlogger, this is a great chance to keep going with a new post. If you can’t write now, come back and publish something later. If you post something on May 12, 2015, you’ll get included in Catherine’s roundup. However if you can’t make it, you’re still a T-SQL Tuesday writer.

Either way, follow #tsql2sday and learn a few things.

Self Monitoring

My post is going to be more of a philosophical one, with a little code.

Most of the time I’ve worked with SQL Server, I’d had some responsibility for a production instance. Even when I was writing code and building software, I’d still find myself as the backup, if not the primary, accidental DBA. I learned early on that I needed to have some way to track what was happening on an instance.

One of the ways in which I did this was by ensuring each instance could monitor itself. I’d found far too many times that having a process connect remotely to the instance and gather data was a way to lose data. Too often something would get missed. The exact time that a process connected was the time things didn’t work.

As a result, I decided to keep some monitoring on the local instance. Even though I usually had a central server roll up information, if that instance dropped off the network (or my monitoring server died), I’d still know what was going on. This was especially important when I worked in the financial industry or was bound by ISO 9000 standards.

Note: These days I don’t bother to manage monitoring software of my own on an instance. I use SQL Monitor because I work for Red Gate, but no matter who my employer might be, I’d buy software because it’s worth the cost. I don’t have the time to worry about maintaining my own software for most monitoring tasks.


I had a setup routine that I used to use on every instance. It did a few things, the first of which was create a DBAAdmin database. This was the place I could put anything I needed to monitor my instance. I tried to keep all the code the same between instances, and tried to write idempotent code so that re-running the install wouldn’t cause issues.

Once I had the database, I’d add tables and procedures for various purposes. For example, one of the things that needed to be tracked each day was the backups for the databases. So I created a table called DBBackups.

CREATE TABLE DBBackups ( BackupID INT IDENTITY(1,1) , Backupdate DATETIME , DatabaseName VARCHAR(2000) , BackupType CHAR(1) , Filename VARCHAR(2000) , sizeGB NUMERIC(15,2) ) ;

This table had a corresponding procedure that was designed to scan the filesystem and report back on the latest backups written to the file system. The details of each file are recorded here.

Why? If there’s a problem with monitoring, the local machine still has the information about backups. What’s more, I can usually get more details here than I’m capturing in a monitoring system that’s looking at the last backup date or just getting full backups. If msdb history is wiped, I have a second copy here.

However I also need to remove data over time, so this solution usually has three parts.

  • Table for data storage
  • Procedure(s) for gathering data
  • Procedure and job to remove data older than xx days

That’s simple, but it’s a decent amount of work. However I only use this for certain areas. I used to really care about backups, but SQL Monitor captures that for me. However I might care about sp_configure settings. While SQL Monitor can alert me if there are changes, how do I know what changed? I’ve used a process like this to keep the last week’s worth of sp_configure information, captured every day to let me look back.

As I noted above, I wouldn’t rebuild a monitoring solution for overall checking of a database, but I might capture specific information using this process, with the table and procedures tailored to the specific information I care about.

Monitoring After Deployment

Deployments of database and application changes have tended to be stressful in my career. I’ve seen the actual deployment of changes fail, installation programs crash, the startup of an application error out, or the famous blue screen of death from a .DLL change. I’ve been most wary, and worried, when the changes were out of my control, and I had no chance to review things.

For most database upgrades, however, I’ve been able to examine scripts. Even when an installation program is making changes, I could run a test on a separate instance and use Trace to capture the SQL that would be executed. However even when the actual installation succeeds, what I’m often most worried about is a logic or performance problem that won’t reveal itself for hours, often not until the next business day when the system is under full load.

This week I wanted to know what formal (or informal) checks people might have to detect issues. I’d like to know if you are you actually monitoring systems in a scientific way that allows you to determine if some database deployment has gone awry.

What Monitoring process do you use to determine if a deployment is working?

One of the customers for Red Gate software noted recently that they have a series of metrics being captured on their systems with SQL Monitor. They can compare the performance of two time periods, say the 4 hours before a deployment with the 4 hours after (or 4 hours the next day) to determine if any issues have surfaced. That’s similar to what I’ve done in the past for performance using other tools, or my own metrics. I’ve often kept baselines of common queries run during the day, and I can compare that to post-deployment metrics to uncover potential issues.

Logical errors are more problematic, and I’ve often wanted to be sure I had a couple of backups stored, and at least one backup on a warm standby in case we find information is being incorrectly changed or calculated. Those are the worst errors, and since they is no way to determine how they might surface, I have usually ensured myself and other DBA staff were on standby to perform some manual ETL to move data around between databases in a hurry.

Monitoring is an important part of being a DBA. It’s not as critical as backups or security, but it’s not far behind in letting you know when something is not right. Let us know this week what tricks you might have to ease the worries you have after database changes.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.8MB) podcast or subscribe to the feed at iTunes and LibSyn. feed

Avoiding a DBA’s Worst Days with Monitoring

SQL in the City Abstract: When things go wrong with a database, it can be the start of the worst day of a DBA’s life. Join Steve Jones as he examines the problems uncovered by The DBA Team and how you can prevent them with proactive monitoring and in-depth knowledge of SQL Server.

General Abstract: A DBA usually has a bad day because they are unprepared for issues that commonly occur or unaware of situations that can cause problems. Learn about the five things Steve Jones finds to be most important for DBAs and how you can be ready to handle issues in these areas:

  • Backups
  • Space
  • Security
  • Resources
  • Deployment

This session does include Red Gate tools but explains how issues can be avoided with your own utilities.

Slides: DBAsWorstDays.pptx

Placeholder resources:

Custom Metrics used in the demo:

The DBA Team #1–Code and Slides

Our first DBA Team event, in Richmond, VA just before SQL Saturday #187 went well. Overall I think our experiment was a success and we’re already talking about where and when we might do this again.

In the meantime, we didn’t make a separate site for this series of events, being an experiment and all. I’m adding this post as a placeholder for the various slide decks and code.

Core Monitoring for SQL Server (Steve Jones)

Good DBAs ensure that they are always aware of the state of their instances. All systems should have monitoring in place, not just so you know when things go wrong, but so you understand what a normal workload looks like and can plan for the future. This session will cover the basics of monitoring a SQL Server system and the various metrics you should be tracking.

Getting Started with SQL Server Backup (Grant Fritchey)

Backups are fundamental to protecting the investment your business has in its data and they’re the foundation of disaster recovery planning. We’ll go over best practices for database backups, to ensure you’re establishing that foundation correctly within your systems. This introductory level session covers full, log, and differential backups, as well as restores and restores to a point in time. Come along to be sure you’ve got the right protection in place for your systems.

Understanding Database Corruption (Grant Fritchey)

A DBA’s primary purpose is to ensure that the information in their charge is accessible by the correct people within their organization. Despite everything you do to make sure you’ve got your servers configured, monitored, and tuned, with well-tested backups in place, you can still lose data through corruption. But what is corruption in a database? This session lays out exactly where database corruption can come from, how to find out exactly where the corruption is within a database, and  the methods you have available to recover from database corruption.

Indexing for SQL Server (Steve Jones)

Indexes are important for improving the performance of your queries, but they add overhead to your server and require maintenance. This session examines how indexes work and the basic maintenance that you should perform to ensure your system is running at its peak level.

Monitoring Is Essential

Watch your systems like a hawk. You’ll be glad you did.

An old phrase used in business is that knowledge is power. When you understand more about a particular situation, you can develop solutions and better understand which ones would both suit your environment better and solve any issues faster.

As someone that has worked in a variety of production environments managing all different types of platforms, I have found it extremely career-enhancing to have knowledge about the entire environment. Knowing what is happening at any point in time has allowed me to answer pointed questions about the applications. With comprehensive monitoring set up in database servers, I have often found myself in a position to proactively prevent issues, or make preparations for a quick response when I do find problems.

Over the years I developed a variety of techniques to monitor my own servers, often using third party solutions, enhanced with my own queries. I have learned that each system has its own idiosyncrasies, which often require custom queries to keep an eye on the pulse of the system. I’ve also learned that a regular exception report is the most valuable tool for me as a DBA. I assume most of my applications and scheduled tasks are working well; I just want a list of those things that have broken, or are not performing as expected sent to me every day, using the data from my various monitoring tools.

I have found that tracking metrics and analyzing the data is a valuable tool in my DBA tool belt. I would go so far as to say that monitoring the systems that you work with is not only a best practice, it’s essential if you want to be an exceptional DBA.

Steve Jones

The Voice of the DBA Podcasts

We publish three versions of the podcast each day for you to enjoy.

Little Devices

It seems that I get more and more done with my smartphone all the time. I find myself responding to lots of emails, having text or Twitter conversations, taking pictures, reading, even making notes on my iPhone. I can’t imagine every going back to a non-smartphone device. Whether it’s an iPhone, Android, or WinMo 7 device, I have found that they make me quite productive throughout the day.

I was just starting to get away from plain phones when I last worked for a large corporation. These days, I see many of my friends sporting smart phones, and getting work done on them. I even have one Windows administrator friend that carries an iPad around work, RDP’ing into servers to handle small tasks.

So I wonder how many database people can get things done. Since it’s Friday, I thought that it would make an interesting poll:

How much of your work could you get done from a mobile device.

I’m wondering if there is a chance that you can see a good portion of your administration, monitoring, or even daily work done from some device that isn’t a full size laptop? It could be a smartphone (Android, iOS, WinMo) or even a tablet (iPad/Galaxy) that allows you to actually knock things off your to-do list when you are away from your desk or don’t have a laptop.

If you’ve never used a mobile device for doing work, please don’t complain about them. They are great devices and they’ve made me very productive when I can’t sit at a desk or carry a laptop. I think many others agree, and I’m interested in seeing how use mobile devices.

Steve Jones

If you’re a DBA, you might want to check out SQL Monitor, from my employer, Red Gate Software. Maybe your boss will let you get a copy and an iPad to manage your servers.