Data Loss or Downtime

This editorial was originally published on Jan 7, 2011. It is being re-run as Steve is on vacation.

I was watching Kimberly Tripp of SQL Skills talk recently about VLDB disasters and how to recover from them. One of the first things she said in the session was that getting a damaged database back online, even without all of the data, was important. Her clients often need to keep working, and bringing the system back up quickly, even with some data missing, lets business applications and business people get back to work.

That is interesting. I had always thought of my production OLTP databases as needing to be online, but also needing all the critical data. After Mrs. Tripp’s talk, I had to rethink that a bit and consider that a little data loss might be acceptable.

To me this is a topic that is worth understanding. At the very least, it will help you decide how to proceed in the event of a disaster. So for this Friday’s poll:

What is more important to you: downtime or data loss?

My feeling is that most people would rather have the database online, even without all the data, so they can continue to work. I realized that most of the time, getting the site back up with lookup and other ancillary data (like products, prices, etc.) in place was the most important thing. Recovering other data, such as older orders, was secondary.

Once the database is up, you can then work on getting other data back and merging it into the production system.

Let us know what you think this Friday and what’s more important to your business (and why).

Steve Jones

Two Days Off

I almost couldn’t believe this when I saw the article. The Verizon Cloud is shutting down for 48 hours. Apparently they have maintenance scheduled for this weekend and notified their customers that their virtual machines will be shut down early Saturday morning. There are some legacy Verizon cloud-type services that will be available, but the platform they’ve been pushing to customers will be down.

This isn’t good news for Verizon or their customers, but it also doesn’t help the cloud overall as a service. This outage reinforces the idea that a vendor’s reliability isn’t necessarily any better than what an individual organization can manage on its own. If costs for the cloud are anywhere near those of on-premises hosting, this event would certainly make me think twice about moving anything really important in my organization to a single cloud vendor.

I suspect that most cloud vendors have outages like this, but they don’t shut down their entire clouds. When a large amount of maintenance is needed at a data center, most vendors would migrate customers to a separate data center or another part of their cloud while they perform their work. Either this maintenance is a major change to Verizon’s entire infrastructure that can’t be staged on just a part of their system, or they’ve poorly planned their architecture and maintenance.

Either way, this weekend will certainly be a good DR test for enterprises that might have important applications hosted with Verizon. It will also be a chance to test how these clients notify their own customers of potential issues or how they respond to problems. I wouldn’t want to have an application hosted with Verizon as it would be a lot of work for me and likely a weekend away from family, but I know I’d learn a lot about how well I’ve prepared my own systems for fault tolerance.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.2MB) podcast or subscribe to the feed at iTunes and LibSyn.

Test Your Situation

I gave a talk on Transparent Data Encryption (TDE) recently and a number of people in the audience were using the feature. However, when I asked how many of them had restored a TDE database, not all hands remained up. When I asked how many people had restored their TDE-encrypted backup to a different server, one that didn’t have TDE enabled, very few hands remained in the air.

That’s not good, and I certainly hope those people don’t experience a disaster from which they cannot recover. I’m sure they are not alone. I suspect that many of the people managing a TDE database have restored a database this year and are confident they can do so. However, what they don’t know is whether they can restore those TDE databases on any other instance, including a newly installed one.

They’re not alone. I see many, many people implement features they don’t really understand. Microsoft has made it easy to set up replication, clustering, and more in your environment, but without providing some of the robustness and reliability that many people need. The ease of setting up a feature is one thing. The ease of ongoing management and recovery when issues occur is something else entirely.

I really wish that Microsoft would go further than making implementation easier and include direction for ongoing tasks. When databases are created, ask the user to set up backups and help them create the jobs. When encryption is implemented, do more than display a warning message. Help administrators prepare for recovery with templates or jobs that automatically build certificate or key backups. When replication is set up, include a script to rebuild the environment for when it breaks.
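For TDE specifically, that preparation is small. A minimal sketch, with placeholder certificate, database, file, and password names, backs up the certificate protecting the database encryption key and restores it on the target instance before attempting the database restore:

-- On the source server: back up the certificate that protects the database encryption key.
-- Certificate name, paths, and passwords are placeholders.
USE master;
BACKUP CERTIFICATE TDECert
    TO FILE = 'C:\Backups\TDECert.cer'
    WITH PRIVATE KEY (FILE = 'C:\Backups\TDECert.pvk',
                      ENCRYPTION BY PASSWORD = 'Str0ng_P@ssw0rd!');

-- On the target server: create a master key (if one does not exist) and restore the
-- certificate before attempting to restore the TDE-encrypted backup.
USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'An0ther_Str0ng_P@ssw0rd!';
CREATE CERTIFICATE TDECert
    FROM FILE = 'C:\Backups\TDECert.cer'
    WITH PRIVATE KEY (FILE = 'C:\Backups\TDECert.pvk',
                      DECRYPTION BY PASSWORD = 'Str0ng_P@ssw0rd!');

-- Only now will the restore of the encrypted database succeed.
RESTORE DATABASE EncryptedDB
    FROM DISK = 'C:\Backups\EncryptedDB.bak'
    WITH MOVE 'EncryptedDB' TO 'D:\Data\EncryptedDB.mdf',
         MOVE 'EncryptedDB_log' TO 'D:\Data\EncryptedDB_log.ldf';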

I doubt we’ll get this, and many companies and employees will continue to implement features they don’t understand. You can only help your own situation, and you should be ensuring you understand and can rebuild all the extra features you’ve installed in the event of a disaster.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.3MB) podcast or subscribe to the feed at iTunes and LibSyn.

Drop and Recreate

Those of you that manage replicated environments have learned to have one thing handy: a script that recreates your replicated publishers, distributors, and subscribers. I was reminded of my past needs for those scripts recently when I saw this post on dropping and recreating all the synonyms for a database.

It’s easy to depend on backup and restore to recover from issues, but how often do you face problems with an environment that aren’t related to data? If you lose a stored procedure, or have a problem with the configuration of jobs or principals, can you easily drop and recreate the object? That code is usually tiny, but if the only copy you have is in a backup file, you have to restore a lot of data just to get some code.
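For something like the synonyms example above, the script that regenerates the definitions is tiny. A rough sketch using only the sys.synonyms catalog view builds the DROP and CREATE statements you can save off long before you need them:

-- Build DROP and CREATE statements for every synonym in the current database,
-- so the definitions can be saved and replayed without a full restore.
SELECT 'DROP SYNONYM ' + QUOTENAME(SCHEMA_NAME(schema_id)) + '.' + QUOTENAME(name) + ';' AS DropStatement,
       'CREATE SYNONYM ' + QUOTENAME(SCHEMA_NAME(schema_id)) + '.' + QUOTENAME(name)
           + ' FOR ' + base_object_name + ';' AS CreateStatement
FROM sys.synonyms
ORDER BY SCHEMA_NAME(schema_id), name;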

Certainly everyone should keep all of this in version control, and I’d encourage you to be sure that not only development code (tables, views, stored procedures, etc.) is kept in there, but also configuration settings, jobs, roles, and the various other things a DBA is responsible for in a production environment.

However, I’d also go one step further and ensure that you have scripts to recreate all aspects of your environment if you need to. Many of the comparison tools will let you store a snapshot of database schema items in a folder and then easily help you recreate a script if needed. That covers the database, but for the instance-level items, you need to be sure you have an easy way to check out a copy from version control (you are using version control now, right?) and execute the scripts on a SQL Server. You can use T-SQL, PoSH, or even VBScript, but be sure you have the code handy.

Do you?

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (1.8MB) podcast or subscribe to the feed at iTunes and LibSyn.

A SQLServerCentral DR Event

We had a disaster at SQLServerCentral this past weekend. It wasn’t a big disaster, but it was an event that required a restore of data.

An Administrative Error

On Saturday, I was attending SQL Saturday #331 in Denver. In between my sessions, I was prepping a few demos and finished getting ready earlier than expected. Since I had a few minutes, I checked my email and immediately knew we had an issue on the site.

We’ve been fighting spam for months, slowly tweaking our posting system in the forums. With the start of the American football season, we’ve been getting hundreds of posts every Friday and Saturday. I’ve tried to ensure these posts are removed before our newsletters are generated so that they aren’t filled with advertisements, but it’s been a chore.

One of the things I can do in the forums is select a series of posts and mass delete (or open, close, hide, etc.) the group. Most of the SPAM posts we receive occur within the same few minutes and are grouped together. I’ve gotten into the habit of deleting these batches of posts, watching for a legitimate SQL Server post at the end.

However, on this day, one of our regular threads was buried in the middle of all the SPAM posts. This was THE Thread, the most active and longest-lived discussion on the site, with 45k posts. I inadvertently deleted it along with the SPAM and went on to give my presentation.

Afterwards, I got a private message from the site, telling me the discussion had been deleted around 2:00pm MST.

Quick Reaction

The first thing I did when realizing what had happened was connect to our production database cluster through VPN. When I opened Management Studio, I ran a few queries to verify the discussion had been deleted, and not just “marked for deletion”. Logical deletes exist in many applications, and if this is the case in your own disaster, the last thing you want to do is initiate a database restore.
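A quick check along these lines, with hypothetical table and column names, is enough to tell whether rows are physically gone or merely flagged:

-- If FlaggedDeleted accounts for the missing rows, a flag update is the fix, not a restore.
-- Table, column, and key values here are hypothetical.
SELECT COUNT(*) AS TotalRows,
       SUM(CASE WHEN IsDeleted = 1 THEN 1 ELSE 0 END) AS FlaggedDeleted
FROM dbo.ForumMessages
WHERE ThreadID = 12345;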

In this case, the data was gone, so I immediately tried to initiate a restore. Since over an hour had passed, I didn’t want to restore over top of the current database. Instead, I wanted to restore a copy as a new database, as of 1:45pm or so.

I selected the proper options, marking the full backup from overnight and the log backups throughout the day. I didn’t have time to worry about using STOPAT and trying to get close to the actual time of data modification, so I chose the last backup I knew would be good. Verifying the database name was a new name, I clicked OK.

And nothing happened.

Actually, that’s not true, as I got an error. The backup system in use by Rackspace, our provider, doesn’t keep the files available from SSMS. I don’t have rights to work within the restore system, or even to request a restore from Rackspace directly, so I opened a ticket with Red Gate’s support for a restore.
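For reference, a scripted equivalent of the restore being attempted might look something like this. File paths, logical names, and the timestamp are placeholders, and STOPAT is shown only because it is an option in this situation, not because it was used:

-- Restore a copy alongside production rather than over it.
RESTORE DATABASE SSCForumsOld
    FROM DISK = 'E:\Backups\SSCForums_Full.bak'
    WITH MOVE 'SSCForums' TO 'E:\Data\SSCForumsOld.mdf',
         MOVE 'SSCForums_log' TO 'E:\Data\SSCForumsOld_log.ldf',
         NORECOVERY;

-- Apply the last known good log backup, optionally stopping at a point in time.
RESTORE LOG SSCForumsOld
    FROM DISK = 'E:\Backups\SSCForums_Log_1330.trn'
    WITH STOPAT = '2014-09-20T13:45:00',
         RECOVERY;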

Had this been a situation where the site was down or users were unable to read articles or post, I would have escalated this for immediate action. However since this was a restore of a single thread, and one that exists for entertainment more than education, I chose not to bother our IT staff on a Saturday night or Sunday.

The Fix

When I woke up Monday morning, I had a message that the restore had been completed to my specified new database (SSCForumsOld) as of 1:30pm MST on Saturday. I hadn’t asked anyone to do more than this, so this was the extent of actions taken by Red Gate.

Again, I could have specified actions in more detail, but rather than try to explain to someone in email which thread, and which posts needed to be restored, I decided to handle this myself. After taking my children to school, I sat down and got to work.

I’ve known the PK of this particular thread since I’ve had to work with it in the past. Connecting from Management Studio to the production instance, I verified I could see the 45k messages in the SSCForumsOld database. I ran the same query on the SQLServerCentralForums database and validated the data was still missing. I then built a query that would perform an INSERT..SELECT of the parent posts from the restored database to the production system. This took longer than expected, with the table holding a number of locks for about 2 minutes. However, the post details had been moved.

That left me with the need to move the actual words of each post, which are stored in a separate table (for some strange reason). Rather than lock up the forums for minutes, I spent time rewriting my next insert to use batches of 1000, and only insert those messages which had not already been moved. Since I could join on PKs, this went quickly, in a few seconds. I next changed my batch size to 5000, and this completed in about 15s.
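A sketch of that pattern, with hypothetical database, table, and column names, looks something like this, rerun until it affects no rows:

-- Copy the next batch of message bodies across, skipping rows that already exist in production.
-- Database, table, and column names are hypothetical; adjust the TOP value for batch size.
INSERT INTO SQLServerCentralForums.dbo.MessageBody (MessageID, Body)
SELECT TOP (5000) o.MessageID, o.Body
FROM SSCForumsOld.dbo.MessageBody AS o
WHERE NOT EXISTS (SELECT 1
                  FROM SQLServerCentralForums.dbo.MessageBody AS p
                  WHERE p.MessageID = o.MessageID);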

That seemed like a reasonable balance of speed and batch size, so I manually executed this 9 times to move all the data. A quick check on the site showed THE Thread was back, and I posted a few notes to let users know.


There are still a few issues with incorrect points for users who have posted to this topic, but that is a lower-priority item and I am letting our developers look at it. There is at least one known bug with points, and it’s possible we have another here.

My personal lesson learned was that I need to move a little more slowly when removing SPAM. I also don’t want to trust myself to do it regularly, so I spent part of Sunday morning writing a little code and scheduling a job to delete posts with certain patterns of titles that the spammers use. I tried to limit it to those obvious subjects so that no legitimate posts are removed.
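A sketch of that kind of cleanup, with illustrative table, column, and pattern names only, might be as simple as:

-- Delete recent posts whose subjects match obvious SPAM patterns.
-- Table, columns, and LIKE patterns are illustrative only.
DELETE FROM dbo.ForumPosts
WHERE PostDate > DATEADD(HOUR, -1, GETDATE())
  AND (Subject LIKE '%watch%live%stream%'
       OR Subject LIKE '%vs%live%online%');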

We are also escalating some of the issues with SPAM, and with the return of my manager from sabbatical, I’m hoping we can build a few more filters to limit the disruptions in the forums.

And of course, everyone that posts to THE Thread has my apologies for the mistake.

Reverse Engineering Disasters

If you are responsible for managing systems, you should have some sort of disaster recovery plan. Even if you are only managing the one system you carry with you on a regular basis, you should ask the question: What would it take to destroy this data?

It’s a good question, and in the post I’ve linked above, a technologist talks about some of the failings of disaster recovery plans that come from forward engineering: people think of specific problems and try to prevent them. They don’t reverse engineer to find out what events would cause them to lose their system.

That’s really the key to a lot of design and architecture in computer science. You can’t think about just what you expect to happen, or even what you want to happen. You have to consider the other ways in which events could occur or the issues that could cause a problem in your system. In development, the things you expect are considered to be the happy path. A good architect or engineer will think about everything else in addition to the happy path.

I sometimes think that far too many decisions are made while considering the happy path, but ignoring or discounting the other paths available. These other paths may be unlikely to occur, but having worked with computers for over 25 years, I can tell you that I often find the least likely events occurring far too frequently.

As the author says, you should never say “I never even thought of that happening.” Consider reverse engineering problematic situations and then make a judgment of how likely it is that any of those events will occur. That way you have at least some idea of what could go wrong and which events you are willing to protect against.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.1MB) podcast or subscribe to the feed at iTunes and LibSyn.

The Voice of the DBA podcast features music by Everyday Jones. No relation, but I stumbled on to them and really like the music. Support this great duo at

Risky Backups

This editorial was originally published on Jan 13, 2009. It is being re-run as Steve is away at the SQL Intersection conference.

The other day I was working in the forums at SQLServerCentral and saw a question from someone doing backups and having issues. While others and I were trying to help, I recommended that this person not run multiple backups to one file. It’s advice I’ve followed and given for years, but the person followed up with a question about why I think it’s important.

My first answer was that it’s an unnecessary risk. If something happens to the file and you’ve stored multiple backups in this file, then you lose all the backups. I’ve seen this happen with older versions of SQL Server and after that time I made it a point to never stick multiple backups in one file. To me it’s creating a single point of failure for no reason.
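The alternative costs almost nothing. A minimal sketch, with a placeholder database name and path, writes each backup to its own uniquely named file instead of appending to one file with NOINIT:

-- Build a date- and time-stamped file name so every backup lands in its own file.
-- Database name and path are placeholders.
DECLARE @File nvarchar(260) =
    N'E:\Backups\SalesDB_Full_'
    + CONVERT(nvarchar(8), GETDATE(), 112)                            -- yyyymmdd
    + N'_' + REPLACE(CONVERT(nvarchar(8), GETDATE(), 108), ':', '')   -- hhmmss
    + N'.bak';

BACKUP DATABASE SalesDB
    TO DISK = @File
    WITH INIT, CHECKSUM;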

But as I wrote this, I started to question myself a little. I know that some people like having the same filename every day as it makes restores simpler, especially if they’ve scripted things. I know that file systems and hardware are often redundant and that files are generally safer in the short term than they were in the past, so is my advice outdated?

I still don’t think so, because in my mind there’s a real risk of issues here, and that outweighs the convenience. Not to mention that some simple scripting can make restoring to something like QA easy enough for anyone to run. However, I’m curious what the rest of you think about storing multiple backups in one file.

Steve Jones

The Voice of the DBA Podcasts

Everyday Jones

The podcast feeds are now available to get better bandwidth and maybe a little more exposure :). Comments are definitely appreciated and wanted, and you can get feeds from there.

Overall RSS Feed:  or now on iTunes! 

Today’s podcast features music by Everyday Jones. No relation, but I stumbled on to them and really like the music. Support this great duo at

The Need for Tape

This editorial was originally published on Mar 26, 2009. It is being re-run as Steve is away at DevConnections.

One of our SQLServerCentral authors sent me this post, which I found to be pretty amazing. Apparently hackers broke into a backup server at WebHostingTalk, which is a community for hosting providers to discuss issues. They deleted backups first, and then moved on to deleting tables from the database.  It resulted in some downtime, and then a restore of an old copy of their database prior to trying to restore more recent copies. More information from the admin is here.

Now that’s just malicious and nasty.

Hopefully this wasn’t an inside job, though I could see that as a possibility. For a service like this, one that just provides a place for a community and lets people interact and talk, this is just vandalism. It serves no purpose. It’s likely no one even knows it was “M@M@sB0Y” or some other hacker, so there’s no fame, and it disrupts people who just want to converse with colleagues.

I’d hate to think about this happening here. We have lots and lots of posts from people all over the world, and while losing them wouldn’t kill us, it would really annoy the many people who have volunteered their valuable time to help others by costing them their work. We definitely need to make sure we don’t have an issue here.

I see two takeaways from this incident for most DBAs and administrators. First, be sure that your backup servers are just as secure as your production ones. There is production data in the form of backup files here, and you should be providing as much security for them as for other servers, and perhaps more. Don’t treat these servers lightly.

Second, I think this shows that there’s a need for tape backups, or some type of non-linked backup. A tape system grabs the files from your server completely separately from the SQL Server (or other application). It is a pull operation, and it’s not obvious from the source server where those files have gone. That’s good in that it prevents an attack on the main server from propagating onward. Most people use a push from their server to a remote device as part of the backup process. That’s OK, but it provides an easy link for someone to attack the backup server from the main one.

And one more benefit of tape? Usually they’re rotated out, so even if a hacker or insider knows how to get to the backup system, without physical access they can’t touch your tape.

Steve Jones

The Voice of the DBA Podcasts

Everyday Jones

The podcast feeds are available. Comments are definitely appreciated and wanted, and you can get feeds from there.

You can also follow Steve Jones on Twitter:

Overall RSS Feed:  or now on iTunes! 

Today’s podcast features music by Everyday Jones. No relation, but I stumbled on to them and really like the music. Support this great duo at

I really appreciate and value feedback on the podcasts. Let us know what you like, don’t like, or even send in ideas for the show. If you’d like to comment, post something here. The boss will be sure to read it.

Disaster Recovery Docs–Offsite

Years ago I worked at an ISO-certified organization, where we took our administration and planning seriously. We ensured everything was backed up, we had backup drills, and we each practiced recovering systems, loading tapes, even calling the company that rotated our tapes from an offsite location.

One of our administrators was very conscientious and documented everything. He had screen images and copies of vendor documentation, all linked and labeled in a great folder on our network. We used this documentation constantly while restoring systems after mistakes or drills, and we updated it regularly. It contained vendor contact information and account numbers. Each of us could access it remotely, and I had used it a few times from home while working on a problem with a system.

One day our auditor came in, and in the course of examining our processes he remarked on the comprehensive documentation and said it was great. Then he reached over and turned off our very detail-oriented administrator’s monitor.

“Now show me your documentation,” he said.

Our administrator was a little shocked. All the documentation was on the network. We’d accessed it from home. We accessed it from other corporate locations. We had restored it from tapes.

We’d never thought it would be unavailable.

It was a sobering lesson, one that had interns printing off copies that afternoon, dating them and binding them into books, one of which was always rotated offsite to the tape vendor’s facility.

As for the tape vendor information itself? After overcoming his embarrassment, our administrator made wallet-sized cards with the vendor company, phone number, address, and account number. He laminated those and gave us each one to keep on our person at all times.

Make sure you have copies of the stuff you need offsite. It’s easy now with the “cloud”, though you had better secure the data with encryption and limit key access to a few people. You never know what you’ll need to recover, but you should never find yourself in a position of not being able to recover your documentation.

DR Failovers

Almost everyone struggles with setting up disaster recovery plans and resources. There are a few companies that take their DR seriously, but for most organizations, it’s an afterthought. It’s an insurance premium that can easily be avoided if there are no pressing problems and your past experience with disasters is minimal. After all, it’s rare that any of our data centers shuts down because of an earthquake, hurricane, fire, or other similar large-scale event.

However, most of us try to have some type of disaster recovery in place. We may have cold or warm systems available. Our companies may have funded an AlwaysOn Availability Group or, more likely, mirroring and/or log shipping for critical systems, where data is moved to a remote location on a regular basis. We monitor these processes and try to keep them running, though I’m sure that when they break, many people don’t give the repair top priority in their daily work.

A DR environment is like a backup. If you don’t test it, you’re never sure it’s really something you can use in a disaster situation. You may periodically test your failover, just as you test a restore, but do you ever really lean on the secondary system? This week, I wanted to ask you this question:

Do you fail over to your DR system and run your business from the system for a day (or week or longer)?

I know a few companies that consider secondary systems to be critical and will actually fail over, run the other system for a few months, and then fail back to retest the primary system. In this case, there really isn’t a primary and a secondary system, but rather two systems that can both do the work, used alternately throughout the year.
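For an Availability Group, the mechanics of that kind of planned swap are simple enough. A sketch with a placeholder group name, run on the synchronized secondary you are failing over to, might look like this; the hard part is the operational discipline around it, not the command:

-- Planned (no data loss) failover; requires synchronous commit and a synchronized secondary.
-- The availability group name is a placeholder.
ALTER AVAILABILITY GROUP [SalesAG] FAILOVER;

-- Afterwards, confirm which replica is now primary.
SELECT ag.name AS AGName, ar.replica_server_name, ars.role_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar ON ars.replica_id = ar.replica_id
JOIN sys.availability_groups AS ag ON ars.group_id = ag.group_id;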

This is actually moving closer to a cloud architecture model, where you don’t place high importance on any particular system, you assume any system can fail, and you have redundant systems that can pick up the load. In a cloud environment you might have more than two, relying on dozens of systems instead, any of which could fail, but with a small interruption in service.

I’d hope that SQL Server, the version of the platform I can install in my own data center, would get close to this, allowing me to serve database services to clients while seamlessly moving those services across instances, with clients unaware when physical machines have crashed because their services just moved to another host.