Zero Downtime for 2016

I’ve never felt that zero downtime is possible for a system over any length of time. There are no shortage of companies that seek to prove me wrong, and some are doing very well. For example, when is the last time that the Google search engine was down? It happens, though I haven’t seen this in a long time. Many of the highly available applications out there are built using as distributed applications across many machines, so that even if there are failures, they don’t cascade to interrupt users.

I was thinking about this as I saw a post asking database engineers to architect for zero downtime in 2016. That’s a good goal, and certainly if you work at a high profile retailer or service company, you should look for ways to improve availability.

In fact, I would guess that anyone struggling with specific events, like Black Friday, would be working on this problem constantly. I remember years ago Michelle Ufford spoke about the challenges at GoDaddy during the Super Bowl due to the advertisements the company ran. They had log files that couldn’t catch up to the load for hours and spent an entire year working to build a better database system.

Ultimately I think the best way to handle large spikes of activity is by building an application that avoids putting large spikes of activity on your database. Use messaging and queues to buffer traffic. Use read only copies of your database for traffic that doesn’t need to write to the main database. Anywhere that you can limit the load on your system can help prevent the database becoming a bottleneck for your system.

Perhaps more importantly, if you can spread the load, you don’t need to purchase more and more hardware. If nothing else, I think this is a good argument for better database architectures for applications. However if you’re like me, most of your systems will have to deal with the hardware that is available. In that case, the best you can do is write better T-SQL and ensure you have given SQL Server enough, but not too many, indexes.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.6MB) podcast or subscribe to the feed at iTunes and LibSyn. feed

T-SQL Tuesday #60 – Something New Learned

tsqltuesdayIt’s been five years, and that’s amazing. Not many things last for a year, much less five, but the T-SQL Tuesday party, started by Adam Machanic (B|T)., has been amazing and lots of fun for me.

This month, Chris Yates hosts and his theme is Something New Learned. It’s a great topic, especially given the aims of T-SQL Tuesday to spread knowledge out in the world and share it with others.


This is good timing for me as I took a one day class last week. At the PASS Summit, I spent Tuesday in Allan Hirt’s A to Z of Availability Groups, which tries to help you understand AlwaysOn and the Availability Group portion of the technology. It was a great experience, and Allan did a fantastic job of walking through an overview, and then details. BTW, I can see how this would be amazing, and while I haven’t been, I’m sure that Allan’s Mission Critical SQL Server classes are valuable ways to learn this stuff in a hands on environment in a way that should help you be productive quickly on your own systems.

The most interesting part for me is that this wasn’t an all day class where I listen to lectures, follow along in a workbook, and then move on. There were actual labs, and not labs that meant I downloaded scripts onto my machine and worked on them. Actually 3 node labs, dedicated to me, with instructions on how things were configured.

What did I learn? Quite a bit, but for me the big mysteries that I’ve struggled with on AlwaysOn setups have been some of the permissions. Going through the labs, and getting the permissions necessary in the AD domain. The few places I had issues in the lab exercises were almost all related to a permissions issue I missed or had set incorrectly.

I also went through the advanced versions of the labs, specifically to practice using Powershell for some config items. This was the chance to practice some skills and try to learn a bit more about how I can use PoSh for real world tasks. While the GUI might work well, I know that if I wanted to ensure I could build and create a lab in short order, or on demand, I’d really need PoSh scripting to ensure it was done correctly, and repeatedly. The lab reinforced that.

I also learned a bit about a better way to teach. I’ve been in a few classes and lots of sessions across the last few years, but this dedicated lab environment really made things much easier for me. The hands on work was valuable in actually working through the concepts. In fact, I’ll be going through it again today as I have access to the labs for 10 days, and the workbook, so I can set up another Availability Group today and see the things I’ve done wrong in my own lab setup.

Multiple AGs

Today was my pre-con with Allan Hirt on AlwaysOn. It was great, and I learned a few things. I knew a bit about AGs, but Allan has a ton of experience, and he talked about some architectural things that I didn’t understand.

We had virtual labs, which worked flawlessly. I managed to configure AGs with GUIs, PowerShell, and a bit of T-SQL. I configured multiple AGs in the lab, and finished early, though I had one PoSh error that stumped me a bit. I’m looking forward to resetting the lab and trying again on Friday, taking a few hours to try and rebuild AGs in my own way without help.

Though I suspect I may be sending an email to Allan before this is over.


Most of us that work with technology hate downtime. We don’t want a system that we’re using to go down. We don’t want any software that we depend on to fail when we need it. Most of all, we don’t want our phones ringing because some system we’re responsible for has gone down. We do everything we can to keep our applications online. We avoid patches. We try to test as much as possible before deploying changes. We also may apply generous amounts of hope and prayer.

However that’s not how all companies run their internal systems. Netflix has taken the opposite approach, actually creating downtime for some of their systems using what they call a “chaos monkey,” and they think it could help you. To be fair, Netflix doesn’t take their entire application offline, but they do cause failures in the hardware and software, specifically to see if their redundant and scaled-out architectures can limit the impact on users.

It’s an interesting idea, though one that I’ve not seen many companies be willing to implement. Netflix thinks you could benefit from it, but they also run a series of services that are scaled our across many machines. Many companies I’ve worked with have services on one machine handling an application, and they accept the risk that a system might fail and users will experience problems. Given the quality of modern hardware, that might be a good bet to place these days.

However more and more of us are running redundant systems for some applications. If you think the Chaos Monkey could help you, let us know.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.1MB) podcast or subscribe to the feed at iTunes and Mevio . feed

The Voice of the DBA podcast features music by Everyday Jones. No relation, but I stumbled on to them and really like the music. Support this great duo at

The Decision to Fail

failoverWe have a number of automated technologies that we can us to seamlessly move from primary to secondary systems without human intervention. SQL Server incorporates a number of these, and many companies use them to ensure their applications are highly available. However things don’t always go as planned in a disaster and sometimes humans get involved.

Unless you are one of the companies with a very large budget and high risk of business issues when systems failover, you probably have some sort of high availability (HA) or disaster recovery (DR) process that requires human intervention. Log shipping, for example, usually requires that some human reconfigure the application to use secondary servers. Even with Availability Groups, clustering, or database mirroring, you may need to manually fail back to primary systems.

In those cases, it’s not always a clear decision to do so. Many of the switches are disruptive, or have the potential to be disruptive. Cluster fail-overs should not impact the application, but there is a brief period where clients may not connect. Outside of disasters, Management, and often technical people, usually want to schedule any failovers after they have prepared the end user for potential issues, however brief.

In disaster situations, when there hasn’t been a complete failure of a system, you may not want to have unscheduled failovers right away. This week I want to know:

How do you make the decision to fail over from one system to another?

I’m speaking to you, the data professional or the administrator. I would guess that most of you are not the one that ultimately makes the decision to leave your primary systems. Often I’ve found that someone in management has to make the decision, but with input from the technical people. In that case, think about how you present the situation and pros and cons of the failover. Do you give hard numbers, like latency and relative CPU power in failover machines or do you attempt to quantify the effects on the business when secondary systems are in use.

I’ve rarely had a large budget for secondary systems. Network bandwidth, CPU and memory, and more are sometimes sacrificed in secondary systems in order to align the cost of these systems with the risk of needing them. In many cases, we didn’t have automatic failover for many systems because we had to know our primary systems would be down for more than 5 or 6 hours before we would switch to the backup environment.

If you have similar guidelines or processes in place, let us know.

Steve Jones

The Voice of the DBA Podcasts

We publish three versions of the podcast each day for you to enjoy.

Understanding HA

51JU2kM8vHL._SL160_PIsitb-sticker-arrow-dp,TopRight,12,-18_SH30_OU01_AA160_I’ve learned a lot about high availability (HA) and disaster recovery (DR) over the last two decades. Some of the things I’ve learned were from studying, some from experimentation and practice, and a few from hard lessons when some part of a system failed. The latter were how I learned the quickest and most effectively, but those were also the most stressful times in my career. Getting called by senior management because a system is down is not the way that I want to improve my skills.

I know many people struggle with HA and DR, which are two different concepts. They are designed to solve different problems for different situations. SQL Server includes a number of options for how you might handle either HA or DR situations (or both), and the list of options continues to grow over time. We have the AlwaysOn suite of technologies, database mirroring, replication, log shipping, and more.

All of the technologies that you might use to solve HA and/or DR problems may provide other benefits. The ability scale out, or spread load are a possibility with a few of these features. What I find, however, is that many accidental, junior, or otherwise inexperienced technology professionals that need to manage SQL Servers get confused. They seem to think HA and DR are the same thing, and one technology is somehow better than others.

Whether you need HR or DR, or some other solution, it’s important the you research, understand, and practice working with the technologies that you will implement. Books Online gives you a good starting point, and there are plenty of online resources where you can ask additional questions or debate the particulars of your situation.

Disasters rarely occur, but you should take the opportunities to hone your skills at working with the various technologies you may implement before things fail. Schedule failures of systems in your test environment. See if your HA technology keeps the system running if you remove a system, or test that you can recover using your DR techniques. When something actually goes wrong, you’ll be glad you prepared.

Steve Jones

The Voice of the DBA Podcasts

We publish three versions of the podcast each day for you to enjoy.

AlwaysOn and Contained Databases in SQL Server 2012

This talk is mostly based on contained databases in SQL Server 2012. The first part of the talk goes over the problems with current database movement, the idea of containment, and the implementation in SQL Server 2012. There are demos that cover the user and collation scenarios. The second part of the talk goes into the basics of AlwaysOn. I talk about the architecture, the various scenarios you can have, and the advantages and drawbacks of the technologies involved. There are no demos of AlwaysOn mostly due to time.

Length: 60 minutes

Slides: downloadable after the SQL in the City 2012 tour.

Code: Demo code

The High Availability Poll

Always On Diagram
Is it worth it to you to build a highly available database?

The cost of building highly available databases dropped with the introduction of Database Mirroring in 2005. Without the need to purchase identical hardware for a spare system, it became much easier and less expensive to set up extra systems that could handle a workload in the event of a disaster. This technology was improved in SQL Server 2008, and in SQL Server 2012 we have a new option: Always On.

When I first heard about the changes coming in SQL Server 2012, I thought for sure that many people might upgrade to take advantage of the HA enhancements and provide more stability for their systems. However as I see the actual changes in detail, and talk to more and more data professionals, I’m not so sure. Many of us have spent years building systems that tolerate most issues meet the majority of our needs. With that in mind, I had a question for the DBAs out there.

How important is a highly available database with automatic failover to your employer?

Having a spare system that automatically picked up the load when a database server failed would seem to be on the wish list of every user, but plenty of environments don’t implement one. Many companies desire such a configuration, but when they see the cost or complexity, many forego an implementation.

This Friday I’m wondering if your business sees this as critical and has implemented such a system, or are you more tolerant of minor issues, and willing to accept some downtime in a simpler environment.

Steve Jones

The Voice of the DBA Podcasts

We publish three versions of the podcast each day for you to enjoy.

High Availability Upgrades


Is SQL Server 2012 worth the upgrade for you?

SQL Server 2012 will be released soon. I have no idea of the exact date, though I suspect a peek at the SQL Server developers’ vacation schedule in Redmond might provide some clue. The announcement last year was a first half of 2012, so I do expect to see the product RTM sometime between now and July 1.

One of the highly anticipated features in SQL Server 2012 is the Always Onfeature, which will dramatically increase the ease with which we can build databases that can not only tolerate hardware failures, but be deployed (physically) further apart, and with more flexibility in how backups can be taken. If you haven’t read about Always On, I’d urge you to start looking at this feature and see if it’s something that can help your company.

Unlike Database Mirroring in SQL Server 2008 R2, which is limited to one secondary database, Always On will allow us to have multiple secondaries. The current mirror databases are unreadable, unless you count the kludgy database snapshot feature. In SQL Server 2012, however, we can actually use the secondary databases to query, providing some limited scale out capabilities. I think that’s fantastic, and it’s a large step forward for the SQL Server platform. I’m wondering if your company feels the same way.

Will the read only secondaries available in SQL Server 2012 convince your company to upgrade?

There are lots of other features in SQL Server 2012, some great advances in other parts of the platform that might be worth upgrading for, but I regularly see people struggling with scale issues. With the changes in licensing as well, these new scalability features are worth examining, and I’m wondering how important they are for your business. Important enough to justify the price of an upgrade? Let us know today.

Steve Jones

The Voice of the DBA Podcasts

We publish three versions of the podcast each day for you to enjoy.

The Chance of Failure

I’ve always worked on the notion that my hardware rarely fails, and while I protect it with backups RAID, etc., I don’t expect failures. That seems to be how most of the RDBMSes are structured, with this idea that things will work well and we have protection mechanisms when they don’t.

Contrast that with the way Google viewa the world. To them,failure is inevitable at scale, as noted by  Jeremiah Peschaka in The Promise and Failure of Federated Data. At large scales of deployment, Google and other companies assume that there will be a percentage of failures and they have to account for them. This is the same type of accounting that restaurants use (spoilage) and retailers (shrinkage) use for allow for some amount of loss.

In technology, we account for potential failures with RAID, with HA or DR technologies, and hopefully with substantial testing to ensure that we have properly accounted for potential failures. However it seems that most technology people account for failure as a possibility and not a probability. Many people seem to assume that a serious disaster is not likely in their career.

I think that a catastrophic, we lost the whole data center, event is unlikely for most of us. As Hurricane Katrina and the recent earthquake in Japan have shown, that is possible. However it’s unlikely for most locations, which is a good thing.

Failures are inevitable and whether it’s a disk corruption, a server crash, or a building power failure, we have to assume we will experience one and plan for the event. We also have to expect hardware will fail, which means regular checks and monitoring to detect these failures as soon as possible.

Hope for the best, but plan for the worst.

Steve Jones

The Voice of the DBA Podcasts