Updates During Outages

Slack went down on June 10, 2016. I noticed, which is funny, since I don’t necessarily use the tool every day. I tend to still communicate with email, but more and more people like using Slack at Redgate, so I do pop over there. It’s also a good way to drop a quick note and perhaps get a quick reply. In this case I’d asked a group to do something and hadn’t heard back. Since my request didn’t generate a ticket, I didn’t want to send another email, which could result in more delays if someone isn’t processing email. However, that’s not important.

What was interesting is my Slack web page didn’t connect, and when I pinged @slackhq on Twitter, they directed me to status.slack.com. That’s where they post updates. That site was also down, which somewhat defeats the purpose of having a separate location for updates.

I’ve experienced this a few times, where someone (sometimes me) has built an update or notification mechanism that depends in some way on the very service we’re updating people about. Often that’s because the same HTTP server is being used, but sometimes it’s because the same database instance is used to allow non-technical people to post updates. In all those cases, at some point the update mechanism has broken.

I’ve learned to actually separate my update broadcast mechanism from the production server. We’ve done this in a few ways. I’ve had Web applications include a simple text file, in addition to a static page that can be served from a web server. I’ve used a separate physical host that can be moved to the proper IP address in the event that our firewall or load balancers don’t work. The key, I’ve learned, is separation. Have a separate resource that can deliver a simple message back to users. Perhaps even a small database that can respond to queries with a “we’re down” reply.
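
For example, the “small database” option can be as simple as a status table on a completely separate instance that the application or status page falls back to. A minimal sketch (the names here are mine, purely for illustration):

CREATE TABLE dbo.ServiceStatus
(
    ServiceName   VARCHAR(100) NOT NULL,
    StatusMessage VARCHAR(500) NOT NULL,
    LastUpdated   DATETIME2    NOT NULL DEFAULT SYSDATETIME()
);
GO
-- A non-technical person (or a script) posts the outage note here,
-- and the status page reads it when the main system is unreachable.
INSERT dbo.ServiceStatus (ServiceName, StatusMessage)
 VALUES ('MainApp', 'We''re down for maintenance. Updates every 30 minutes.');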

Downtime is never good for users, and rarely are people pleased with being unable to access their system, but good communication goes a long way to soothing the hurt feelings. Most of us accept that systems go down and problems occur. What we’d like is a short note (and updates) that let us know something is being done.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (3.2MB) podcast or subscribe to the feed at iTunes and LibSyn.

What’s Your Test Plan?

I ran across a post on upgrading a SQL Server instance, where the original poster (OP) was asking for a document to upgrade from SQL Server 2008 to 2014. That’s a big ask, especially as not many documents are written to span three versions. The official ones, or the people who document well, tend to upgrade every version, and so they have a document to go from 2008 to R2, from R2 to 2012, etc. However, given the pace at which Microsoft releases things, perhaps we ought to build a document and checklist across every 2-3 versions, since many people may be upgrading from 2005/2008 to 2014 (or soon, 2016).

Apart from having the document, one of the questions was a list of what to test. That struck me, as I’m big on testing, and have tried to keep comprehensive plans when I had important systems. However, many of you might be like me and not consider most systems to be “important”. For those systems, a patch, a Service Pack, or an application upgrade was really tested by applying the change to a test server and asking users to see if they could use the application. I’m not confident that there was any sort of comprehensive look at the system in these cases, but this approach worked most of the time.

There were some instances that we deemed important, usually measured as such because a failure would mean some high level manager would call my manager’s boss and smelly things would slide in my direction. In those cases, we had a list of the functions and processes that needed to work. These could be application functions, queries, ETL packages, reports, or anything that would cause a user to complain. This list became our test plan, and it was kept up to date. Usually it was updated after the fact, since we weren’t sure which new things were important until they failed for some reason, but once we received a ticket on an item, we added it to our list. We went through the entire list for upgrades, ensuring each item worked.
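
If you want that list somewhere queryable rather than in a document, even a single table works. A rough sketch (the structure is mine, just to illustrate the idea):

CREATE TABLE dbo.TestPlanItem
(
    ItemID       INT IDENTITY(1,1) PRIMARY KEY,
    SystemName   VARCHAR(100) NOT NULL,
    ItemType     VARCHAR(50)  NOT NULL,  -- application function, query, ETL package, report
    Description  VARCHAR(500) NOT NULL,
    SourceTicket VARCHAR(50)  NULL,      -- the ticket that got the item added to the list
    LastVerified DATE         NULL
);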

I’m wondering: do many of you have a test plan for your systems? Any system? It doesn’t matter if it’s automated or manual, but if you had to patch or upgrade instance X, is there a list of things you’d verify? Or is the system not that important, so you’d just make sure the database service was running? Let us know what your test plans look like.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (3.1MB) podcast or subscribe to the feed at iTunes and LibSyn.

Solving Tempdb Issues

While reading Remus Rusanu’s piece on SQL Server 2016 development, one item struck me. In part of the piece he notes that an engineer at Microsoft realized there was a tempdb issue with Azure SQL Database. To solve it, a failover was needed, which could be problematic. The basic issue was that the tempdb files were sized unequally, discovered after “months of investigation.”

Now, on one hand this seems silly. After all, we’ve known since SQL Server 2005 that the guidance is for all tempdb files to be sized the same. Shouldn’t engineers at Microsoft be following all the practices known for optimizing SQL Server performance? I think everyone at Microsoft should follow this, especially those working on other products (*cough* SharePoint *cough*), but at the very least SQL Server engineers should have a huge list of checks, perhaps with Policy-Based Management (PBM) or unit tests, that warn about non-compliance.

On the other hand, since we’ve known this is an issue since SQL Server 2005, why does SQL Server allow this? I’d think some simple tooling and a few checks in code could eliminate this as an issue in any new install. Catch the issue during upgrades and grow (or shrink) files to be the same size, or at least warn the administrator. I know there are situations where you need to add a new file in an emergency, but shouldn’t this be an item we push administrators to correct quickly after the emergency is past? Or at least force them to consciously make an exception.
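
As a rough example of the kind of check I mean, something like this (just a sketch) would flag unequally sized tempdb data files, and could run as a policy or a scheduled job:

-- tempdb data files should all be the same size
SELECT name,
       size * 8 / 1024 AS size_mb
FROM tempdb.sys.database_files
WHERE type_desc = 'ROWS';

SELECT CASE WHEN MIN(size) <> MAX(size)
            THEN 'Warning: tempdb data files are not equally sized'
            ELSE 'OK: tempdb data files match'
       END AS tempdb_file_check
FROM tempdb.sys.database_files
WHERE type_desc = 'ROWS';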

There are plenty of limitations and restrictions in SQL Server systems that Microsoft forces on us. We have limits on where new features can be used, various historical limits on function parameters, and more. Why not also enforce limits that prevent issues like this? I’m sure people will complain, but wouldn’t those complaints be outweighed by fewer issues overall across all customers?

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.3MB) podcast or subscribe to the feed at iTunes and LibSyn.

Stress Testing

Many of the DBAs who manage production systems will at some point determine what level of hardware is needed to support a workload. Whether this is a physical server purchase or a cloud “rental”, someone has to decide what hardware is needed: how many cores, the amount of RAM, the number of disks (which hopefully correspond to some level of IOPS), and more. Even in the Azure SQL Database world, you must decide what database capacity you will pay for.

Since this is a big decision, and changes can be hard to make, many DBAs overbuy hardware. After all, no one wants to have a slow server. This is true for Azure as well, at least for many people I know. While changing from an S3 to a P2 is quick and easy in the Azure portal, it’s not such an easy sell to management. If they’ve budgeted $150/month and you tell them we want to go to $900/month, the technical change is the easiest part of this.

As a result, I’m surprised that we don’t really have better ways to determine if hardware will support our workload. I see this question asked all the time, and although there are tools and techniques suggested, I’ve yet to see many people with a set, standard way of evaluating hardware against a particular workload.

On one hand, I think there should be better tools to do this, whether from Microsoft or someone else. I suspect that because this is such a rare activity, and because businesses have been willing to overbuy hardware (or deal with substandard performance), there isn’t any large impetus to solve this problem.

However, I wanted to ask if any of you actually stress test hardware, either your current hardware or new purchases. If you don’t know what level your current hardware performs at, how do you compare that to new hardware?

Do you have a way to replay and measure a workload? Do you have the time to do so when new hardware arrives? Is there a documented method you use? Apart from discussing this today, I’d love to see some articles that detail exactly how you test hardware from a technical tool perspective, and then a follow-up that examines and evaluates the results.
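
Even without a replay tool, you can at least capture comparable numbers on old and new hardware by snapshotting wait statistics around the same workload. A bare-bones sketch (the holding table is my own invention):

-- Snapshot wait stats before the test run
SELECT wait_type, waiting_tasks_count, wait_time_ms, SYSDATETIME() AS capture_time
INTO dbo.WaitStatsSnapshot          -- hypothetical holding table
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;

-- ... run the workload ...

-- Capture again afterwards, then diff by wait_type to compare runs
INSERT dbo.WaitStatsSnapshot
SELECT wait_type, waiting_tasks_count, wait_time_ms, SYSDATETIME()
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;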

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (3.2MB) podcast or subscribe to the feed at iTunes and LibSyn.

What’s Your Downtime?

That’s the question this week: What’s your downtime?

I thought about this as I read this piece on Azure’s downtime being greater than its rivals in 2014. You do need to provide an email to read the article, but essentially the article builds on tracking from CloudHarmony to show that Azure had more downtime for the year, with about 54 hours. In Europe, that high water mark was 5.97 hours for compute and 1.31 hours for storage, so the European Azure cloud is doing better than the others.

That’s pretty good. Certainly individual machines went down, and services were unavailable for short periods of time during failover, but keeping a hardware platform to around 5 hours of downtime a year is good. I’m not sure many of the companies I’ve worked for have done this, though to be fair, mostly it’s been patches from Microsoft that caused downtime for Windows machines.

However, let’s look at your database instances. Do you know what your downtime is? I’d encourage you to track it, and be sure that you report on it, or at least have the data in case the boss asks. I don’t know what our SSC total downtime is, but I can tell you that our cluster rebooted in Aug 2015 and Mar 2016, brief outages for some patching. In the past, I’ve typically seen our database instances run the better part of a year, usually only coming down for a Service Pack.
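
If you don’t track downtime formally, the instance start time at least tells you how long you’ve been up since the last restart (it won’t show you historical outages, of course):

SELECT sqlserver_start_time,
       DATEDIFF(DAY, sqlserver_start_time, SYSDATETIME()) AS days_up
FROM sys.dm_os_sys_info;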

If you have monitoring in place, and you should, then I hope you have some idea of downtime for your main databases. If you can share, see if you might set the high water mark for the SQLServerCentral community in the last year.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.4MB) podcast or subscribe to the feed at iTunes and LibSyn.

Restoring a Striped Backup–#SQLNewBlogger

Another post for me that is simple and hopefully serves as an example for people trying to get started blogging as #SQLNewBloggers.

Recently I uncompressed some SQL Backup Pro files. Since multiple threads were used to make the backup, I ended up with multiple files in my file system, as shown here:

[Screenshot: the uncompressed backup files in the file system]

Each of these is one stripe of the backup, a piece of the backup file. To restore the backup, I need all of the files to reassemble it. This is fairly simple, but you should be aware of how it works and how to perform the restore.

In my case, you can see I have 7 files for each database. They have the same name with an _0x suffix, where x is the number of the file.

To restore these, I go to SSMS and click the restore choice. That gives me my restore database dialog, where I can select that I’m restoring from devices. As you can see below, no devices (files) are selected.

[Screenshot: the Restore Database dialog with no devices selected]

I then click the ellipsis to get a dialog allowing me to add files.

[Screenshot: the Select backup devices dialog]

Hopefully you’ve done this before, and you can click “Add” to add files. You need to navigate to the location of your backup files if it isn’t the default.

[Screenshot: the Locate Backup File dialog]

Next you can select the files. Holding down CTRL, I can multi-select files.

[Screenshot: multi-selecting the backup files in the Locate Backup File dialog]

Once I pick them, I click OK and then I see them all in the device dialog.

[Screenshot: all of the backup files listed in the Select backup devices dialog]

Now I click OK and SQL Server reads the headers, and I end up with a single database to be restored, as shown below.

[Screenshot: the Restore Database dialog showing the single database to restore]

Now, I can click OK, and often do in development areas. HOWEVER, if you are doing this in production, please, please, please, click the Script button instead. You’ll get a new query window, and you can cancel out of this dialog to see the code.

[Screenshot: the scripted RESTORE statement in a new query window]

From here, you should click “Save” and save this, then execute it.

As you can see above, the statement is simple. List each disk file, separated by a comma. Then the rest of the RESTORE statement is a standard restore.
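
For reference, the scripted statement looks roughly like this; the file names and paths below are illustrative, not the exact ones from my screenshots:

RESTORE DATABASE SQLServerCentral
 FROM DISK = N'D:\Backups\SQLServerCentral_01.bak',
      DISK = N'D:\Backups\SQLServerCentral_02.bak',
      DISK = N'D:\Backups\SQLServerCentral_03.bak',
      DISK = N'D:\Backups\SQLServerCentral_04.bak',
      DISK = N'D:\Backups\SQLServerCentral_05.bak',
      DISK = N'D:\Backups\SQLServerCentral_06.bak',
      DISK = N'D:\Backups\SQLServerCentral_07.bak'
 WITH STATS = 10;
GO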

SQLNewBlogger

This is a fairly simple task, one I’ve done dozens of times, but it’s worth practicing. If you want to write about it, what happens if you’re missing a file? What if you change the order of files? This was a 10 minute blog, and it’s a good chance to play and practice your restore skills, which are as important as anything you do as a DBA.

Creating a Database Snapshot

Another post for me that is simple and hopefully serves as an example for people trying to get started blogging as #SQLNewBloggers.

I’ve rarely dealt with database snapshots, outside of lab experimentation. They didn’t exist when I did most of my DBA work, and since then we haven’t seen the need for them at SQLServerCentral, though I may suggest we add them to our deployment process, since we could quickly roll back if something breaks.

However, I created one recently for a test and realized that I’d forgotten the syntax. Unlike the quick "create database xx" syntax I often use, with a snapshot I need to be more specific.

The big item that I must specify is the location of the snapshot file. This is the file that gets written to by the copy-on-write process, which ensures the original state of the pages, as of when the snapshot is created, remains available.

You do need to give the database a logical name as well, which can be anything, but the examples below show why a descriptive name helps. I’d say that this is bad:

CREATE DATABASE sandbox_snapshot
ON (NAME = MySnap
, FILENAME = 'h:\SQL Server\MySnap.ss'
)
AS SNAPSHOT OF Sandbox

But this is better:

CREATE DATABASE sandbox_snapshot_20150122_1345
ON (NAME = SandBox_Snap_20150122
, FILENAME = 'h:\SQL Server\MySnap.ss'
)
AS SNAPSHOT OF Sandbox

Because the snapshot is based on an existing database, at a particular point in time, it’s useful to specify the time when the snapshot was created, if possible. While you can get this from metadata, if you have people that look to multiple snapshots for information, it can be handy to know when each is from.
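
If you do want the creation time from metadata, sys.databases has it, along with a pointer back to the source database:

SELECT name,
       DB_NAME(source_database_id) AS source_database,
       create_date
FROM sys.databases
WHERE source_database_id IS NOT NULL;   -- snapshots only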

However if you are providing this on a schedule, like daily, for people to report from, you might need to have the same name every day. Think about this, as you cannot rename a snapshot once it’s created.

SQLNewBlogger

When I realized I had to look up the syntax, I took a few notes and captured code, which meant I combined this writing (< 10 minutes) with other work I was doing.

You should do the same. When you tackle something new, take screenshots, save code, and drop it into a OneNote/Evernote/etc. notebook for your weekly blog writing.


Disabling SQL Server Network Protocols

Another post for me that is simple and hopefully serves as an example for people trying to get started blogging as #SQLNewBloggers.

I ran across a question on network protocols recently, which is something I rarely deal with. Often the default setup for SQL Server is fine, but there are certainly times you should add or remove network connectivity according to your environment.

Here’s a short post on turning off (or on) a network protocol for SQL Server.

What’s Enabled?

The easiest way to verify what’s enabled is to use the SQL Server Configuration Manager. You’ll need administrative permissions on the host to run this, but it’s easy to find.

[Screenshot: finding SQL Server Configuration Manager from the Start menu]

Once you open it, typically you’ll have a list of the items that can be configured.

[Screenshot: the list of items in SQL Server Configuration Manager]

We want the SQL Server Network Configuration, which is the server-level configuration for this host. The client configurations are for when the host is used as a client to connect to a SQL Server.

[Screenshot: the network protocols for the instance under SQL Server Network Configuration]

As you can see here, I have Shared Memory and TCP/IP enabled for this instance, but Named Pipes disabled.
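
If you’d rather check from a query than the GUI, you can also see which protocol each current connection is actually using:

SELECT session_id,
       net_transport,     -- Shared memory, TCP, or Named pipe
       auth_scheme
FROM sys.dm_exec_connections;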

Disabling a Protocol

As you might expect, this is easy. I right-click a protocol, and I can change the status. In this case, I’ll disable Shared Memory.

[Screenshot: disabling the Shared Memory protocol from the right-click menu]

Once I do that, the protocol is marked as disabled, but not yet on the running instance. I’ll get this message.

[Screenshot: warning that the change takes effect only after the service restarts]

I need to restart the SQL Server instance. Once that’s done, no one will be able to use Shared Memory on the host.

I can fix this by re-enabling the protocol the same way.

[Screenshot: re-enabling the Shared Memory protocol]

Of course, I need to restart my instance again.

Checking the Log

When SQL Server starts, quite a bit of configuration information is written into the log. This is useful for troubleshooting in many cases. One of the things you’ll find is the network listeners, as shown here.

[Screenshot: the network listener entries in the SQL Server log, viewed in Log File Viewer]

This is usually after the database startup information, so if you look, you can see I have some local pipes and some TCP/IP settings here.

SQLNewBlogger

After reading a question, this was less than 10 minutes to write, including making the screenshots. However, I’ve done this before. If this were your first time, it might take you longer to research and write, but I bet most DBAs could do this in 30-45 minutes.

Tracking Logins with Extended Events

Another post for me that is simple and hopefully serves as an example for people trying to get started blogging as #SQLNewBloggers.

I was writing a question the other day on Extended Events and needed to build a new session. One of the easier sessions to build is one that tracks logins, so I decided to use that, and then wrote this post on how to build the session.

The first thing is to go to the Management folder in SSMS and then the Extended Events folder. I right-click the Sessions folder and select New Session.

[Screenshot: the Extended Events Sessions folder in SSMS, with New Session selected]

This gives you a dialog. Like many in SSMS, we start with a name. You could also select the option to start the session at server startup, which I’d do if this were a production system.

[Screenshot: the New Session dialog, naming the session]

When I click next, I get to the "Events" tab, which lists all events. I’ll scroll down to login and select that. I need to click the arrow to the right.

[Screenshot: selecting the login event on the Events page]

Once I do that, my event is in the session.

[Screenshot: the login event added to the session]

After I pick the events, I choose the fields I’m going to capture. There is a "Configure" button in the upper right that you click. This scrolls the dialog over.

[Screenshot: the Configure pane for the selected event]

I can select any number of fields for capture. You can see I’ve picked the client_hostname. I would also add the NT_Username and Username from the list. You could add more, but in this case, I’m more concerned with seeing who’s logged in.

I could add filters, but I choose not to. I click on Data Storage to determine where to store this data.

[Screenshot: the Data Storage page of the session dialog]

For auditing, I might want a file. In this case, for testing, I’ll use the ring buffer, an in-memory target.

[Screenshot: the ring_buffer target chosen for the session]

That’s it for getting the session set up. However, it’s not started yet. To start it, I need to right-click the session and select Start.

[Screenshot: starting the session from the right-click menu]

This allows the server to start collecting data. Is it working? Let’s see by watching the data. Right-click the session again and select Watch Live Data.

[Screenshot: selecting Watch Live Data for the session]

This pops open a window, which I usually make a separate vertical tab group. Once that’s open, I’ll click “New Query” in SSMS, which will log me in again.

[Screenshot: login events appearing in the live data viewer]

As you can see, a few events popped up here. I am capturing data. Don’t forget to stop the session after this if you don’t need it.
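
If you prefer a script to the dialogs, roughly the same session can be created in T-SQL. This is a sketch of what the GUI above builds (the session name is mine):

CREATE EVENT SESSION TrackLogins ON SERVER
ADD EVENT sqlserver.login
 (ACTION (sqlserver.client_hostname, sqlserver.nt_username, sqlserver.username))
ADD TARGET package0.ring_buffer;
GO
ALTER EVENT SESSION TrackLogins ON SERVER STATE = START;
GO
-- and when you're done
-- ALTER EVENT SESSION TrackLogins ON SERVER STATE = STOP;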

SQLNewBlogger

This post came out of work I was doing, and which I’d likely do as a DBA. However as soon as I got things working and tested, I knew this was a good post. In fact, I got a couple posts from the process. The setup and testing took about 20 minutes, including a little research. However the writing for this was about 10 minutes.


The Writeable Warm Standby

I saw a question recently that went like this: I get one full backup from FTP. I’ll get daily log backups through FTP after this, but never another full. I need to restore this daily log backup and allow the group receiving the database to read/write the copy, and then reset it with the new log backup overnight.

First, this is untenable. At some point you’ll have an issue with the transfer, lose a log, or the database will become corrupt. I can guarantee you that at some point you will need another full backup. Not every week, or even every month, but you will need one.

Second, this is a tough situation. I saw some answers, which I agreed with, but I started thinking about ways to get that data moved. My first thought was to use STANDBY and move the data every day to a clean database. I’ve done this before, and in the GB range, even 100s of GB, this can work. It helps if you can whack indexes and constraints on the destination, but a table-by-table copy of data goes fast.

Then I thought about other ways. You can’t take a backup of a standby database, nor can you take a snapshot of one. However, while searching, I saw an answer to this post on SO.

TL;DR: copy the mdf/ldf to a new database.

That was interesting, so I decided to test it. Turns out, it works pretty well. HOWEVER, it’s dangerous, and I think you should be very careful about this. I wouldn’t count on this being production stable, and certainly not data stable. You better have other copies of this data.

Here’s what I did. First, create a database with some data.

CREATE DATABASE mydb;
GO
USE mydb;
GO
CREATE TABLE mytable(id INT);
GO
INSERT mytable SELECT 1;
GO

Next, let’s back this up, take it offline, and then copy files.

USE master;
GO
BACKUP DATABASE mydb
 TO DISK = 'mydb.bak';
GO
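
The standby copy itself isn’t scripted in this post, but the later steps assume a database named mydb_reporting_base that was restored from this backup WITH STANDBY. That restore would look something like this (a sketch; the paths come from the log restore further down):

RESTORE DATABASE mydb_reporting_base
 FROM DISK = 'mydb.bak'
 WITH  MOVE N'mydb' TO N'D:\SQLServer\MSSQL11.MSSQLSERVER\MSSQL\DATA\mydb_reporting_base.mdf'
     , MOVE N'mydb_log' TO N'D:\SQLServer\MSSQL11.MSSQLSERVER\MSSQL\DATA\mydb_reporting_base_log.ldf'
     , STANDBY = 'D:\SQLServer\MSSQL11.MSSQLSERVER\MSSQL\Backup\undo.log';
GO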

Now we can copy the files to new files. There are UAC issues here, so you’ll need to add some rights to the files if you do this regularly. I left a comment in my script, but I actually did a CTRL+C,CTRL+V in the data folder, allowing the UAC permissions to work.

[Screenshot: copying the data files in Windows Explorer]

Once that was done, I had new files:

[Screenshot: the copied files in the data folder]

I tried to run an attach, but got a permissions error:

Msg 5120, Level 16, State 101, Line 30
Unable to open the physical file “D:\SQLServer\MSSQL11.MSSQLSERVER\MSSQL\DATA\mydb_reporting.mdf”. Operating system error 5: “5(Access is denied.)”.

The solution I found was to run SSMS as an administrator. Annoying, but it works.

At least for the OS error. However, then you get this:

Msg 1824, Level 16, State 1, Line 35 Cannot attach a database that was being restored.

You can’t attach it. At least not easily.

However, there is a workaround. First, delete the files you copied over, then run this:

CREATE DATABASE mydb_reporting;
GO
ALTER DATABASE mydb_reporting SET OFFLINE;

The next step is to delete the MDF and LDF files, which will be mydb_reporting.mdf and mydb_reporting_log.ldf  by default. I could specify other names, and would if this were something I needed to script.

Once those files were deleted, I’d next copy my files again and rename them. That would result in this:

  • mydb_reporting_base.mdf –> mydb_reporting.mdf
  • mydb_reporting_base_log.ldf –> mydb_reporting_log.ldf

Now I can go back to SSMS. In SSMS, I do a simple ALTER.

ALTER DATABASE mydb_reporting SET ONLINE;

Then I can run this:

[Screenshot: querying the mydb_reporting database]

I have a copy of my database. Can I apply logs and move forward? Let’s try. First, let’s add data and make a log backup.

USE mydb
GO
INSERT mytable SELECT 99
GO

BACKUP LOG mydb TO DISK = 'mydblog.trn'   
GO

Next, we restore the log and set the databases offline again.

RESTORE LOG mydb_reporting_base
 FROM DISK = 'D:\SQLServer\MSSQL11.MSSQLSERVER\MSSQL\Backup\mydblog.trn'
 WITH  MOVE N'mydb' TO N'D:\SQLServer\MSSQL11.MSSQLSERVER\MSSQL\DATA\mydb_reporting_base.mdf'
    ,  MOVE N'mydb_log' TO N'D:\SQLServer\MSSQL11.MSSQLSERVER\MSSQL\DATA\mydb_reporting_base_log.ldf'
    , STANDBY = 'D:\SQLServer\MSSQL11.MSSQLSERVER\MSSQL\Backup\undo.log';
GO
USE master
GO
ALTER DATABASE mydb_reporting_base
 SET OFFLINE;
GO
ALTER DATABASE mydb_reporting SET OFFLINE
GO

Once again it’s file copy time. The UAC comes into play again as I copy and rename the _base files. However once that’s done, things work.

[Screenshot: the new row visible when querying the refreshed mydb_reporting copy]

This works, but I am not recommending it as something you should do, especially for critical systems. It can work, but it’s dangerous, and really not supported by Microsoft.