We had a disaster at SQLServerCentral this past weekend. It wasn’t a big disaster, but it was an event that required a restore of data.
An Administrative Error
On Saturday, I was attending SQL Saturday #331 in Denver. In between my sessions, I was prepping a few demos and finished getting ready earlier than expected. Since I had a few minutes, I checked my email and immediately knew we had an issue on the site.
We’ve been fighting spam for months, slowly tweaking our posting system in the forums. With the start of the American football season, we’ve been getting hundreds of posts every Friday and Saturday. I’ve tried to ensure these posts are removed before our newsletters are generated so that they aren’t filled with advertisements, but it’s been a chore.
One of the things I can do in the forums is select a series of posts and mass delete (or open, close, hide, etc) the group. For most of the SPAM posts we receive, the posts all occur in the same few minutes and are grouped together. I’ve gotten in the habit of deleting these batches of posts, watching for a legitimate SQL Server post at the end.
However on this day, one of the our regular threads was buried in the middle of all the SPAM posts. This was THE Thread, the most active and long lived discussion on the site with 45k posts. I inadvertently deleted the post and went on to give my presentation.
Afterwards, I got a private message from the site, telling me the discussion had been deleted around 2:00pm MST.
The first thing I did when realizing what had happened was connect to our production database cluster through VPN. When I opened Management Studio, I ran a few queries to verify the discussion had been deleted, and not just “marked for deletion”. Logical deletes exist in many applications, and if this is the case in your own disaster, the last thing you want to do is initiate a database restore.
In this case, the data was gone, so I immediately tried to initiate a restore. Since over an hour had passed, I didn’t want to restore over top of the current database. Instead, I wanted to restore a copy as a new database, as of 1:45pm or so.
I selected the proper options, marking the full backup from overnight and the log backups throughout the day. I didn’t have time to worry about using STOPAT and trying to get close to the actual time of data modification, so I choose the last backup I knew would be good. Verifying the database name was a new name, I clicked OK.
And nothing happened.
Actually, that’s not true as I got an error. The backup system in use by Rackspace, our provider, doesn’t keep the files available from SSMS. I don’t have rights to work within the restore system, or even request one from Rackspace, so I opened a ticket with Red Gates support for a restore.
Had this been a situation where the site was down or users were unable to read articles or post, I would have escalated this for immediate action. However since this was a restore of a single thread, and one that exists for entertainment more than education, I chose not to bother our IT staff on a Saturday night or Sunday.
When I woke up Monday morning, I had a message that the restore had been completed to my specified new database (SSCForumsOld) as of 1:30pm MST on Saturday. I hadn’t asked anyone to do more than this, so this was the extent of actions taken by Red Gate.
Again, I could have specified actions in more detail, but rather than try to explain to someone in email which thread, and which posts needed to be restored, I decided to handle this myself. After taking my children to school, I sat down and got to work.
I’ve known the PK of this particular thread since I’ve had to work with in in the past. Connecting from Management Studio to the production instance, I verified I could see the 45k messages in the SSCForumsOld database. I ran the same query on the SQLServerCentralForums database, and validated the data was still missing. I then built a query that would perform an INSERT..SELECT of the parent posts from the restored database to the production system. This took longer than expected, with the table having a number of locks for about 2 minutes. However the post details had been moved.
That left me with the need to move the actual words of each post, which are stored in a separate table (for some strange reason). Rather than lock up the forums for minutes, I spent time rewriting my next insert to use batches of 1000, and only insert those messages which had not already been moved. Since I could join on PKs, this went quickly, in a few seconds. I next changed my batch size to 5000, and this completed in about 15s.
That seemed like enough time to run quickly, but also a good batch of data and manually executed this 9 times to move all the data. A quick check on the site showed THE Thread was back and I posted a few notes to let users know.
There are still a few issues with points for users that have posted to this topic being incorrect, but that is a lower priority item and I am letting our developers look at it. There is at least one known bug with points, and it’s possible we have another here.
My personal lesson learned was that I need to move a little more slowly when removing SPAM. I also don’t want to trust myself to do it regularly, so I also spent part of Sunday morning writing a little code and scheduling a job to delete posts with certain patterns of titles that the spammers use. I tried to limit to those obvious subjects so that no legitimate posts are removed.
We are also escalating some of the issues with SPAM, and with the return of my manager from sabbatical, I’m hoping we can build a few more filters to limit the disruptions in the forums.
And of course, everyone that posts to THE Thread has my apologies for the mistake.