One of the core tenets of good software design is to abstract away details in any particular part of an application. We want to use well defined and constructed interfaces so that the implementation of any particular method or section can change without affecting the way the system works. This allows for improvement and upgrade over time. The same thing occurs in hardware, where we can replace a hard drive, a graphics card, or other components and the system should still function in the same manner. There might be a bit of work, such as updating a device driver, but the core system should still work.

This is also present in the real world. I can replace the wheels and tires on my car, since as long as I have the same pattern for the bolts to attach to the axle, things still work. Electrical systems work this way, allowing any device that has the correct plugs and uses the expected voltage to interface with an outlet. The examples of abstraction are numerous, and the more we use abstraction, the more flexible our systems can be. Where we haven’t abstracted away details, it becomes complex and expensive to change part of a system.

In a database setting, we want to use abstraction where possible. The use of views or stored procedures (or functions) allow the underlying table implementations to change without an application being too tightly coupled to the structure. This isn’t always well adhered to, despite the well known practice of building a data access layer into an application. Too often developers want to tightly couple their application to the underlying table structure.

But, how abstracted should you be? Certainly I’d hope that your application had the ability to easily change connection settings, and these days I’d hope you actually had two: a read/write connection and a read-only connection. What about the database name? Should that be abstracted away? Again, I’d hope so, even in a multi-database application. If for no other reason than to simplify development by allowing the database name to change on a development server. Certainly security objects, especially encryption mechanisms, need some abstraction to prevent the requirement that they exist in non-secure environments.

Are there other abstractions you’d want to see widely implemented? I wonder what other examples might be important to developers or DBAs out there. I know that allowing abstractions also brings complexity, and the ability to change those values between environments is critical. This could be performed with injection of different parameters as software changes are deployed, but the mechanisms for doing this are still immature and not standardized.

There are plenty of other places we can abstract away implementations, but without mentioning too many, I want to know what you’d like to see. What abstractions would you want, or do you implement in your systems?

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.7MB) podcast or subscribe to the feed at iTunes and LibSyn.

The Data Dewey Decimal System

As a small child, I learned about the Dewey Decimal System in school. Those were the days when you wanted to look up some information or check a fact, you needed to peruse the dead tree books in a library, with information organized by the Dewey Decimal System in the US. At least, all libraries I used in school adhered to this.

These days we usually use a computer of some sort for learning, research, or really most any work with data. Often I start with Google to find my way to the source of information, but that’s not something that necessarily works well with finding sets of data. It certainly doesn’t work well within a an organization.

I saw recently that Microsoft announced the general availability of the Azure Data Catalog, which is designed to provide a catalog of data sets. In essence the Data Catalog is an index of the data sets that might be produced by your organization, with the information about the data filled in by the producer of data. Users that are looking for data can query the catalog instead of asking coworkers, wandering through the enterprise databases, or even relying on their own memory of where data might be located.

At first this seems silly, after all, don’t people inside of an organization know where data is kept? Don’t they learn the servers, databases, and connection methods? Certainly many do, but as with the pace of change these days, as well as the rapidly growing number of ways to publish data these days, it’s entirely possible that many people aren’t aware of all the data sources available inside of an organization. Even at Redgate Software, with a few hundred employees, it is fairly difficult to keep track of what data exists in which location.

The functionality of the Data Catalog seems a bit basic, and really almost like an extension of adding extended properties to various tables. Certainly things are centralized here, which is good. There are also ways to add other sources, such as SSRS reports, files, and even other relational sources. I’ll have to experiment a bit and see what’s available, and I might encourage you to do the same. The product relies on crowdsourcing, which can go really well, or really poorly, depending on how cooperative your crowd is.

In any case, I do like the idea of having a central catalog that individuals can update as they produce data sources for others to consume and change what’s available. If it works well, with good searching and tagging, it might eliminate some of the redundant work often performed to surface data inside of any organization and let employees know how to find the answers to their questions.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.9MB) podcast or subscribe to the feed at iTunes and LibSyn. feed

The End of 2015

I’m off on vacation, so I won’t be able to respond, but I decided to take a few minutes and look back at 2015 before leaving for the year.

This was a year that seems to have been marked by lots of data loss and breaches. Certainly these were on my mind as I worked on the SQL in the City keynote, but as I’ve been following Troy Hunt, it seems that every week or two he’s uploading more data to HaveIBeenPwned.com. We had some larger ones, with tens or hundreds of millions of account records released. Despite all the press from the Sony hack over a year ago, it seems that few companies have bothered to update their security. In fact, it seems that until they’re hacked, no one bothers to fix fundamental security issues in their systems. Perhaps some companies are doing so in the background, and aren’t being attacked, but I’m not so sure.

We didn’t have any SQL Server releases this year, but we had lots of news, and an unexpected Service Pack. Service Pack 3 for SQL Server 2012 appeared out of the blue for me. I hadn’t seen much news on it, and had given up on bothering Microsoft about new SPs for versions. Perhaps they’ll continue to build these yearly for supported versions, which is what I hope, but we will see. It does seem that Cumulative Updates have been appearing regularly, with relatively few issues in them, but I’m still wary of using those as the main patching mechanism for critical servers.

We did have a lot of growth in the SQL Server space, with many features being announced and starting to take shape. If you’re looking to get up to speed, check out our collection of Learning SQL Server 2016 topics, where we try to keep a growing collection of links to help you learn.  I am excited to see some of the growth of SQL Server to include newer features that people want in their data platform. I’m also glad that things like Stretch Database can be used to help manage the ever growing amount of data we have. Of course, encryption is big on my list, and Always Encrypted is something I am hoping gets lots of adoption.

We’ve also seen Microsoft really pushing the envelope in terms of data analysis. There is a constant set of articles and blogs written about data scientists, and some of us are moving to learn more about how to better analyze data. Microsoft continues to help, with their forays into Machine Learning, the expansion of the Power set of products (Power BI, Power Query, Power View, Power Pivot, etc.), R language integration, and more. I suspect that more and more of us will get the chance to play with some interesting data analysis tools if we want to. Even if you don’t use those to help your business, I have already seen these tools being used to perform index analysis, DBA monitoring, and more. I’d urge you to let your imagination run wild and see what things you might build.

It does also seem that more companies are starting to realize that their data and its integrity and management are important. Data professionals are becoming more valued, but the skills required are growing. There are more and more services and tools to help us manage systems that I do think the simple DBA that administers backups and security is really on the decline. At some point, our employers will demand more.

It’s been a good year, and I look forward to 2016. If there are things you want us to help you learn about, leave a comment here, and I’ll review them when I get back from vacation. Have a safe, Happy New Year, and I’ll see you in 2016.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 3.9MB) podcast or subscribe to the feed at iTunes and LibSyn.

An Out of Date CAP

I was reading this piece from Martin Kleppmann asking us to stop using the CAP theorem to talk about distributed systems. It’s of interest to me because we often see this applied to databases, and it seems I’ve seen database platforms described as fitting along one of the sides of the CAP triangle.

However the complaint is good, although tough, reading. As I read through it, a number of the concepts and tradeoffs, and concerns are similar to what I see discussed in Azure systems, or any “cloud-like” distributed systems. We expect consistency and availability, as well as scaling through partitions. However, it seems that we are always trading consistency (what Keppmann notes is really linearizability) and availability in the data world. We simply can’t guarantee both of these things in a data world.

Or can we? At some point, perfection isn’t required, and we can tolerate some level of inconsistency in the results clients might receive. I’m sure many of you have dealt with this in the past, and perhaps even gone so far as to add the report execution time to outputs to reduce client complaints. If two sets of data are compiled at different times, most clients tend to understand some discrepancies.

Certainly many of us are starting to consider using database platforms that might not work like SQL Server. In those cases, availability and scale is often touted as the reason to abandon a RDBMS. On one hand, the lack of linearizability across nodes is often tolerable, and many of our businesses aren’t affected dramatically by them. On the other hand, if you can tolerate some delays in nodes, than perhaps SQL Server can work for you with multiple nodes in a replicated or AlwaysOn scenario.

Distributed systems are hard, and certainly your decision shouldn’t be a simple one when it comes to choosing an architecture. However it also helps to think deeply about the problem in terms of not only the CAP theorem, but also the in terms of practical measures such as latency and practical failures.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 4.6MB) podcast or subscribe to the feed at iTunes and LibSyn.

Serious Hacking

There’s a piece that calls the US Office of Personal Management (OPM) data breach the biggest government hack ever. It might be, but give us a few months and I’m sure something will be bigger. After all, we constantly acquire more data, so the next breech is likely to contain more information. I’m also not sure most of us are actually getting much better at security.

There were a few notes about this that would apply to every company I’ve worked in. Such as the OPM not having a comprehensive list of devices and databases. I’m not sure any company does, and having worked with people that run SCOM-type systems, it’s a hard problem to solve. This doesn’t even cover the problems of Excel. Access, and soon, PowerBI data being scattered across systems.

However there was one problem I think we could fundamentally improve in most companies. The article noted that OPM didn’t have control over how it’s systems were configured, meaning an attacker could reconfigure things. Far, far too many companies allow a (too) large group of people to deploy changes to servers. Even when larger companies limit rights for developers, I’ve too often seen operations staff log in and allow developers to change systems to get them working.

As an industry, we really need to solidify and build better systems for ensuring the security of our hardware and software and preventing, or detecting, unauthorized changes. Certainly there will always be social engineering and other techniques that bypass security, but we should be able to prevent malicious changes to systems with solid architectures from our vendors/FOSS developers. We should also decide upon, and be sure, that our staff learn, understand, and follow, best practices.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.4MB) podcast or subscribe to the feed at iTunes and LibSyn.

The New DBA

What does DBA mean? It’s traditionally meant Database Administrator for technologists. However it’s also been a synonym for a database developer, or even anyone that has raised privileges on a database system. The term has been expanded to people designing databases, and usually includes the BI developers or analysts that work with database related technologies.

In short, it’s been a vague term that means different things to different people. However as I look forward, I think there’s another interpretation I have. The people functioning as DBAs, are the Data Bridges to Applications. We take the data and find ways to ensure it’s useful to applications. We make sure it’s available. We connect systems together. We do the dirty work of cleaning data. We secure the bridge.

We really are the person that needs to answer any question about the database. Throughout my career, when a company has decided to use DBAs, I have found the DBA is usually the catch-all person that really needs to be able to do whatever is asked of them that might tangentially relate to a database.

While there have always been DBAs that tightly controlled their systems and acted as impediments to change, I think those positions are disappearing fast. More and more companies, if they even bother to hire DBAs, want those individuals to ensure data is useful. Whether that’s administering, developing, configuring, or anything else related to data, whether the data is in an RDBMS, Access, Excel, the cloud, NoSQL, any anywhere else, we need to be able to provide some solution to gets data to an application.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.1MB) podcast or subscribe to the feed at iTunes and LibSyn.

The Rename Game

As I’ve been talking about database development, Continuous Integration, and Continuous Deployment, I keep hearing complaints and concerns about the problems that occur when we make database changes. In particular, I often hear about breaking changes made to databases, that require some manual scripting work.

It seems that the most common breaking change is adding a not null column to a table with existing data, but certainly there are other problematic issues. The other common change is renaming objects. In the past, I’ve rarely renamed anything once I’ve deployed tables to production, but I’ve run into quite a few developers that note they regularly change names. This week I’m wondering how common this is for most of the data professionals out there.

How often do you rename objects?

It could be tables, views, columns, constraints, whatever. If you rename anything in your database, I’m curious if it’s a regular event. Is this an event that causes you issues with your software development process, let us know. I’d also be curious with how you manage tracking and deploying the code from development through to production when you are making these types of breaking changes. Are things still scripted and able to be automated or do you force manual steps.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 1.9MB) podcast or subscribe to the feed at iTunes and LibSyn.

When Should We Be Cleaning Data?

I was reading Grant Fritchey’s update from the Business Analyst Conference  and noticed this quote: ” There were lots of sessions on the need to clean data, which implies that we need to do a better job at data collection in order to support this wild new world.” Grant was talking about the fact that many of the sessions seemed to imply that it’s mostly the processes around data that cause issues with data, not the people.

However is that really what we should do? Do we want to do more and more cleansing at the point where people enter data? I’m not so sure that’s the case. The more that I deal with various applications and forms, the less I want to see too many fields and values required. Certainly there are pieces of data that are necessary for action, but I often find there are lots of additional fields that analysts want, but are more of a burden than a necessity.

Most of us as data professionals design tables to handle the needs of an application. We put fields in tables that we expect to fill with data, based on a specification. However the real world is messy, and often the data we want to collect and store isn’t always available. The question is, do we force failures in the application or do we handle missing data?

I don’t want to eliminate DRI, or go to an EAV models for lots of data. However I do think that we need to ensure we allow defaults and work with developers to allow data in our systems that might not be complete now, but perhaps will be in the future. We should work with analysts to help them build reports that can handle defaults or missing fields. We can work with developers to allow applications to request updates to data later and then design ETL that can efficiently fill in the updated information.

Applications and software need to be flexible to work with problematic data. We have the ability, as data professionals, to help our clients still find meaning in data that might not be as complete as we’d prefer, or they’d prefer. However we can still find valuable information in what data they have..

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.5MB) podcast or subscribe to the feed at iTunes and LibSyn.

QA for Data

Here’s a typical DBA scenario. You’re at your desk, working on something, when you get a trouble ticket, phone call, or tap on your shoulder. There’s a problem in production and an order, or some other important piece of data, is incorrect. Can you fix it?

Many of us are happy to oblige. Often we know what needs to be changed and can write the INSERT/UPDATE/DELETE statement to correct the issue. Even if we have to dig in further, we can usually find the data in a table and determine what should be changed.

However, what do we do then?

If you are asked the make data changes to production, do you QA the script? Do you save the scripts in a VCS or a log? Does anyone else examine the data or code? Do you scan the script yourself, have SQL Server parse it, or bother to look for triggers that might fire? Do you bother to put the script in a transaction with error handling?

I have to say that in most of my career, I didn’t bother with any of that. I’d examine a script, perhaps have another DBA or developer look at it, but most of the time I’d run it in production if I thought it was ready. I did often wrap the code in a transaction, with a query to check the results, and an immediate rollback to prevent blocking. However if I thought the results looked good, I’d rerun the query without the transaction. Most of the time that worked well. A few triggers and fat fingers caused issues that I had to fix again, sometimes after quite a bit of work to correct my mistakes.

It’s hard to prevent data change errors, especially if it’s something we do often, or we work in a small environment with limited staff. Ultimately, we might not even care because if we update the data incorrectly, we can run another update to fix the new issue. That’s not necessarily what most managers would want to hear, but it’s reality.

The one thing I have insisted upon in my environments was logging any queries run. A simple cut and paste with a timestamp. That way if things were done incorrectly, at least I know what was done.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.6MB) podcast or subscribe to the feed at iTunes and LibSyn.

Deploy the Database First

One of the patterns I’ve seen in some environments is people are trying to deploy changes rapidly to their database backed applications. That’s not news, but what is interesting is some of them are staging the deployment of the database changes first. Not as in I deploy database changes at 8:00pm and then application changes at 8:30pm. These people try to deploy the database changes on Monday, and their application changes will follow on Tuesday, Wednesday, or even a month later.

It’s an interesting concept, though I think it requires a lot of forethought in your designs, as well as very tight coding from your front ends that won’t be disturbed by extra columns in a result set.  That’s not easy to do, but it’s certainly possible, and it can even be useful if you deploy feature flags extensively in your application.

As we become more dependent on databases for our applications, and our customers expect systems to be running constantly, I think it behooves us to find ways to make alter and enhance our applications without downtime. While there are patterns to keep applications running when the database goes down, I expect that the reality is that we need to find ways keep the database up as we alter it, which for me means making non-breaking changes.

I think it’s possible in many cases to upgrade a database over time by carefully planning your schema changes and accounting for those changes in your front end architecture. However it’s not easy, as I mentioned, and you do need to commit to very stable and careful programming practices for your developers.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 1.9MB) podcast or subscribe to the feed at iTunes and LibSyn.