Where’s the Unit Testing?

I’ve been a proponent of unit testing, especially for databases. I’ve given presentations on the topic and advocate using techniques to verify that your code works, especially over time as complexity grows and new developers change code and potentially introduce regressions. I’m not the only one; I recently saw a question from Ben Taylor asking where unit testing has gone.

I was disappointed that few people have responded to the piece, and I think this is the same response that unit testing in front-end application software received a decade or two ago. Few people saw value in testing, preferring to assume developers would code well. Over time, and with some investment, quite a few people have come to see the value of unit testing, though I’m not sure it’s the majority yet. In building database software, we’re still woefully behind, preferring ad hoc tests that are subject to human frailty (forgetfulness, mistakes in running tests, or not examining results closely).

I do know a few people who are customers of Redgate and use unit testing extensively in their database code. They definitely spend a lot of effort building unit tests, often having more test code than feature code, but they also have very low rates of complaints and bugs from users. I hope more of those having success will publish details about their unit testing successes and failures, and I’d welcome more pieces at SQLServerCentral on either side of the issue.
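For anyone who hasn’t seen what a repeatable database test looks like, here’s a minimal sketch using the open-source tSQLt framework. The objects under test (dbo.Orders and dbo.GetOrderTotal) are hypothetical names, invented purely for illustration.

```sql
-- Minimal tSQLt sketch. dbo.Orders and dbo.GetOrderTotal are hypothetical
-- objects used only to illustrate the shape of a database unit test.
EXEC tSQLt.NewTestClass 'OrderTests';
GO
CREATE PROCEDURE OrderTests.[test GetOrderTotal sums the line amounts]
AS
BEGIN
    -- Isolate the test from real data by swapping in an empty fake table.
    EXEC tSQLt.FakeTable 'dbo.Orders';

    INSERT INTO dbo.Orders (OrderID, Amount)
    VALUES (1, 10.00), (1, 15.50);

    DECLARE @actual DECIMAL(10, 2) = dbo.GetOrderTotal(1);

    EXEC tSQLt.AssertEquals @Expected = 25.50, @Actual = @actual;
END;
GO
-- Run every test in the class; the same call works from a CI build.
EXEC tSQLt.Run 'OrderTests';
```

Because the test creates its own data, it runs the same way on every developer machine and on a build server, which is exactly the repeatability that ad hoc testing lacks.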

For many people writing in-house applications, especially those installed in one location, perhaps a few bugs aren’t a problem. Maybe the impact is low enough that training developers to write tests and making the investment isn’t worthwhile. However, for those who have disparate external clients, or install software in many locations, I bet that moving to a more thorough set of repeatable, reliable, non-trivial tests will improve your software quality.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (3.2 MB) podcast or subscribe to the feed at iTunes and Mevio.

Going On-Prem?

When I started in this business, every company owned its own servers and had its own data center. Now, some of those data centers were closets (including the one for a state government Senate). Some were offices converted to computer rooms, rarely with separate air conditioning until I requested it. Some were actual data center rooms, though usually at larger companies. The idea of putting your computing resources off-site was seen as ludicrous.

As time passed, the idea of using a co-location facility grew. In fact, SQLServerCentral went from a couple of machines in a residential basement to a locked rack at a facility in Denver. Many companies started to use third-party data centers, and in the late 90s and early 2000s this was common, with facilities springing up. Full-time employees often racked equipment and visited the data center when they needed physical access.

This changed over time as virtualization grew, and more and more companies started to rent a VM rather than purchase a physical box. Some still used physical machines, but they rented them, allowing a third party to set up and mount the equipment, sometimes even installing the OS, and then providing access to the lessee. We’ve now come to the place where we have cloud services that are often just platforms or services, where we have no idea what the underlying equipment is. In the case of things like Salesforce, Azure SQL Database, or Amazon’s RDS, we don’t even care; we just ask for a certain level of performance.

Is this the future? Will “do we want to go on-prem?” now be the default question we ask ourselves? I ran across the linked blog, written from a SaaS perspective, which argues for always getting customers to use the service (or platform) and never having a local install. I know Microsoft would like to do this, especially for Azure databases, as they make more profit on compute services.

Some of you think this is crazy. In fact, I know plenty of industries that struggle with this from a regulatory standpoint. However, I’d point out that the idea of letting someone else run your email system was seen as crazy 20 years ago. Today many people (myself included) would never think to install Exchange or any other email software. We would always purchase this service from someone else.

Tim Mitchell had a podcast interview recently talking about whether the on-premises data warehouse is dead. While I think the Azure Data Warehouse has some great advantages and is worth considering, I’m not sure the local DW is dead. In fact, for lots of tasks I perform, including software Continuous Integration builds (.NET or database), I prefer to have some local resource doing the work, especially when I can’t predict the number of times I’ll build and the task is easy to scale on local machines.

I don’t think we are at the point where we’d put more databases in a cloud service than on premises, but I do think that for new applications, it’s a valid question to ask whether the database and application could be hosted in the cloud.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (4.7 MB) podcast or subscribe to the feed at iTunes and Mevio.

Abstraction

One of the core tenets of good software design is to abstract away details in any particular part of an application. We want to use well defined and constructed interfaces so that the implementation of any particular method or section can change without affecting the way the system works. This allows for improvement and upgrade over time. The same thing occurs in hardware, where we can replace a hard drive, a graphics card, or other components and the system should still function in the same manner. There might be a bit of work, such as updating a device driver, but the core system should still work.

This is also present in the real world. I can replace the wheels and tires on my car; as long as the bolt pattern matches the axle, things still work. Electrical systems work this way too, allowing any device that has the correct plug and uses the expected voltage to interface with an outlet. The examples of abstraction are numerous, and the more we use abstraction, the more flexible our systems can be. Where we haven’t abstracted away details, it becomes complex and expensive to change part of a system.

In a database setting, we want to use abstraction where possible. The use of views or stored procedures (or functions) allows the underlying table implementations to change without the application being too tightly coupled to the structure. This isn’t always well adhered to, despite the well-known practice of building a data access layer into an application. Too often developers want to tightly couple their application to the underlying table structure.
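As a simple sketch of what that looks like (the table and column names here are hypothetical), a view can preserve the shape an application depends on even after the underlying storage is refactored:

```sql
-- Hypothetical example: the application queries dbo.CustomerContact.
-- The storage was later split into two tables, but the view keeps the
-- original contract, so the application code does not need to change.
CREATE VIEW dbo.CustomerContact
AS
SELECT c.CustomerID,
       c.CustomerName,
       e.EmailAddress
FROM dbo.Customer AS c
JOIN dbo.CustomerEmail AS e
    ON e.CustomerID = c.CustomerID;
```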

But how abstracted should you be? Certainly I’d hope that your application had the ability to easily change connection settings, and these days I’d hope you actually had two: a read/write connection and a read-only connection. What about the database name? Should that be abstracted away? Again, I’d hope so, even in a multi-database application, if for no other reason than to simplify development by allowing the database name to change on a development server. Certainly security objects, especially encryption mechanisms, need some abstraction so they aren’t required to exist in non-secure environments.
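One way to abstract a database name in T-SQL is a synonym, which can be repointed per environment without touching any calling code. This is just a sketch, and the database and table names (ReportingDB, dbo.SalesHistory) are hypothetical:

```sql
-- Hypothetical names. In development the synonym points at a local copy;
-- in production it points at the real reporting database. Code that
-- references dbo.SalesHistory never needs to know which one it is using.
CREATE SYNONYM dbo.SalesHistory
    FOR ReportingDB_Dev.dbo.SalesHistory;

-- At deployment to production, only the synonym changes:
-- DROP SYNONYM dbo.SalesHistory;
-- CREATE SYNONYM dbo.SalesHistory FOR ReportingDB.dbo.SalesHistory;
```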

Are there other abstractions you’d want to see widely implemented? I wonder what other examples might be important to developers or DBAs out there. I know that abstraction also brings complexity, and the ability to change those values between environments is critical. This could be handled by injecting different parameters as software changes are deployed, but the mechanisms for doing so are still immature and not standardized.

There are plenty of other places we can abstract away implementations, but without mentioning too many, I want to know what you’d like to see. What abstractions would you want, or do you implement in your systems?

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.7 MB) podcast or subscribe to the feed at iTunes and LibSyn.

The Data Dewey Decimal System

As a small child, I learned about the Dewey Decimal System in school. Those were the days when, if you wanted to look up some information or check a fact, you needed to peruse the dead-tree books in a library, with information organized in the US by the Dewey Decimal System. At least, all the libraries I used in school adhered to it.

These days we usually use a computer of some sort for learning, research, or really most any work with data. Often I start with Google to find my way to a source of information, but that doesn’t necessarily work well for finding sets of data. It certainly doesn’t work well within an organization.

I saw recently that Microsoft announced the general availability of the Azure Data Catalog, which is designed to provide a catalog of data sets. In essence, the Data Catalog is an index of the data sets that might be produced by your organization, with the information about each data set filled in by its producer. Users who are looking for data can query the catalog instead of asking coworkers, wandering through the enterprise databases, or relying on their own memory of where data might be located.

At first this seems silly; after all, don’t people inside an organization know where data is kept? Don’t they learn the servers, databases, and connection methods? Certainly many do, but with the pace of change and the rapidly growing number of ways to publish data these days, it’s entirely possible that many people aren’t aware of all the data sources available inside an organization. Even at Redgate Software, with a few hundred employees, it is fairly difficult to keep track of what data exists in which location.

The functionality of the Data Catalog seems a bit basic, and really almost like an extension of adding extended properties to various tables. Certainly things are centralized here, which is good. There are also ways to add other sources, such as SSRS reports, files, and even other relational sources. I’ll have to experiment a bit and see what’s available, and I might encourage you to do the same. The product relies on crowdsourcing, which can go really well, or really poorly, depending on how cooperative your crowd is.
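For comparison, here is what that older, decentralized approach looks like; the table name and description are hypothetical. The Data Catalog essentially centralizes this kind of metadata and makes it searchable:

```sql
-- Hypothetical example: documenting a data set with an extended property,
-- the long-standing but scattered way to describe tables in SQL Server.
EXEC sys.sp_addextendedproperty
    @name       = N'Description',
    @value      = N'Daily sales aggregated by region; refreshed nightly by the ETL job.',
    @level0type = N'SCHEMA', @level0name = N'dbo',
    @level1type = N'TABLE',  @level1name = N'DailyRegionalSales';
```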

In any case, I do like the idea of having a central catalog that individuals can update as they produce data sources for others to consume, or as they change what’s available. If it works well, with good searching and tagging, it might eliminate some of the redundant work often performed to surface data inside an organization and help employees find the answers to their questions.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.9 MB) podcast or subscribe to the feed at iTunes and LibSyn.

The End of 2015

I’m off on vacation, so I won’t be able to respond, but I decided to take a few minutes and look back at 2015 before leaving for the year.

This was a year that seems to have been marked by lots of data loss and breaches. Certainly these were on my mind as I worked on the SQL in the City keynote, but as I’ve been following Troy Hunt, it seems that every week or two he’s uploading more data to HaveIBeenPwned.com. We had some larger ones, with tens or hundreds of millions of account records released. Despite all the press from the Sony hack over a year ago, it seems that few companies have bothered to update their security. In fact, it seems that until they’re hacked, no one bothers to fix fundamental security issues in their systems. Perhaps some companies are doing so in the background, and aren’t being attacked, but I’m not so sure.

We didn’t have any SQL Server releases this year, but we had lots of news, and an unexpected Service Pack. Service Pack 3 for SQL Server 2012 appeared out of the blue for me. I hadn’t seen much news on it, and had given up on bothering Microsoft about new SPs for versions. Perhaps they’ll continue to build these yearly for supported versions, which is what I hope, but we will see. It does seem that Cumulative Updates have been appearing regularly, with relatively few issues in them, but I’m still wary of using those as the main patching mechanism for critical servers.

We did have a lot of growth in the SQL Server space, with many features being announced and starting to take shape. If you’re looking to get up to speed, check out our collection of Learning SQL Server 2016 topics, where we try to maintain a growing set of links to help you learn. I am excited to see SQL Server growing to include newer features that people want in their data platform. I’m also glad that things like Stretch Database can be used to help manage the ever-growing amount of data we have. Of course, encryption is big on my list, and Always Encrypted is something I hope gets lots of adoption.

We’ve also seen Microsoft really pushing the envelope in terms of data analysis. There is a steady stream of articles and blog posts about data scientists, and some of us are moving to learn more about how to better analyze data. Microsoft continues to help, with its forays into Machine Learning, the expansion of the Power set of products (Power BI, Power Query, Power View, Power Pivot, etc.), R language integration, and more. I suspect that more and more of us will get the chance to play with some interesting data analysis tools if we want to. Even if you don’t use them to help your business, I have already seen these tools used for index analysis, DBA monitoring, and more. I’d urge you to let your imagination run wild and see what you might build.

It does also seem that more companies are starting to realize that their data, and its integrity and management, are important. Data professionals are becoming more valued, but the skills required are growing. There are so many services and tools to help us manage systems that I think the simple DBA who only administers backups and security is really on the decline. At some point, our employers will demand more.

It’s been a good year, and I look forward to 2016. If there are things you want us to help you learn about, leave a comment here, and I’ll review them when I get back from vacation. Have a safe, Happy New Year, and I’ll see you in 2016.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (3.9 MB) podcast or subscribe to the feed at iTunes and LibSyn.

An Out of Date CAP

I was reading this piece from Martin Kleppmann asking us to stop using the CAP theorem to talk about distributed systems. It’s of interest to me because we often see this applied to databases, and it seems I’ve seen database platforms described as fitting along one of the sides of the CAP triangle.

However, the complaint makes for good, although tough, reading. As I read through it, a number of the concepts, tradeoffs, and concerns are similar to what I see discussed in Azure systems, or any “cloud-like” distributed system. We expect consistency and availability, as well as scaling through partitioning. However, it seems that in the data world we are always trading off consistency (what Kleppmann notes is really linearizability) against availability. We simply can’t guarantee both.

Or can we? At some point, perfection isn’t required, and we can tolerate some level of inconsistency in the results clients might receive. I’m sure many of you have dealt with this in the past, and perhaps even gone so far as to add the report execution time to outputs to reduce client complaints. If two sets of data are compiled at different times, most clients tend to understand some discrepancies.
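As a small, hypothetical illustration of that technique, the report itself can carry the time its data was compiled, so two outputs that disagree explain themselves:

```sql
-- Hypothetical report query: stamp the result set with the time it was
-- generated so clients can see that two reports were compiled at
-- different moments, rather than suspecting the data is wrong.
SELECT r.Region,
       SUM(r.SalesAmount)  AS TotalSales,
       SYSDATETIMEOFFSET() AS ReportGeneratedAt
FROM dbo.RegionalSales AS r
GROUP BY r.Region;
```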

Certainly many of us are starting to consider database platforms that might not work like SQL Server. In those cases, availability and scale are often touted as the reason to abandon an RDBMS. On one hand, the lack of linearizability across nodes is often tolerable, and many of our businesses aren’t affected dramatically by it. On the other hand, if you can tolerate some delay between nodes, then perhaps SQL Server can work for you with multiple nodes in a replicated or AlwaysOn scenario.

Distributed systems are hard, and your decision certainly shouldn’t be a simple one when it comes to choosing an architecture. However, it also helps to think deeply about the problem not only in terms of the CAP theorem, but also in terms of practical measures such as latency and real-world failure modes.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (4.6 MB) podcast or subscribe to the feed at iTunes and LibSyn.

Serious Hacking

There’s a piece that calls the US Office of Personnel Management (OPM) data breach the biggest government hack ever. It might be, but give us a few months and I’m sure something will be bigger. After all, we constantly acquire more data, so the next breach is likely to contain more information. I’m also not sure most of us are actually getting much better at security.

There were a few notes about this that would apply to every company I’ve worked in, such as the OPM not having a comprehensive list of devices and databases. I’m not sure any company does, and having worked with people who run SCOM-type systems, I know it’s a hard problem to solve. This doesn’t even cover the problem of Excel, Access, and, soon, Power BI data being scattered across systems.

However, there was one problem I think we could fundamentally improve in most companies. The article noted that OPM didn’t have control over how its systems were configured, meaning an attacker could reconfigure things. Far, far too many companies allow a (too) large group of people to deploy changes to servers. Even when larger companies limit rights for developers, I’ve too often seen operations staff log in and let developers change systems to get them working.

As an industry, we really need to build better systems for ensuring the security of our hardware and software and for preventing, or at least detecting, unauthorized changes. Certainly there will always be social engineering and other techniques that bypass security, but with solid architectures from our vendors and FOSS developers we should be able to prevent malicious changes to systems. We should also decide on best practices and be sure that our staff learn, understand, and follow them.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.4 MB) podcast or subscribe to the feed at iTunes and LibSyn.

The New DBA

What does DBA mean? For technologists, it has traditionally meant Database Administrator. However, it’s also been a synonym for a database developer, or even anyone who has elevated privileges on a database system. The term has been expanded to include people designing databases, and usually covers the BI developers or analysts who work with database-related technologies.

In short, it’s been a vague term that means different things to different people. However, as I look forward, I have another interpretation: the people functioning as DBAs are the Data Bridges to Applications. We take the data and find ways to ensure it’s useful to applications. We make sure it’s available. We connect systems together. We do the dirty work of cleaning data. We secure the bridge.

We really are the people who need to answer any question about the database. Throughout my career, when a company has decided to use DBAs, I have found the DBA is usually the catch-all person who needs to be able to do whatever is asked of them that might even tangentially relate to a database.

While there have always been DBAs who tightly controlled their systems and acted as impediments to change, I think those positions are disappearing fast. More and more companies, if they even bother to hire DBAs, want those individuals to ensure data is useful. Whether that means administering, developing, configuring, or anything else related to data, and whether the data is in an RDBMS, Access, Excel, the cloud, NoSQL, or anywhere else, we need to be able to provide some solution that gets data to an application.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.1 MB) podcast or subscribe to the feed at iTunes and LibSyn.

The Rename Game

As I’ve been talking about database development, Continuous Integration, and Continuous Deployment, I keep hearing complaints and concerns about the problems that occur when we make database changes. In particular, I often hear about breaking changes made to databases that require some manual scripting work.

It seems that the most common breaking change is adding a NOT NULL column to a table with existing data, but certainly there are other problematic changes. Another common one is renaming objects. In the past, I’ve rarely renamed anything once I’ve deployed tables to production, but I’ve run into quite a few developers who note that they regularly change names. This week I’m wondering how common this is for most of the data professionals out there.

How often do you rename objects?

It could be tables, views, columns, constraints, whatever. If you rename anything in your database, I’m curious whether it’s a regular event. If it’s an event that causes issues with your software development process, let us know. I’d also be curious how you manage tracking and deploying the code from development through to production when you are making these types of breaking changes. Are things still scripted and able to be automated, or do you require manual steps?
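For those who do rename regularly, one way to keep the change scriptable rather than breaking (a sketch with hypothetical object names) is to pair the rename with a synonym that preserves the old name until every caller has been updated:

```sql
-- Hypothetical example: rename a table but keep the old name resolving,
-- so the change stays in a deployment script and callers migrate gradually.
EXEC sp_rename 'dbo.tblCust', 'Customer';        -- table is now dbo.Customer
GO
CREATE SYNONYM dbo.tblCust FOR dbo.Customer;     -- old name still works
GO
-- Once every caller references dbo.Customer, drop the shim:
-- DROP SYNONYM dbo.tblCust;
```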

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (1.9 MB) podcast or subscribe to the feed at iTunes and LibSyn.

When Should We Be Cleaning Data?

I was reading Grant Fritchey’s update from the Business Analyst Conference and noticed this quote: “There were lots of sessions on the need to clean data, which implies that we need to do a better job at data collection in order to support this wild new world.” Grant was talking about the fact that many of the sessions seemed to imply that it’s mostly the processes around data that cause issues with data, not the people.

However, is that really what we should do? Do we want to do more and more cleansing at the point where people enter data? I’m not so sure that’s the case. The more I deal with various applications and forms, the less I want to see too many fields and values required. Certainly there are pieces of data that are necessary for action, but I often find that many of the additional fields analysts want are more of a burden than a necessity.

Most of us as data professionals design tables to handle the needs of an application. We put fields in tables that we expect to fill with data, based on a specification. However the real world is messy, and often the data we want to collect and store isn’t always available. The question is, do we force failures in the application or do we handle missing data?

I don’t want to eliminate DRI, or go to an EAV model for lots of data. However, I do think we need to allow defaults and work with developers to accept data in our systems that might not be complete now but perhaps will be in the future. We should work with analysts to help them build reports that can handle defaults or missing fields. We can work with developers to allow applications to request updates to data later, and then design ETL that can efficiently fill in the updated information.
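A small sketch of that idea, with hypothetical table and column names: accept the row with a sensible default now, and let a later ETL pass supply the real value.

```sql
-- Hypothetical example: allow incomplete data at entry time with a default...
ALTER TABLE dbo.CustomerSurvey
    ADD IncomeBand VARCHAR(20) NOT NULL
        CONSTRAINT DF_CustomerSurvey_IncomeBand DEFAULT ('Unknown');
GO
-- ...then let a later ETL or correction pass fill in the real values.
UPDATE cs
SET cs.IncomeBand = s.IncomeBand
FROM dbo.CustomerSurvey AS cs
JOIN staging.SurveyUpdates AS s
    ON s.SurveyID = cs.SurveyID
WHERE cs.IncomeBand = 'Unknown';
```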

Applications and software need to be flexible enough to work with problematic data. As data professionals, we have the ability to help our clients find meaning in data that might not be as complete as we, or they, would prefer. We can still find valuable information in the data they have.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.5 MB) podcast or subscribe to the feed at iTunes and LibSyn.