There has been quite a debate over the last few years about using NoSQL versus SQL Server as data stores for different applications. We’ve seen a great deal of interest from developers in document stores, as the data structures are closer to what a C# or Java developer sees in their code. Graph databases are becoming popular for certain types of data as well, since their query performance can be extremely powerful in some cases. There are also the scalability aspects of NoSQL systems, touted as a better way to build Internet-based applications.
I was looking through the news this week and found a piece by Michael J. Swart that I really enjoyed. It talks about the different types of metrics he has used when monitoring and alerting on his database servers. Michael talks about the difference between alarms and gauges, which are usually based on the time scale in which a DBA or administrator needs to react to a situation. Alarms fire when something changes quickly in the short term and causes an issue. Gauges tend to track longer-term items that eventually cause a problem.
I’ve had a similar view in the past. However, I look at these types of metrics from the standpoint of how I tackle my job as a DBA. I also try to think about these issues as a developer, ensuring that telemetry helps track issues that develop over time and that acute issues cause a graceful degradation (or exit) rather than an unexpected one.
When I’ve needed to manage systems, I ensure that I have alarms set up. I need these for the situations where something goes wrong in a short period of time. However, I never want alarms to be the result of a lack of visibility into the system over time. I want to eliminate as many alarms as possible with better long-term monitoring that looks for indicators that the system might have issues. I’ve found that many (not all) of the alarms I receive could have been prevented if I’d been paying close attention.
In other words, I’m a proactive DBA, looking at gauges (to use Michael’s terms), not a reactive one counting on alarms.
I can’t predict everything, but I find that looking back at data is helpful. If I find a trend that could have alerted me, I’ll evolve my monitoring system to track those potential issues in the future. I’ve written my own custom metrics in the past, which track data, either PerfMon-type counters or business data, and then let me know when the growth, decline, or change is unexpected. In this way, I find I can often proactively prevent issues. If I can use my data to “predict” an issue my boss doesn’t yet believe in, I’ve also found my budget for resources may increase a bit, allowing me to manage systems even better in the future.
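A minimal sketch of the kind of custom-metric check described above; the metric values, function name, and 25% tolerance are all illustrative assumptions, not from the original:

```python
# Flag a metric sample as unexpected when it deviates sharply from the
# recent trend. The samples below are hypothetical PerfMon-style readings.
def is_unexpected(history, latest, tolerance=0.25):
    """Return True if `latest` differs from the average of `history`
    by more than `tolerance` (a fraction, e.g. 0.25 = 25%)."""
    if not history:
        return False
    baseline = sum(history) / len(history)
    if baseline == 0:
        return latest != 0
    return abs(latest - baseline) / abs(baseline) > tolerance

# Daily database size in GB over the past week, then today's reading.
sizes = [52.1, 52.4, 52.8, 53.0, 53.3, 53.5, 53.9]
print(is_unexpected(sizes, 54.2))   # steady growth: not flagged
print(is_unexpected(sizes, 70.0))   # sudden jump: worth investigating
```

A real system would compare against a rolling window or a seasonal baseline rather than a flat average, but the idea is the same: alert on the change, not just the absolute value.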
It wasn’t that long ago in my career that a survey showed the vast majority of SQL Server databases were < 10GB in size. These days I’m sure the average has risen, as there are some very, very large SQL Server databases. I know a few people managing PB-sized systems, which certainly will move the average up a bit. It’s likely, however, that the majority of databases are still very small, powering smaller applications with < 100GB.
However, almost every database continues to grow, and as systems grow, new challenges come into play. While we can still store a 100GB backup file on a thumb drive or SD card, it’s still a significant amount of data to write, which can take more time than our clients would like. Archiving, backups, and restores all tend to involve non-flash media, so larger systems certainly mean longer RTOs if we have issues, even if we’re just restoring a copy of a database to extract some deleted information.
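To make the time cost concrete, a back-of-envelope estimate; the 150 MB/s throughput figure is an assumption for spinning disk, not a number from the original:

```python
# Rough restore-time estimate: database size divided by sustained
# write throughput. 150 MB/s is an assumed spinning-disk figure;
# substitute your own measured rate.
def restore_minutes(size_gb, throughput_mb_s=150):
    return (size_gb * 1024) / throughput_mb_s / 60

print(round(restore_minutes(100), 1))   # ~11.4 minutes for 100 GB
print(round(restore_minutes(1024), 1))  # ~116.5 minutes for 1 TB
```

Even at an optimistic sustained rate, a terabyte-class restore runs to hours, which is why larger databases push RTOs out so quickly.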
However, there can be more fundamental problems. I ran across a post from the CSS Engineers at Microsoft that talks about exhausting NTFS resources as more and more writes occur on a filesystem. I’d never heard of this, or expected it to be an issue, but apparently it’s occurring more often. I suspect even the NTFS developers hadn’t considered the scales at which we stress the file system daily. What is interesting is that this can occur with DBCC operations, database snapshots, and even bcp operations. There are workarounds, and certainly I’m sure that Microsoft has people working hard to increase the capacities of ReFS, which I expect will take over from NTFS, just as NTFS has largely displaced FAT.
As we grow to larger and larger scales of data sizes, with multi-terabytes on laptops and petabytes on servers, I think we’ll need to adapt a number of our techniques and operations in the future. We might end up needing different types of backups (anyone considering lookup table backups or daily backups of a set of rows, like orders?), as well as the possibility of continuing to spread the load of our server across multiple drives. I just hope that Microsoft recognizes this and looks to ensure mount points work smoothly across all of their technologies and features. At some point, 26 drive letters won’t be enough for the majority of us.
How often have you been caught off guard by some issue with one of your databases? I would guess most people have received a phone call, email, or trouble ticket that alerts them to the fact that something has gone wrong. A few of you might have even had an irate customer or manager wander over to your desk and loudly inform you of an issue. Those are not great times, and most of us would prefer to find out about problems before others bring them to our attention.
This past week was the 66th T-SQL Tuesday blog party. If you haven’t heard of it, I would encourage you to look each month for the announcement on Twitter and participate. It’s a great chance to share some knowledge, and showcase your own experiences or viewpoint on a particular topic. I’d actually encourage all of you to go back and write posts about the past topics at some point. It’s a great way to challenge yourself to document your own experiences in a concrete way.
The theme this month was monitoring (thanks to Catherine Wilhelmsen b | t), and this is exactly the way that you build awareness in your environment. Monitoring allows you to ensure that you, as the DBA, learn about issues, or potential issues, first. Monitoring doesn’t solve your problems; that’s up to you to do, but it can help you proactively detect issues early and perhaps mitigate them before users notice them.
The blog entries this month cover all sorts of issues you might think about watching for: the 825 errors in your log, metadata about your backups, jobs, and more. There are a variety of posts covering all sorts of home-grown systems, as well as noting that third-party products work well. You have a huge variety of choices, from building your own tools and processes to purchasing something like SQL Monitor from Red Gate. However, no matter your budget, resources, time, or desire, I’d highly recommend that you really think about monitoring your systems in a way that lets you know how things are working.
What to do when they’re working poorly, that’s up to you. Hopefully the information we publish at SQLServerCentral, and the links we send out every week from Database Weekly, can help you learn to be better at your job.
One of the problems I’ve seen for years in computing is people seeing the forest or the trees. Not that either is good or bad, but far too often I’ve run across people that only see one of these, meaning they examine problems from a high level perspective, or a very low level perspective. While either of those can help you solve your issues, there will be a problem or a time when you need to examine the situation from another perspective. Knowing when to look at a higher or lower level can be invaluable in troubleshooting.
This past week I ran across a great post from Grant Fritchey called Understand the True Source of Problems. In this post, Grant talks about some strange DBA “best practices” that were implemented to solve problems, but aren’t necessarily the best of anything. It’s an interesting list, and you might read it for entertainment, though I hope none of you are following these ideals at your place of employment.
However, the thing I liked about this post is that Grant opens with an old doctor joke, which is still funny to me, but which is often the approach taken by people trying to solve problems. They just avoid the issue and never solve it. That can work in the short term, but I’ve seen this build up over time to the point where people are afraid to make any changes to a system or code for fear of causing some unknown issue.
I don’t think that every issue needs to be solved down to the root cause. There are plenty of times something goes sideways on one of my computers and a reboot fixes the issue. I’m fine with that, but if it occurs more than two or three times, I do need to solve the problem. I need to be able to dig deeper to diagnose what resources are failing, what performance issue exists, or maybe what line of code is broken. Perhaps I need to step back and examine whether some other part of my system, like the disk or network, is causing an issue that appears to be isolated to code. In other words, I need to be able to look at the wider environment (the forest) or dive into the particulars of a process (the trees).
Knowing how to strike this balance is a bit of an art, and it’s what really makes the best problem solvers experts. I can’t teach you that part of computing, but I can teach you this. When you don’t think you’ve completely solved the issue, stop for a minute and take a wider or narrower view. Or maybe both. Look for what you might not have thought of. If you’re like me, there are probably plenty of possibilities that come to mind when you stop and re-examine the issue. Then work your way through those items, discarding the ones that aren’t relevant. Sometimes opening your mind a bit is the best way to find the exact cause of an issue.
I heard a new phrase this week: the data lake. It comes from a Radar Podcast episode where Edd Dumbill talks about data as an asset that should be exposed everywhere in an organization. There’s a blog post on the subject as well.
The idea is somewhat centered on Hadoop, but it could apply elsewhere. Data comes into an organization and tends to flow toward and be stored with other data in a large lake. Applications are just a way of accessing the data in the lake, but all the data really lives in a large Hadoop lake of information. In some sense, this isn’t far from the “single view of the truth” that I’ve seen plenty of organizations attempt. In a relational world this means all data moves from OLTP systems to a large data warehouse, and is then moved to smaller data marts (really warehouse subsets) and accessed from there.
It’s a good idea in theory, but in practice, trying to move data between users with any velocity, with all the copying, cleaning, transforming, and more that’s involved, doesn’t work. OLTP systems are needed because there are transactional actions that must be completed quickly and accurately. Moving this data to other systems becomes harder as data volumes and the sheer number of clients (whether users or other systems) increase. The idea that we can keep all of our processes working quickly enough that users won’t get frustrated is likely a dream. The more value a set of data holds, the more users will access it. The more accesses, the slower it often becomes, which starts a cycle of smaller subsets of data and applications that subsist on those small data puddles.
Excel is probably the most common example of a data puddle that exists in your organization. A set of data, perhaps out of date, but useful enough to make decisions based on. Infinitely flexible and convenient enough that updates, changes, and more often spawn more and more puddles where the information never gets transferred back to the large lake of a database, whether that’s an RDBMS, Hadoop clusters, or something else.
I think the idea of a large data lake is great, but in a practical sense, much of an organization’s data will never live in the lake. If it does, it will most likely be data that’s been superseded by information in a puddle somewhere on an employee’s laptop, tablet, or personal cloud.
This week I ran across a post from Glenn Berry on hardware performance that I thought framed the concept of performance value well. In the post, Glenn talks about the latest TPC-E benchmark and the SQL Server performance values that vendors release. Glenn mentions that he always tries to divide out the numbers to provide better comparisons for the various hardware choices.
That makes a lot of sense to me, as I think very few of us could afford the top-of-the-line systems on which vendors run the benchmarks. Most of us try to compare the results in some way and then make our own decisions for our smaller systems. I don’t know many people that run 36-core machines, but I do know lots that need to decide which 4- or 8-core systems they should choose.
The idea of normalizing performance to smaller loads is something we do often. We need to do this, because we often can’t get the same size, scale, or specifications in our test systems as we have in production. As much as we’d like to have them, resources are limited, and we do need to have some way of extrapolating the results in our development and test systems forward to production.
Glenn has a way of doing this for CPUs, and while you might not agree with his method, at least he has an organized way of doing things, letting empirical results provide feedback on whether it works well. You should do the same thing, whether you’re trying to gauge disk or T-SQL speed. Develop a hypothesis (or read about how others do so) for measuring your performance on a lesser system, and then on your primary system. Take time to run some tests the same way, even if it’s single-query performance on a production system while it’s live.
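Glenn’s approach of dividing out the numbers can be sketched as follows; the system names, scores, and core counts below are invented for illustration, not actual TPC-E results:

```python
# Normalize a throughput score by core count so systems of different
# sizes can be compared. All figures here are hypothetical.
systems = [
    {"name": "Server A", "score": 4938.14, "cores": 36},
    {"name": "Server B", "score": 1871.81, "cores": 12},
]

for s in systems:
    s["score_per_core"] = s["score"] / s["cores"]

# Rank by per-core efficiency rather than raw throughput.
ranked = sorted(systems, key=lambda s: s["score_per_core"], reverse=True)
for s in ranked:
    print(f'{s["name"]}: {s["score_per_core"]:.1f} per core')
```

Note how the smaller box can come out ahead once the raw score is divided out, which is exactly the comparison that matters when you’re choosing between 4- and 8-core systems rather than 36-core monsters.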
You won’t get perfect results and scalability, but you’ll develop a few metrics that allow you to determine if your decisions in development will have severe negative impacts. There still might be problems, but you should eliminate more and more of these over time.
This week I ran across an analytics piece that talks about how Big Data and analysis can do more than help marketing; they can also create hard dollars for your company. Companies are using sensors and technology for mundane tasks, but tasks that can save the company substantial, hard dollars.
You may have heard of UPS and FedEx optimizing their routes, and tracking the actions of drivers. Those companies (and others) have learned that a little programming work to analyze and create instructions for their workers can reduce wasted time and fuel, which ultimately means less wasted money. Other companies have reduced their heating or cooling costs with more data and small investments.
I don’t know how many additional cost reductions can take place in most companies, but there are likely changes none of us have thought of that could be easily implemented with technology. It seems like a no-brainer that technology can build better routes, but early on, it wasn’t considered an option because it didn’t replace a formal manual process. It wasn’t worth paying someone to design routes each day for a truck.
As computer power increases, and more data becomes available, I predict that we’ll find many ways to analyze that data and change the way we run businesses. Many of the ideas will be unexpected, and some may be counter-intuitive, but those of us working with data will have a front-row seat to the ways in which data analysis can transform our world.