Data Darwinism

A great look at the future, where data about you will decide much about your life.

A few years ago my son asked me to buy him The Unincorporated Man. After he finished it, he gave it to me and we read all four books in the series, which we both enjoyed. The premise of this future civilization is that each person is their own corporation, selling stock in themselves to anyone in the world. As with a company, the better your performance at life, the higher the price. However there is also accountability, with your actions, jobs, etc., potentially limited by your “board of directors”, who are the shareholders in your corporation. It sounds a little drastic, but it’s not as bad as you might think. It’s actually a neat idea.

It’s also somewhat of the way the world works now, although without all the disclosure. In today’s world it’s actually much easier to hide your flaws and poor performance because the information isn’t always readily available to potential employers. Some of us see this in the poor performance of colleagues, who were hired with good recommendations or interviews. We may find out later that these were exaggerated, though we often can’t (or won’t) do anything about this, suffering through poor performance from the individual or company.

That may be changing. I ran across an interesting article where vehicle drivers were let go from their jobs because clients had rated them poorly or complained about them. There is some controversy here, but it does bring up the issue of more companies that look to “go where the data goes” in operating their businesses. It’s all too easy to begin using metrics, measurements, feedback, and more to make business decisions. This is one of the driving forces behind building business intelligence systems in our industry. However, if the models, assumptions, or data are flawed, bad decisions are not only possible, but probable. It’s easy to trust the computer’s report more than is prudent, especially when we have no good way of measuring the quality, or even the appropriate interpretation, of the data.

There are lots of BI systems that work well, and provide companies with many benefits, but there are probably also plenty of them that don’t work well and we don’t hear about them. There are systems that use flawed, incomplete, or otherwise compromised models to help business leaders make decisions. Ultimately a BI system needs lots of human intelligence added to it, including judgement and refinement, constant tweaking, and a bit of common sense. I would hope that using data to cut off some service, fire an employee (or decline to interview someone), or make any far reaching decision has a lot of experience and audit built into the system to prevent its abuse.

Steve Jones

The Voice of the DBA Podcasts

We publish three versions of the podcast each day for you to enjoy.

Data Journalism

Using data to tell a story is something that more database professionals should consider.

The open data movement in government has produced some amazing data analysis from many sources. Many people are taking freely available data sets and producing a visualization, or an analysis of a problem, or even an application that is useful to the public. It’s one of the ways that technology and data analysis have really changed the world in a way that wouldn’t have been possible before powerful computers and mobile devices.

I ran across a piece on data journalism that talks about a few projects around the world. This is the idea of adding a story, along with context and clarity, to facts. That is what many people are showing in the various projects in the O’Reilly piece, and it got me thinking. Perhaps this isn’t just something that can be done with open data and public services. Perhaps this is something we could be doing more of within all our organizations.

Journalists learn to inform people in a compelling way. Data journalism is based more around large sets of data. Most of the people I know working with SQL Server often understand the data much better than the business analysts. These technologists, usually those performing some type of development tasks, learn how the data is structured and stored, and might notice the patterns and anomalies in ways that business users ignore.

As the future of databases and database workers evolves, I suspect that those people who can learn to tell a compelling story about the data, and who can present facts to clients and customers in a captivating manner, will be in demand by many employers.

Steve Jones


Learning about Driving with Big Data

This is nice: a year-long safety pilot from the University of Michigan. It is quite extensive, using 3,000 cars to gather data. I don’t know what they’ll get out of this, and if you read the comments, there’s all kinds of speculation, but it’s a good idea, in my opinion.

Many of the commenters are trying to come up with results before the data in this case. Do we need better driver training? Better design? Driverless cars? Who knows? We should get the data and then decide how to proceed.

This study is a good idea, because I think we realize there are some things about driving we don’t know enough about, but there are also a lot of things we don’t know we don’t know. The unknown unknowns are likely going to impact how we interpret this data later.

I’d like to see more studies along these lines, but with truly anonymous data. Do some work to try and disconnect people from the data, which will be hard and likely not work well. Perhaps we just need to map the coordinates of the cars and their interactions to a neutral space, maybe some coordinate system that doesn’t necessarily correspond to latitude and longitude.

However there will be good data out of this that can help us understand how we might better change driving. I just wish this were with more than 3,000 cars. That seems like too few to me. I’d like to see more like 300,000 cars in a study.

Marketing Data is Exploding

GM Customer Service Facebook Conversation
It would be nice to have this be the norm rather than the exception with customer service.

The last few years have seen the rise, and explosion, of social media. While fundamentally social media isn’t introducing much that’s different from actions in the real world, it is making the reach and speed at which we can communicate grow exponentially. Facebook will surpass a billion users soon and Twitter has exploded in the world. Even in the tiny SQL world, Twitter use has grown exponentially in the few years since I started using it. If you’re unsure of what’s happening there with regard to SQL Server, search the #sqlhelp hashtag and see what you find out.

All kinds of companies are starting to try and integrate this data into their systems. If you haven’t yet, I suspect you might have to some time in the next few years. There’s a short piece on how this data is in use at GM, and it’s interesting to read. I don’t know how seriously other companies take this, but I do see so many people “following the herd” and as more managers read about other managers implementing projects like this one, we’ll see more adoption.

What’s interesting here is that it gives us data professionals a number of new opportunities. These range from the sheer OLTP-type work of collecting and managing this data in real time, or near real time, to the need to add BI analysis of trends in the data. I can see a lot of new projects, and new employment, in this area. It wouldn’t hurt to beef up your skills here, and maybe get a better understanding of how social media can work, as well as how you can manage the data.

Steve Jones


One Single View

This editorial was originally published on Oct 9, 2007. It is being re-run as Steve is traveling.

When I worked at JD Edwards, one of the goals of our business intelligence system was to house a single view of the truth. I recently saw a blog post by Andrew Fryer that does a good job of explaining what this is. Basically it’s a way for us to view some particular slice of data and ensure that it is consistently accepted by everyone in the company as the “correct” data for whatever it represents.

This sounds a little silly, but it’s actually a problem in many companies. At the recent PASS Summit, Bill Baker did a presentation where he actually showed a realistic example of this as a reason to implement Performance Point Server. If you have a contest at a company for the most sales, who wins?

You’d think this is easy, but is it the person with the most dollar sales? Do returns count? Should we account for size of a store or hours worked? Obviously we could define this, but depending on how different people might run reports or calculate things in their own spreadsheet, there could be different results.
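To make the ambiguity concrete, here is a minimal sketch in Python. The salespeople, figures, and the three metric definitions are all made up for illustration; the point is only that each reasonable definition of “most sales” can crown a different winner.

```python
# Hypothetical contest data: (salesperson, gross_sales, returns, hours_worked)
records = [
    ("Ann", 12000, 3000, 160),
    ("Bob", 10000, 500, 120),
    ("Cho", 9000, 0, 80),
]

def winner(metric):
    """Return the name of the salesperson with the highest value of `metric`."""
    return max(records, key=metric)[0]

# Three plausible definitions of "most sales"
by_gross = winner(lambda r: r[1])                   # raw dollar sales
by_net = winner(lambda r: r[1] - r[2])              # sales minus returns
by_net_per_hour = winner(lambda r: (r[1] - r[2]) / r[3])  # net sales per hour worked

print(by_gross, by_net, by_net_per_hour)
```

With these invented numbers, each definition picks a different person, which is exactly the argument that breaks out when everyone calculates the result in their own spreadsheet.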

It was the first time that I actually understood what Performance Point brings to a company and why you might implement it. Now I’m not plugging Performance Point here, because I think you could wind up with some tool that requires a couple full-time administrators and developers just to get things set up and maintained. And I’m not sure that you can completely control things with permissions and reports. My guess is that people will always want to pull their own data offline and create their own report (often for the purpose of supporting their own position), but it’s an interesting idea.

The single view of the truth is something I think all DBAs and “data people” want. We want to know that a customer is a customer is a customer. We want to normalize data and have relations that ensure we aren’t duplicating data. We fight through the issues of meanings in one system being translated properly to the next system.

A single view of the truth is hard to create, but it’s a goal that I think is worth pursuing. And implementing a data warehouse is a great way to get started on this. By talking with business people and forcing them to give you rules and mappings, and then implementing a source system everyone can use, you can really ensure that everyone in the company is on the “same page.”

And who knows? Forcing business people to define what that single view is might just help them run the business better.

Podcast Notes

Joe Sibol - The Great Music
I appreciate any and all feedback from people.

Music from Joe Sibol. I like acoustic music and stumbled onto Joe recently. If you like it, send her a donation, buy a CD or something.

And if you’re in a band, send me a sample of some music. I’d love to feature some community talent.


Hard Data v Gut Feel

basic instinct picture
Instincts are great, but they need data behind them

One of the mantras that has guided Google through so many of its product development routines is the fact that the data matters. If you can show that there is data to support a decision, it is more likely to be well received than one based on the experience, instinct, or hunches of any person.

Often in the past I wished that the companies I worked for had operated in a similar manner. Many times I’ve had managers decide to pursue some course of action based on something other than data. At times I’ve even had managers ask if we could rework queries or reports to support their position. It wasn’t entirely unreasonable to question the queries we had written as we may not have approached the problem in the best way, but asking for evidence to support a pre-determined result felt wrong to me.

Companies spend thousands, or millions, of dollars every year on all sorts of resources to collect, manage, and analyze an ever growing set of data points. Many of us find our jobs revolving around data, looking for new ways to analyze data and drive better decisions. At the same time, many companies struggle with decision makers that trust their own instincts more than the data reports they review.

There isn’t a lot we can do as data professionals, other than continue to refine the models we build and work more closely with managers to tune those models, using the actual data before and after the decisions to analyze the results. We can also champion the idea that the data is more objective, and should be a significant part of the decision process, while allowing for some input by the people that have a wealth of experience in our particular industry. We want their gut feel, but we don’t want to ignore the data.

As with most things in business, communication and compromise help to drive us forward to better decisions in the future.

Steve Jones


Social Data Analysis

Tufte Poster
A great visualization from an Edward Tufte seminar.

Today we have an editorial originally published on June 24, 2007. It is being republished as Steve is traveling.

I haven’t been a big proponent of BI as a technology that I think will catch on and become the focal point for many businesses. It’s not that it doesn’t work, but it’s expensive, it’s complex, and it requires a long term investment. While I’m not sure what to do about the last two with today’s IT folks and executive management, I do know that Microsoft and others are helping with the first one. So when I saw this item about social data analysis using a new tool from IBM, I became a little intrigued. It’s an interesting idea for the amateur BI person that may be looking for an easy way to examine their data. It’s certainly better than the manual calculations I did in Economics to do regression analysis of data. We often didn’t even have graphics. That is, if you don’t count my pencil and graph paper.

This is a good outgrowth of the open source ideas of programming. It’s often said that many eyes on a program in the community model produce a better piece of software. After all, developers work harder if they know hundreds or thousands of people will look through their code, and more bugs are likely to be found because more people are working on the problem. I think that’s what IBM is trying to do here: get many people to examine data.

While the sample visualizations tend to look like something just thrown together by people, I can see that as graduate students, government officials, and others look at lots of public data, and data that doesn’t necessarily need to be secure, you could possibly get some real help on problems.

And maybe the most important thing: this gets a lot of people using the same tools. From what I’ve seen of many BI tools, they’re expensive and there are a lot of choices, so trying to facilitate a wide-reaching data analysis program could be hard, even among universities, each of which might have its own software.

While I’m not sure about BI catching on as a mainstream, every-company-has-it technology, I do think it’s very cool and its application to a wide range of problems is likely to give us some new solutions.

Real Time Dangers

There can be tremendous volatility in short term data.

There seems to be a quest to move closer and closer to real time decision making. Gather data, analyze it, and make decisions instantly, preferably with the help of expert systems. That makes some sense, and as shown in the article, it can allow analysts to respond to events very quickly, performing verification, fraud checks, or just about anything you can think of.

It’s a good goal, and it can definitely help many companies make more informed decisions at any point in time. However there are problems as well. Sometimes short term data can fundamentally distort the picture of reality. Some of our large stock market meltdowns are the result of automated systems, perhaps not so much expert systems, as very quick reacting systems that might overvalue the last few pieces of data and make decisions that are less than optimal.

We cannot program systems to handle every situation, nor can we even give enough guidance to inexperienced humans that might be involved in the workflow. Instead we ought to recognize that short term data might not represent longer trends and ensure that we have people looking over the data across a longer timeline before any important decisions are made.
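As a rough illustration of how the last few data points can distort the picture, the Python sketch below compares a short-window average with the full-history average and flags large gaps for a person to review. The order counts, the three-day window, and the 20% threshold are all assumptions chosen for the example, not recommendations.

```python
from statistics import mean

# Hypothetical daily order counts; the last three days show a brief dip.
orders = [100, 102, 98, 101, 99, 103, 100, 97, 101, 100, 60, 55, 58]

def window_mean(values, size):
    """Average of the most recent `size` values."""
    return mean(values[-size:])

def needs_review(values, short=3, threshold=0.2):
    """Return True when the short-term average differs from the
    full-history average by more than `threshold`, expressed as a
    fraction of the full-history average."""
    long_avg = mean(values)
    short_avg = window_mean(values, short)
    return abs(short_avg - long_avg) / long_avg > threshold

print(round(window_mean(orders, 3), 2))  # short-term view
print(round(mean(orders), 2))            # longer-term view
print(needs_review(orders))              # should a person look before acting?
```

A system that reacted only to the three-day figure would see a collapse, while the longer series suggests a dip that deserves a second look before any important decision is made.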

Too often it seems we build systems, assuming that more data, delivered quicker, is the way to prevent poor business decisions. We might easily overwhelm other systems, or people with too much data, delivered too quickly, or used to inform decisions too quickly. Real time systems can provide many benefits, but their use should be tempered with this saying I have long believed: computers give us the power to make mistakes quicker than ever before.

Steve Jones


The HR Scorecard

Dundas controls make building interesting dashboards easy

Are you looking for a BI project in your company to get some experience with? I’ve got an idea that you might try tackling as a side project, or even an informal project within the business.

One of the problems that many corporations face today is a regular turnover of employees. That causes a loss of productivity, morale suffers, and projects may run late. This might be because of poor working conditions, poor management, or some other factor. While you might not be able to change that, perhaps you can help bring some visibility to potential problems.

Human Resource departments are similar to IT departments in that both of them are a cost to the organization without generating revenue. These are overhead departments, and often receive limited budgets and assistance to accomplish anything beyond the bare minimum. These departments can add value, and the good ones help companies in many ways, reducing the costs that might otherwise occur.

I have rarely seen HR departments with deep analysis tools available to them that might help them function more efficiently. If you want to tackle a BI project, think about building a detailed scorecard or set of KPIs for the HR department that might help them better understand what is happening in the company. Watching for the impact of vacations, turnover, etc. and aggregating exit interview data might help HR improve the conditions at the company, perhaps even for you.
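As a starting point, two of the simplest scorecard numbers can be sketched in a few lines of Python. The headcounts and exit interview reasons below are invented, and the turnover formula (separations divided by average headcount for the period) is one common convention, not the only one an HR department might use.

```python
from collections import Counter

def turnover_rate(separations, headcount_start, headcount_end):
    """Separations divided by the average headcount for the period."""
    average_headcount = (headcount_start + headcount_end) / 2
    if average_headcount == 0:
        return 0.0
    return separations / average_headcount

# Hypothetical exit interview reasons collected over one quarter
exit_reasons = ["management", "pay", "management", "relocation", "pay", "management"]

quarterly_rate = turnover_rate(separations=6, headcount_start=120, headcount_end=114)
top_reason = Counter(exit_reasons).most_common(1)[0]

print(f"Quarterly turnover: {quarterly_rate:.1%}")
print(f"Most cited exit reason: {top_reason[0]} ({top_reason[1]} of {len(exit_reasons)})")
```

A real scorecard would pull these figures from the HR system and track them over time, but even this much gives HR something to chart and question.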

I’d consider working on an HR BI project, even during some off hours if you have an interest in this area. I’m sure the HR people will appreciate it, and they might be the ones that are in the best position to help you with your career down the road.

Steve Jones


Pop Tarts and Hurricanes

The second most sold item at Wal-Mart before a hurricane.

What do people buy most of before a hurricane? Wal-Mart determined that batteries are the number one sales item before a hurricane, but the second most sold item was surprising: pop-tarts. That’s the type of intelligence that can come from analysts asking the right questions about their customers and then IT delivering systems that can help them get an answer.

More and more companies are starting to deal with big data. Big data is usually seen as extremely large sets of data. Those data sets often exceed the capability of what a single database server can handle, and place a strain on existing IT infrastructures, especially when there is an explosion of new data types. Hadoop is an open source framework that’s designed to deal with large sets of data and help with the processing and analysis of these large data sets.

Microsoft recently added support for Hadoop to SQL Server. Granted, it’s only in the Parallel Data Warehouse, which is a highly specialized version of SQL Server integrated with specific hardware. However, I am sure this will eventually make its way to other editions over time.

Update: I made a mistake, there is a connector for non PDW SQL Server 2008 R2 instances.

I don’t know if I would recommend learning Hadoop, but the techniques of processing large sets and performing analysis on big data are important. Even more important might be the writing of well-performing T-SQL code that will be used. I’d recommend you read about ways to write better code, and learn to integrate those skills into your daily work. You don’t always need them, but you often don’t know that when you are writing the code.

It doesn’t take much more effort to write better code the first time, but it does take more effort to learn to do so. It’s an investment in your career that you ought to be regularly working on to become one of those talented people that is in demand in the future.

Steve Jones
