Skip to content

Statistical Analysis

I ran across a book recently that is written to address some of the statistical issues the author has found in many scientific papers. It’s been released online for free, but if you like it and would like a more portable format, you can buy it for a reasonable price.

I haven’t dug into the book much, and I certainly don’t consider myself to be a statistician. That wasn’t one of my stronger subjects in school, and it’s a weakness I’d like to correct at some point. However I have had to use some of the statistical functions in past applications, and I wonder if I was using them wrong.

This week I decided to see how many of you are using more complex math in your systems. I’m hoping you understand how the functions work, but I wanted to ask what you’re using:

What statistical functions have you implemented in a production system?

I am thinking of functions beyond basic aggregates like SUM and AVG. Are you using standard deviations, linear regressions or some other complex functions? Have you made use of built-in functions in T-SQL, R, or some other language? Are you implementing custom functions in code or CLR Aggregates?

I think this is one of the areas where our profession will grow more and more across the next decade. As we deal with lots of data of varying types, and our organizations look to gain some strategic advantage through deeper insight into their information, we will have lots of chances to experiment and learn more about complex data analysis.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.1MB) podcast or subscribe to the feed at iTunes and LibSyn. feed

The Hidden Patterns in Data

There’s a great TED talk on statistics, and one I’d recommend you watch. However, it’s not so much the topic and data presented, but the way in which it’s presented and broken down that becomes interesting. If we ignore the topic, and just look at the analysis, a few  things stand out that I’d apply to business.

The first noticeable thing for me was the breaking down of the large bubbles into smaller ones. I think that all too often we try to boil down information to a single, or very limited number, of values. While the overall average or graph of a set of data can be helpful, it can also be distorted by outliers in the underlying data. Seeing the data broken part into quintiles helps to analyze how closely the overall average matches up with the data.

The other item that I think sometimes get lots of analysis of business data is the change across time. We often look at year over year, or last period’s values, but we don’t dive much deeper into the changes over time. Seeing a series of numbers on a page works well for some people, but the animated visual can be helpful in discerning a pattern or making a decision based on data.

This is a great look at statistics, and it makes me think many of these techniques can be handy in looking at something like server performance metrics and deciding if we need more (or less) resources, or even if we have good candidates for consolidation. I’m looking forward to trying to adapt these tools for my own uses.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.6MB) podcast or subscribe to the feed at iTunes and LibSyn.

Visualizing the Tally Table

I was reading Dwain Camps’ article on Time Slots and thought it was a very interesting solution to a problem I’ve had a few times. Getting time slots inside of a period that I want to query. If you have a similar need, or want to learn more, I’d urge you to read the article.

It’s always easier to join to a set of data that matches what you need than to try and filter out other rows. SQL excels at joins, so whenever possible you want to join to data. As such, when I was looking at Dwain’s code, I thought the way he listed the tally table was very interesting. I’ve seen plenty of these generated, but I hadn’t run across someone spelling it out in comments. In case you are wondering, Dwain had code like this:

WITH Tally (n) AS ( SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 -- zero-based -- Returns exactly 86400 rows (number of seconds in a day) FROM (VALUES(0),(0),(0),(0),(0),(0)) a(n) -- 6 rows CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) b(n) -- x12 rows CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) c(n) -- x12 rows CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d(n) -- x10 rows CROSS JOIN (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) e(n) -- x10 rows ), -- = 86,400 rows

That’s a great visualization, and one I plan on using in the future. It makes it easy to see what’s being generated and at what scale.

For example, I can easily do this:

SELECT ROW_NUMBER() OVER ( ORDER BY ( SELECT NULL ) ) FROM ( VALUES ( 0), ( 0) ) a ( n ) -- 2 rows

which returns a single column table with the values 1 and 2 in it. Two rows.

Let’s say I want to now build a list of 12 rows. I could do this a few ways. One is to multiple 2 x 6 and get 12.

SELECT ROW_NUMBER() OVER (ORDER BY ( SELECT NULL)) FROM ( VALUES (0), (0) ) a(n) -- 2 rows CROSS JOIN ( VALUES (0), (0), (0), (0), (0), (0) ) b(n); -- x 6 ;

Or I could give myself more flexibility to add and remove data with comments by doing factorials. How about 2 x 3 x 2 = 12?

SELECT ROW_NUMBER() OVER (ORDER BY ( SELECT NULL)) FROM ( VALUES (0), (0) ) a(n) -- 2 rows CROSS JOIN ( VALUES (0), (0), (0) ) b(n) -- x 3 CROSS JOIN ( VALUES (0), (0) ) c(n) -- x 2 ;

That gives me the same result: 12 rows. Of course, I can easily expand this quickly to thousands of rows.

The technique isn’t anything new, but the visualization is interesting, and to me, this is much easier technique to see and understand when you run into it in code. Right away I know I’m generating xx rows and I can easily see how to grow or shrink the number of I have the need.

Who Am I?

I saw Brent Ozar updating his About Me slide, and thought it was interesting. Not sure if I like what he’s done, but it’s different, which usually attracts attention.

I updated mine last year, with some help from the people at Red Gate. Here’s my slide.

2015-05-26 18_30_25-CI for Databases.pptx - Microsoft PowerPoint

No description of jobs, what I’ve done, my age, company, etc. No MVP or other awards or certifications.

I wasn’t sure about this early on, but I like it. I can tailor the talk about me to the situation, making it short or long. I also include the information on the right at the end of the deck as the last slide, so people can grab it there.

I’m not sure if this works, but I have seen a bit more traffic to my blog and more Twitter/LinkedIn activity.

I also get some fun comments when I’m actually wearing this shirt at a talk.

Changing NULL to NOT NULL – Dealing with Data

I wrote recently on how to alter a column in SQL Server from NULL to NOT NULL in a simple way. However I didn’t cover some of the cases where you have data in the column, or other restrictions that you might need to deal with. In this post, I want to look at a few options if you have data in the table.

Let’s take my SimpleTable and add data:

INSERT Simpletable VALUES (1, A, 1) , (2, B, 0) , (3, C, NULL) , (4, D, NULL) ;

Now I have 4 rows. If I run this:

ALTER TABLE dbo.SimpleTable ALTER COLUMN Status TINYINT NOT NULL;

I get this error:

Msg 515, Level 16, State 2, Line 2

Cannot insert the value NULL into column ‘Status’, table ‘Sandbox.dbo.SimpleTable'; column does not allow nulls. UPDATE fails.

The statement has been terminated.

Essentially, SQL Server can’t mark this column as NOT NULL because there are NULLs in there. What can I do here?

UPDATE the Table

The only thing I can do is to change the data. I can run a query that removes all the NULLs in the column by setting them to some value. That’s what I’ll do here. I can choose a value here and run an UPDATE statement. Something like this:

UPDATE dbo.SimpleTable SET Status = 0 WHERE status IS NULL;

Once I do that, I can easily run my ALTER and it succeeds.

 

Unit Testing in Philadelphia

I’m helping teach a pre-conference session on Friday, June 5, 2015 at the Microsoft office in Philadelphia. This is an all day, paid for event, that looks at how you can use a framework to write unit tests for your T-SQL code. We’re the day before SQL Saturday #390 in Philadelphia

I really believe in testing, and am trying to advocate for it in many places. I’ve delivered a testing session that has been well received at quite a few events and this is the first time I’m trying a full day training class.

I am working with Sebastian Meine, the founder of tSQLt, to present the class. We’ve got a busy outline, looking at a variety of ways that you can write tests and use them to find problems in code. Here’s what we’re covering.

  • Introduction to Unit Testing
  • What is tSQLt?
  • Your First Test
  • Executing Tests correctly
  • Effective use of Assertions
  • Separation of Concerns
  • Testing Exceptions
  • Test Case Heuristics
  • Dealing with Test Data
  • Other Types of Testing
  • How Unit Testing fits into your Development Process

At the end of the class, you should have some good ideas on how to build and structure tests in your own environment and be ready to start testing on Monday.

I hope to see you there, and register today if you want to learn more about unit testing in SQL Server.

Native Audits

As our databases contain more and more information, it is becoming increasingly more common to implement some sort of auditing feature to ensure that our data is not being inappropriately accessed or altered. It’s not only the management of our organizations, but also regulatory rules are becoming more numerous, and some may think onerous, which increases the burden on the DBA.

There are numerous products available to help here, but SQL Server has included it’s own internal features since SQL Server 2008. However, it seems that I encounter many people that are unaware of the SQL Server Audit feature, and indeen, may not be aware of the Extended Events system on which it’s based. This week I wanted to ask if you are using this feature.

Are you using SQL Server’s native Server Audit and Database Audit features?

If you can share and reasons or details on how you are using the audits, or how easy it is to administer them, it would be interesting. I think we do need to increase the level of auditing on our systems at least to ensure that we are not experiencing any inappropriate data access. At a minimum, we should be monitoring for privileged level access or account changes.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 1.8MB) podcast or subscribe to the feed at iTunes and LibSyn.

DB_Owner Querying for Database Options

Another post for me that is simple and hopefully serves as an example for people trying to get blogging as #SQLNewBloggers.

What can a user with the db_owner database role do? I assumed anything inside of the database (create/alter objects, assign permissions, back up the database, etc). However someone asked recently about whether someone could read database properties. I’d assume they can, but I needed to check.

I decided to start by creating a new login. I have lots on my test instance, but I went with just building a new one. I used the GUI to add [DBOwnerTest] because it’s quicker. I assigned a password, set a default database, and mapped this user to db_owner in a database.

dbowner_a

I then opened a query window and changed the connection:

dbowner_b

Now I could easily run a query for properties and see the results:

dbowner_c

Note the connection at the bottom of the image above.

Here’s where it’s a little interesting. I disconnected Object Explorer and connected back as DBOwnerTest. I see this:

dbowner_d

Looks normal. According to the BOL documentation for permissions, db_owner gets View Any Database as well as control over their own database. I can see the properties of the Sandbox database (where I’m db_owner).

dbowner_e

But I can’t see properties of other databases.

dbowner_f

I also see my login, but not the couple dozen others I have, other than sa. That’s curious, and perhaps not good. However this isn’t the place to delve into that.

dbowner_g

I have the ability to query through databasepropertyex(), which I’ve documented in another post.

Certainly db_owner conveys lots of rights to the user, and certainly the ability to see some of the outside of the database container, such as the options and properties, as well as other databases.

Writing

This was based on a question I saw posted at SQLServerCentral. It took me about 5 minutes to set up a test login and query for information. I had to perform a few searches and try some queries. I spent a few minutes researching databasepropertyex(), which became another post.

All told, this was about a 15 minute post.

You can do this. Join the #SQLNewBlogger group and start documenting your career. You can see all my posts that fall into this area by looking through the SQLNewBlogger tag here.

References

Permissions of Fixed Database Roles – https://technet.microsoft.com/en-us/library/ms189612%28v=sql.90%29.aspx

Naming Confusion

When I first worked in a corporation that had many servers, I had to learn the encoding of our systems. We had SUIS01, SUIS02, NAIS01, NAOP01, etc. Each name designated a location and department, but no information as to what applications were running. That was a separate thing to learn, which thankfully didn’t change often as procuring and setting up a new system was a relatively rare event.

However I also ran into a few systems named “Dorothy” and “Wiz”. The other administrators griped constantly about the unprofessionalism of those names. I’m not sure I think SUIS01 is better or worse than Dorothy, but that’s a separate debate. In both cases, we had to memorize what applications were actually running on which systems. Since all of them were managed by IS, regardless of the department that owned them, we were responsible for understanding the purpose of the server.

I ran into a situation recently where a company had a server called SalesSQL, which was there production server. The development server was SalesSQL_Dev, which is fine until you see someone opening SSMS, thinking they’ve connected to the right server since they see “SalesSQL” in the connection dialog and starting to execute code. Granted, even a rookie should be able to notice the “_Dev” (or a similar “Dev_SalesSQL” name), but mistakes happen.

I wonder sometimes if naming systems too closely to similar systems in different environments is a good or bad idea. I especially get concerned if shortcuts are taken with database names like “Sales” and “Salesqa”. I think it can be easy to confuse those systems when you become stressed or are in a hurry and are glancing at a small drop down box.

Ultimately I’m not sure there’s a good method for naming systems that prevents confusion, and ensures people are working in the correct environment. I have tended to use suffixes in my career to denote development and test environments as I can more easily differentiate a dev environment, but I wonder if I shouldn’t be using prefixes instead. Having to retype the beginning of a name (or delete the “dev”) in the connection dialogs might force me to do extra work to connect to a production system. That little extra bit of effort might be just the thing to reduce the chance of performing development work on a live system.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.5MB) podcast or subscribe to the feed at iTunes and LibSyn.

Controlling Alerts

None of us likes to be on call for our organizations. Late night phone calls, especially for something silly like adding a new user to a system or rebooting a machine, can make for irritable employees in the morning. Too many nights in a row of problem calls, legitimate or not, and I’d argue that an employee’s effectiveness during the day is substantially diminished.

Since many of us are paid to be creative and improve systems over time, is regular firefighting a good use of our knowledge? This might be even more true if we find ourselves receiving escalation calls on a regular basis and our daily workload doesn’t change.

I ran across a company that had set up a rule for alerts and pages. If any alert was set up in an automated system, it had to have corresponding documentation, with a resolution, in an operations manual. That’s an interesting idea, and certainly promotes the idea of performing root cause analysis and understanding your systems in advance. However does this really reduce any escalation? 

Perhaps if there is a team available 24/7 that triages initial alerts this helps, but if there’s a small team of people that must support systems, is it worth the effort to document every issue? I suspect that a better policy is that repeating alerts, those that occur regularly, need to be analyzed and a more permanent solution implemented. I know that’s easy to say and hard to do, but that really is the best policy. At least if you want to ensure your employees are working efficiently over time.

For those calls that are ad hoc, I recommend a log of actions taken be kept in an easily accessible location, in a text format for searching. That way if a different person receives a similar alert, at least they’ll have some idea of what actions the last person performed.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.6MB) podcast or subscribe to the feed at iTunes and LibSyn. feed

Follow

Get every new post delivered to your Inbox.

Join 5,260 other followers