The Clustered Index is not the Primary Key

I was reading through a list of links for Database Weekly and ran across this script from Pinal Dave, looking for tables where the clustered index isn’t the PK. It struck me that this is one of those facts I consider to be so simple, yet I constantly see people confusing. If you click the Primary Key icon in the SSMS/VS designers, or you specify a PK like this:

CREATE TABLE ForSomething
  (
    SomeUniqueVal INT PRIMARY KEY
  );

What will happen is that a clustered index is created on this field by default. It’s not the the PK must be clustered, but that SQL Server does this if you don’t tell it otherwise. Tables should have primary keys, and while you can debate that, most knowledegable SQL Server people I know want a PK on tables. There are exceptions, but if you can’t name them now, use a PK.

However the PK isn’t a clustered index. They are separate concepts. The PK can be clustered or non-clustered, and what you choose it up to you. However, I like Kimberly Tripp’s advice. Choose the clustering key separately from the PK, keep it unique, fixed, and narrow.. If they’re the same, fine, but don’t try to make them the same. Choose what works well for your particular table, which means thinking a bit.

You get one clustering key, and it’s worth spending five minutes debating the choice with a DBA or developer, or even post a note at SQLServerCentral. Changing the choice isn’t hard, but it can interrupt your clients’ work on your database, so try to make good design choices early, without blindly accepting defaults. It’s worth a few minutes of your time to make a good choice.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.0MB) podcast or subscribe to the feed at iTunes and LibSyn. feed

Naming Is Hard

I ran across Aaron Bertrand’s naming post recently, which I liked overall. Your team needs to agree how things are named, and be consistent, but I agree that some of the rules I’ve seen put down by people aren’t practical. However I also know that naming things becomes hard, especially over time as your systems evolve and the scale grows.

I’ve run into some strange naming patterns in the last few years that struck me as odd. I grew up in the Hampton Roads area of Virginia, and with a parent selling real estate, I traveled all over the area, often reading a map to navigate. I can rarely remember seeing the same names used over and over, despite the fact this was a relatively populous area.

However when I came to Colorado, I found things to be different. Either the people in charge of naming weren’t creative, or they were very, very lazy. In some of the developments I entered, I’d find the same name used many times on adjacent streets. You can find Wolfe St, Wolfe Pl, Wolfe Ct, and more all next to each other. While a computer can easily differentiate these sets of characters, this can easily create confusion in humans.

On a note closer to our livelihood, I have seen some strange databases objects. The tables in a JD Edwards system were coded with a short alpha sequence and then a 4 or 5 digit number. Always fun to remember what “FS3401” means. However I saw one recently that stunned me. I ran across a SQL Server database that was quite large, not in data size but in the number of objects. Apparently a developer or DBA at some point decided it wasn’t worth trying to name tables to match some entity and used GUIDs are names instead. I don’t know about many of you, but comprehending which GUID stores which data might have me resigning from that position relatively quickly.

We won’t all necessarily agree on how we should best name objects and entities, and that’s fine. However I would implore you to at least consider the humans that will have to work with your database. Use some creativity to build names that are easy to differentiate and pronounce for the future DBAs and developers that need to enhance and query your system.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 3.0MB) podcast or subscribe to the feed at iTunes and LibSyn.

Relational By Default

I’ve read a few pieces from Vladimir Khorikov and I tend to like the thoughtful way in which he approaches building software. When I saw a link to his SQL v NoSQL databases, I was intrigued to see what his thoughts would be. It’s a good read, though with relatively few comments or debates posted at the time I read the article. I was hoping for more and I encourage you to add your thoughts.

The main contentions in the piece are that any of the simpler development techniques that we can use with NoSQL databases don’t remove the need for implementing data quality or management features. They just require those features to be implemented by the developer. Specifically the need to deal with multiple schema versions over time, or the requirement we enforce parent child relationships.

Perhaps that’s not too bad. After all, if we evolve schemas and the code can easily deal with the changes, that’s good. However I think the point about having relational storage when you can, and adding in other stores makes sense. This is what I really believe as well, since we often need to query and combine data, which is something relational stores do very well and very efficiently.

Ultimately I’m sure we will continue to see arguments about relational and NoSQL models being better or worse for particular problems. I actually welcome the arguments  because there are certainly domains of problems where one system works better than the other, and it’s good to develop some clarity about those particular problems. However there are also likely many situations where multiple platforms can work, and in those cases, we should use what works well for our team.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 1.9MB) podcast or subscribe to the feed at iTunes and LibSyn.

Poor Data Modeling – T-SQL Tuesday #72

tsqltuesdayThis month Mickey Stuewe hosts the T-SQL Tuesday and she has a great topic. Data Modeling is something few people seem to do, especially developers, and it often can go wrong. That’s the topic, and I have a short story.

T-SQL Tuesday is a monthly blog party, on the second Tuesday of the month. The rules are on Mickey’s blog, and you can read them there, but essentially you write a post every month.

Or later.

The Aging Model

I once worked at company that shall rename nameless. We had a system built before my time which had a hierarchical set of entities. I can’t disclose the exact industry, but imagine that we had a list of items like this:

  • Manufacturers of products
  • Distributors of products
  • Warehouse holding products
  • Vendors selling products.

In this structure, we have a strict linkage where each item below the next is contained in the item above it. In other words, a manufacturer works with specific distributors and only those distributors. Distributors don’t work with other manufacturers.

I know this sounds contrived, and it is for a supply chain. However not for the industry in which I worked. So imagine we’re 100 years ago when power was more concentrated with supply chains.

This resulted in tables like this:

CREATE TABLE Manufacturers ( manufacturerid INT IDENTITY(1, 1) CONSTRAINT manufacturer_PK PRIMARY KEY ( manufacturerid ) , manufacturername VARCHAR(200) , primarycontactid INT -- ... ); GO CREATE TABLE Distributors ( distributorid INT IDENTITY(1, 1) CONSTRAINT distributor_PK PRIMARY KEY ( distributorid ) , manufacturerid INT CONSTRAINT Distributor_Manufacturer_FK FOREIGN KEY REFERENCES dbo.Manufacturers ( manufacturerid ) , manufacturername VARCHAR(200) , PrimarySalesPersonid INT -- ... ); GO CREATE TABLE Warehouses ( warehouseid INT IDENTITY(1, 1) CONSTRAINT warehouse_PK PRIMARY KEY ( distributorid ) , distributorid INT CONSTRAINT Warehouse_Distributor_FK FOREIGN KEY REFERENCES dbo.Distributors ( distributorid ) , warehouse VARCHAR(200) , regionid INT -- ... ); GO

 

Each of these links to the item above it. This means that I might have a Manufacturer  table like this:

manufacturerid  manufacturername   …

————–  —————-  

1               Acme

2               Big Product Corp

With warehouses linked as shown.

warehouseid  manufacturerid warehousename   …

———–  ————– ————-  

1            1              Denver Central

2            1              Kansas City East

3            2              Santa Fe

4            1              Billings Depot

This would mean that I used distributors that worked with a warehouse, and their data would be.

distributorid warehouseid distributorname  …

————- ———– ————-  

1            1            Denver City

2            1            Denver South

3            1            Aurora

4            2            Kansas City Distributors

5            3            New Mexico Wholesale

If I wanted to get a list of the distributors that carried a manufacturer’s products, I’d have to join through the warehouse table.

SELECT manufacturerid , ... FROM dbo.Manufacturers AS m INNER JOIN dbo.Distributors AS d ON d.manufacturerid = m.manufacturerid INNER JOIN dbo.Warehouses AS w ON w.distributorid = d.distributorid ...

Not a big deal, but we had 5 levels of this appear over time. Which means that queries that might need a higher level had to cross tables in between to join data. These also weren’t narrow tables, with a decent amount of meta data for each entity. Indexing helped, but certainly we needed better rollups of data, and performance suffered as the amount of legacy data grew.

What’s more, over time, we learned that business changes. A warehouse might start to work with multiple distributors, and our model couldn’t cope.

Eventually we embarked upon a project that set of link tables between entities, so that we had just IDs in a table that showed linkages where appropriate. This was ongoing when I left, but it was a major disruption and problem for the company.

I’ve tried to avoid embedding hard relationships based on business contracts in my models. Certainly some FKs make sense, like order details linked to a specific order, but many other relationships aren’t as tightly bound, and it is possible that they will change. It doesn’t cost me a lot in modeling to treat suspect relationships as many:many, and if application development does this from the start, it’s an easy architecture to incorporate in their development.

Data modeling is always a struggle as we often work with incomplete information because an organization will morph over time. It is, however, worth taking some time to think about the possible ways in which things might change in the future and allow for some flexibility in your model where possible.

Positive or Negative

While watching a presentation recently on a SQL Server topic, I was surprised by a design technique that the speaker used. I hadn’t ever thought of the issue, but it caught my eye and made me rethink the way I look at design, and even data.

In this case, the speaker was building a table and looking to make a row active or inactive. This person chose to use a column name of “inactive,” which surprised me. I’ve always used an “active” column if I am trying to mark a row in a binary fashion.

That got me thinking. I wonder if there is some impact to designing entities either way, but I wanted to ask this question:

Are you positive or negative. Would you use an active flag or an inactive flag?

This could be extended to any other type of choice, but do you look to flag an action as happening, or not happening. As readable or not readable? As usable or ignored? I’m not sure it matters, but I’d certainly be curious if anyone has any reason why you might choose one or the other. Is there some meaning conveyed? Let us know this week.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 1.9MB) podcast or subscribe to the feed at iTunes and LibSyn.

Logging Tables

I was chatting in one of our DLM webinars and someone asked a question about logging data. I mentioned a few things, but this individual followed up and wanted more information. Specifically, how do you build a logging table?

It’s an open ended question, but it’s perfect for a poll. How do you construct a table that you use for capturing lots of information that you may or may not query later, but you might only query in a very ad hoc fashion when something has gone wrong. Usually that’s the only time I look through logs, when I’m investigating some type of issue. With that in mind, this week the question is:

How do you design a logging table?

I’m looking for two things here: the DDL and the rational for the design. What is good about this design? How does this particular structure (with indexing) allow me to find answers to the questions I might need to ask.

I realize this is very open ended, but the concept of logging is also open-ended, used different ways for different purposes. If you have problems with logging structures, let us know what design techniques to avoid as well.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 1.7MB) podcast or subscribe to the feed at iTunes and LibSyn.

ER Modeling

I saw someone recently note that an Entity-Relationship model is important for your database development project. That UML is important, and the model should be actively maintained. I agree.

However, I’ve usually not seen this done. Typically when I’ve seen an ER diagram in a shop, it resides on a wall, printed out on multiple sheets of paper. Often there are as many handwritten additions on the pages as there are marks from a printer. The diagram is also usually out of date, even allowing for handwritten notes.

Maintaining a diagram is hard. Deciding to actually spend time working on one instead of just talking about a few entities and building them, may be harder. This week, I wanted to ask you if you’re up to the challenge in your organization.

How many of you maintain an ER model for your database?

Do you have an ER diagram you’ve built? Whether it’s in ErWin, Visio, or Paint. Is there a diagram that’s actively maintained and that everyone can use for designing changes to the database(s) in use? Do people actually consult this and use it in design meetings? Let us know this week.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 1.5MB) podcast or subscribe to the feed at iTunes and LibSyn.

No More Foreign Keys?

I struggle with removing FKs as a general idea. It’s presented in a piece from a developer. It appears to me that if you do this, you are merely moving the problem, and potentially causing other issues. Apart from Grant’s response on performance, which is very true, there are other challenges with microservices and separate databases for each service.
Let’s say we are doing something trivial, like a simple e-commerce site. I could have a separate database for products, one for orders, one for shipping actions, one for customers, etc. However as I scale, I’m not using intra-database queries or FKs/joins/other techniques to verify information for an order. Now I’m sending messages. At small scale, these are trivial. At large scale, my network traffic and messaging is vastly growing and potentially a bottleneck.
Also, since there is a “products” database, and an “orders” database and a “customers” database, I still can’t necessarily scale out these microservices beyond their own machine. Perhaps I can scale them higher in terms of each being as large a machine as I can have now with a single Oracle/SQL Server/MySQL/etc. box, but I’m still have a scaling issue. I also now have a new messaging problem I need to manage and architect. If I lose one database, how gracefully can my application degrade or does it start to fail with unforeseen interactions? Do I create more frustration when customers cannot place an order because we can’t find the customer db or because the site is down?
Certainly there are domains of problems that would work fine here. Spotify is probably a good example of this. There may be e-commerce systems that allow this, perhaps being willing to notify customers after some time (hopefully minutes, perhaps hours) that a product is out of stock because the message from the orders database didn’t get to the application. There are certainly ways to copy and replicate some  data, assuming that “fresh enough” works for your application. However that’s a business decision more than a coding decision. This also means that you also have a new load on your system to move data. I also think the more you work with singleton rows and single objects in an application context, the less you need RDBMS capabilities.
Ultimately I think FKs and strong RDBMS systems work very well in many situations. They work less well in many situations and your domain may qualify for those areas where a NoSQL, a multi-db, or other architecture works. We should certainly investigate and write about where things work and don’t work, but I wouldn’t assume the RDBMS with it’s PKs and FKs isn’t a valid, well thought out and incredibly useful architecture.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 3.5MB) podcast or subscribe to the feed at iTunes and LibSyn. feed

Making Guesses

I was reading a bit about the design and planning stage for a new database that will support some random application. The piece isn’t important, but I did notice one thing. The guidelines are generic, encompassing most of the things we might want to think about: structure of data, constraints, volumes of data and rate of growth, and more. It’s a good list, and one many of you have probably considered at some point if you’ve designed a system.

However, I’ve rarely been able to actually do more than guess at many of the answers. Even much of the data I can write down is often little more than a guess. Even when I know the DRI constraints, I hesitate to document them separately from the code itself. Mostly because I find designs change early and often, and the documentation is rarely able to keep up with code. The same thing applies with the administrative type decisions I make.

I’m also nervous about documenting guesses.   I find that far too often any answer I give takes on some level of truth with the developers, often being considered an immutable fact. The assumption that an early guess will match reality later seems to often blind everyone and limits the level to which we can adapt to changing situations.

As an example, if I tell a SAN admin that we think we need 5GB of space, but we find that we need 20GB in a short period of time, I meet resistance. A debate, or argument ensues, we search for evidence that we need to change, and a very inefficient process ensues. The same thing occurs when developers, security admins, and especially managers hold too tightly to estimates and guesses.

Certainly some guidelines are needed, but when I’m not sure, I tend to give wide ranges, just to prevent too many expectations from taking hold in people’s minds. We should know that our plans will rarely survive production deployments and workloads, but somehow we forget that when looking at the guesses we make early on.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 2.4MB) podcast or subscribe to the feed at iTunes and LibSyn.

Defining Foreign Keys at Table Create Time

Another post for me that is simple and hopefully serves as an example for people trying to get blogging as #SQLNewBloggers.

How many of you can define a foreign key when you create the table? Probably a few of you, but I bet most of you are like me and don’t necessarily know the syntax. I have often defined these later, which is fine. As long as they get defined.

However I knew I needed a specific key when I was creating a table and couldn’t remember the syntax, so I had to search and learn how. I used Google and saw a few links from MSDN, but those tend to be overly documentation heavy. One of the links was to SQL Authority, run by Pinal Dave. He does a great job of simplifying things (and he’s a friend), so I followed that link. I could see the syntax and tested it in minutes.

It’s easy to create a Primary Key in CREATE TABLE, and I wrote about that for one of my first SQLNewBlogger posts. The Foreign Key is similar, but not quite as simple.

Imagine that I have a parent table:

CREATE TABLE orders ( orderid INT IDENTITY(1, 1) CONSTRAINT Orders_PK PRIMARY KEY ( orderid ) , orderdate DATETIME , complete BIT ); GO

I now want to create a child table and link the orderid in the child to the parent. I can do it like this:

CREATE TABLE OrderLines ( orderlineid INT IDENTITY(1, 1) CONSTRAINT OrderLines_PK PRIMARY KEY ( orderlineid ) , orderid INT CONSTRAINT orderlines_order_fk FOREIGN KEY REFERENCES orders ( orderid ) , qty INT ); GO

Note that I define a constraint inline, just as I did for the parent. However I note this one is an FK and it "references" another table. In this case, I list the Orders table and put the columns in parenthesis.

Quick, easy, build your FKs inline when you know about them in advance.

SQLNewBlogger

While trying to remember how to create an FK, I ran a search and chose the reference below to start. A matter of seconds had me seeing the syntax and writing the code.

Putting this together was less than ten minutes.

References

Creating Primary Key and Foreign Key Constraints – http://blog.sqlauthority.com/2008/09/08/sql-server-%E2%80%93-2008-creating-primary-key-foreign-key-and-default-constraint/