Common SQL Server Mistakes – Functions in the WHERE Clause

This continues my series on Common SQL Server mistakes, looking at more T-SQL mistakes.

What’s Wrong?

If you saw a query like this, would you see a problem?

select o.OrderID
  , o.CustomerID
  , o.Qty
from Orders o
where datepart( yyyy, o.OrderDate) = '2010'

If there are 1,000 orders in this table, there probably isn’t an issue. But if there are 1,000,000, then this is an issue.

Why? Let’s examine the execution plan:

This table has 1,000 rows in it, but the query doesn’t use an index to find the orders placed in 2010. Instead it scans all the rows. The reason is that wrapping the column in a function in the WHERE clause means the index on OrderDate cannot be used.

Instead, what you would want to do is write the query like this:

select o.OrderID
  , o.CustomerID
  , o.Qty
from Orders o
where o.OrderDate >= '20100101'
  and o.OrderDate < '20110101'

In this way, we eliminate the function from the WHERE clause and allow the query optimizer to take advantage of the index on the OrderDate column.

You see similar issues with queries like:

select LastName
from Person.Contact
where left(LastName, 1) = 'S'

This can be fixed as:

select LastName
from Person.Contact
where LastName like 'S%'

Basically you want to get the function off the column, either by moving it to the other side of the comparison or by rewriting the predicate, so that indexes can be used.

Too often we have developers writing queries like this, assuming that the functions are efficient. They are, but when a function is applied to a column it has to be executed against every row in the table, so an index can’t be used for seek operations, and seeks are almost always quicker than scans for any significant amount of data.

When you are writing queries, do your best to avoid functions against columns in your tables. Instead try to rework the query to move the function. An alternative that I’ll blog about another time is computed columns.
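
As a quick preview of that computed column approach, here's a rough sketch against the Orders table from above (the column and index names are my own):

-- Sketch only: persist the year as a computed column and index it, so a
-- "year = 2010" style predicate can still use an index seek
alter table dbo.Orders
    add OrderYear as datepart(yyyy, OrderDate) persisted

create nonclustered index IX_Orders_OrderYear
    on dbo.Orders (OrderYear)

select o.OrderID, o.CustomerID, o.Qty
from dbo.Orders o
where o.OrderYear = 2010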

Common SQL Server Mistakes – Not Indexing FKs

This series looks at common mistakes that I see many people making in SQL Server.

Foreign Keys

It’s way too often that I see people building databases without declared referential integrity (DRI). Even when they set primary keys on their tables, they often skip the foreign keys that link related tables together.

However, even when people have declared a FK, they often don’t create an index on that column. Perhaps they assume that SQL Server will create the index like it does for PKs, but it does not.

If I create these two tables and join them with a FK:

CREATE TABLE [dbo].[Products](
    [ProductID] [int] NOT NULL,
    [ProductName] [varchar](50) NULL,
    CONSTRAINT [PK_Products] PRIMARY KEY CLUSTERED ([ProductID] ASC)
)

CREATE TABLE [dbo].[ProductDetails](
    [ProductDetailID] [int] NOT NULL,
    [ProductID] [int] NULL,
    [SKU] [varchar](50) NULL,
    [Price] [numeric](18, 2) NULL,
    CONSTRAINT [PK_ProductDetails] PRIMARY KEY CLUSTERED ([ProductDetailID] ASC)
)

ALTER TABLE [dbo].[ProductDetails] WITH CHECK ADD CONSTRAINT [FK_ProductDetails_Products] FOREIGN KEY([ProductID])
REFERENCES [dbo].[Products] ([ProductID])

ALTER TABLE [dbo].[ProductDetails] CHECK CONSTRAINT [FK_ProductDetails_Products]

If I go and check indexes on ProductDetails, I find that there is only one index, the index for the PK.


Why is this a problem? Performance. Indexes speed up queries by reducing the amount of work SQL Server has to do to find the rows you ask for.

With FK columns, what I’ve often found with child tables is that I know the value of the FK column I am searching for and don’t need to join to the parent table at all. However, without an index on the FK column, a query like this requires a table scan:

select sku
  , price
from ProductDetails pd
where pd.ProductID = 3

If you are creating FKs in your database, don’t forget to index them where appropriate.
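
As a minimal sketch of that for the tables above (the index name here is my own choice):

-- Index the FK column so a lookup by ProductID can seek instead of scanning
create nonclustered index IX_ProductDetails_ProductID
    on dbo.ProductDetails (ProductID)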

Auto Creation

I’ve seen some people ask why SQL Server doesn’t automatically create indexes on FK columns. I’m torn on this, but I like the 80/20 rule: if 80% of the tables would benefit from it, I think it should be done. I’m leaning towards some intelligent mechanism to do this.

The main issue is that you might not want just an index on the FK column. You might want some sort of covering index that includes columns in addition to the FK column to prevent key/bookmark lookups to the clustered index. If you can avoid those, you can drastically increase performance.
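
For the ProductDetails example, such a covering index might look something like this (just a sketch; the INCLUDE list depends entirely on what your queries actually read):

-- The FK column as the key, with the queried columns INCLUDEd so the
-- earlier SKU/price lookup needs no key/bookmark lookup
create nonclustered index IX_ProductDetails_ProductID_Covering
    on dbo.ProductDetails (ProductID)
    include (SKU, Price)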

There is also the chance that with your query load, you never use these indexes. That can be horrible for performance as well since there is overhead to maintain these indexes on all insert/update/delete operations.

The Advice

Look at the queries that are coming into your database. Check the missing index DMVs and if you find that the FK columns are being used, index them.

If you’re not sure, or don’t know how to look for missing indexes, here’s a reference.
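
As a rough sketch of what that check can look like (these are the real DMV names; the ordering is just one common heuristic):

-- Suggested-but-missing indexes, roughly ordered by estimated impact
select  d.statement as table_name,
        d.equality_columns,
        d.inequality_columns,
        d.included_columns,
        s.user_seeks,
        s.avg_user_impact
from sys.dm_db_missing_index_details d
    inner join sys.dm_db_missing_index_groups g
        on d.index_handle = g.index_handle
    inner join sys.dm_db_missing_index_group_stats s
        on g.index_group_handle = s.group_handle
order by s.user_seeks * s.avg_user_impact desc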

Common SQL Server Mistakes – Equals NULL

One thing that I don’t see a lot, but that still happens with people new to SQL Server, is the comparison they’ll make with NULL values. Often people new to T-SQL will write this:

select CustomerID, CustomerName
from Customers
where SalesRepID = NULL

The thought here is that they are looking for those customers that don’t have a sales rep assigned. Or they might enclose the NULL in quotes, which compares against the character string 'NULL' rather than testing for a missing value, so that doesn’t work either. With = NULL the query runs but returns no rows, because (under the default ANSI_NULLS setting) a comparison with NULL never evaluates to true.

The correct way to do this is:

select CustomerID, CustomerName
from Customers
where SalesRepID Is NULL

Note the “Is NULL” that will correctly return those customers who have a NULL value stored in that column.


NULL is an unknown value. We just don’t know what value it is, so it doesn’t behave like a variable in algebra such as “x”. In algebra, x = x, but NULL = NULL does not evaluate to true (neither does NULL <> NULL; both comparisons are unknown). Since we don’t know what the value is, and since each row could potentially hold a different value (remember, every NULL’s value is unknown), we can’t expect any NULL to equal any other NULL.

NULL isn’t a placeholder like a blank or space, or even zero. It’s an unknown value, so equals (and not equals) does not apply. Instead you need to use “Is NULL” or “Is Not NULL” for your comparisons.
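
A quick way to see the behavior for yourself (a throwaway sketch using a table variable I made up here):

declare @Customers table (CustomerID int, CustomerName varchar(50), SalesRepID int NULL)
insert into @Customers values (1, 'Acme', 10), (2, 'Globex', NULL)

-- Returns no rows: = NULL is never true under the default ANSI_NULLS setting
select CustomerID, CustomerName from @Customers where SalesRepID = NULL

-- Returns the customer with no sales rep assigned
select CustomerID, CustomerName from @Customers where SalesRepID Is NULL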

Common SQL Server Mistakes – Indexing Every Column

If one index helps speed up queries, then more indexes should help more, right? They can, but they also come at a price, both in performance during data modifications (insert/update/delete) and in space, since each index must be stored somewhere.

I have never bothered to index every column of a table. Actually I’m not sure if I’ve indexed every column of any table. Perhaps that’s because I originally came from a dBase/Clipper/Foxpro environment and I had to manage every index manually. Adding too many indexes resulted in a lot of coding.

In SQL Server, each index is useful in two ways. When someone puts a filter in the WHERE clause, like this:

select CustomerID, TerritoryID, AccountNumber
from Sales.Customer
where CustomerID = 10

then an index on CustomerID will speed up this query. Instead of having to scan all the rows of the table, the index on CustomerID can be searched to find the correct row. If this is the clustered index, the data can be read right there. If it’s a non-clustered index, the server retrieves the row locator (the clustering key, or the RID for a heap) and then fetches the data from the clustered index, again without reading all the rows in the table.
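
To make the non-clustered case concrete, here's a minimal sketch (AdventureWorks already ships with a similar index on this column; the name here is just illustrative):

-- With this index, a query filtering on TerritoryID can seek here and then
-- look up the remaining columns through the clustered index
create nonclustered index IX_Customer_TerritoryID_Demo
    on Sales.Customer (TerritoryID)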

In more recent versions of SQL Server, multiple indexes can be used. For example,

select CustomerID, TerritoryID, AccountNumber
from Sales.Customer
where CustomerID = 10
  and TerritoryID = 23

In AdventureWorks, there are indexes on both CustomerID and TerritoryID for this table. It is possible (with lots of data) that the optimizer might choose to scan the CustomerID index for all matching rows and the TerritoryID index for its matching rows, join those results together to get the set of rows for the overall query, and then read the clustered index for those specific rows. However, that’s not something you can count on in SQL Server; typically only one index per table is used for a given query.

So why not index every column?

First, if the columns are large, like varchar(max), text, or varbinary(max) columns, then it doesn’t make sense to build large indexes unless you often query these fields. Even then, a full-text index is likely a better choice.

Second, each time you change data (insert/update/delete), then all indexes must be updated at the same time. This means that your write performance suffers, and that impacts the server read performance as well since resources are being used to perform those updates. The more indexes, the more work that has to be done in support of any DML statement.

Lastly, you typically find that most of the time you query a table based on 3-5 fields, and those are the best candidates for indexes. For transactional tables, this is typically the number of indexes that you want to put on each table. Reporting tables, or OLAP type tables, might have more indexes, but these are tables that typically receive mostly read activity, and rare write activity.

Which columns do you index? Pick those columns that often appear in your queries and that are fairly selective. You can always query the missing index DMVs for help in seeing which indexes the optimizer thinks it might use.

Common SQL Server Mistakes – Multi Row DML Triggers

How often have you seen someone write a DML trigger like this:

create trigger mytrigger on Mytable for insert as

declare @id int
select @id = id from inserted

update xx set yyy = zz
where id = @id


There seems to be this common misconception that a trigger fires once for each changed row (insert/update/delete), and that’s not true. As noted in Books Online, triggers fire once per insert/update/delete statement, not once per row. Typically the statement runs in an implicit transaction. If you have multiple statements inside an explicit transaction, the trigger fires once for each insert/update/delete statement in that transaction.
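
A quick way to convince yourself of this (a throwaway sketch; the table and trigger names are mine):

create table dbo.TriggerDemo (id int not null)
go
create trigger trg_TriggerDemo_ins on dbo.TriggerDemo for insert
as
  declare @rows int
  select @rows = count(*) from inserted
  -- Prints once, reporting 3 rows, even though three rows were inserted
  print 'Trigger fired; rows in inserted = ' + cast(@rows as varchar(10))
go
insert into dbo.TriggerDemo (id) values (1), (2), (3)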

That means that if I change two rows and I have a single-row trigger like mytrigger above, I won’t get the behavior I expect from the trigger. Let’s say that I want to update my inventory table each time I change an order. Imagine that I have this in my orders table:


and this in my inventory table.


If I now write this trigger:

alter trigger orders_update_inventory on orders
  for update
as
  declare @qty int
     ,    @product int
  select @qty = a.qty - b.qty
    from inserted a
       inner join deleted b
         on a.orderid = b.orderid
  select @product = productid from inserted

  update inventory
   set onhand = onhand - @qty
   where productid = @product


and execute this:

update orders
  set qty = 2
  where orderid = '59CD85CE-984C-4D33-9E23-5F6159848277'

I will find that my inventory table looks like this:


That appears to work, but what happens if we execute this?

update orders
  set qty = qty + 1
  where [CustomerID int] = 2

Then we find that the inventory is


In this case product ID 1 has had its inventory reduced by 1, but product ID 2 has not. Why not?
When the trigger fires, the inserted table actually contains this data:

OrderID    OrderDate   CustomerID int  qty  productid
---------  ----------  --------------  ---  ---------
59CD…      2010-09-22  3               3    1
01B6…      2010-09-22  2               2    2

and the deleted table contains this:

OrderID    OrderDate   CustomerID int  qty  productid
---------  ----------  --------------  ---  ---------
59CD85…    2010-09-22  2               2    1
01B66E…    2010-09-22  2               1    2

However the trigger, in setting its variables from a query that returns two rows, could have picked either row, but only one. SQL Server doesn’t guarantee an order without an ORDER BY, so either product ID could have been chosen. As a result, only one of the products had its inventory updated.

A proper trigger would look like this

alter trigger orders_update_inventory on orders
  for update
as
  update inventory
   set onhand = onhand - (a.qty - b.qty)
   from inserted a
       inner join deleted b
         on a.orderid = b.orderid
   where inventory.productid = a.productid

Triggers should always be written to handle multiple rows, using the inserted and deleted tables for joins instead of variables. Even if you always just update single rows, coding this way will prevent issues if there is a multiple row change.

Common SQL Server Mistakes – Shrinking Databases

I don’t like there being an easy command to shrink databases, and I especially don’t like seeing the shrink option as a part of the default maintenance plans.

However it seems that this technique for managing sizes is used quite often, and even given as advice by some people. A few comments about this feature:

First, don’t regularly shrink databases. Actually, don’t shrink databases at all if you don’t understand what it does. Paul Randal, who managed the storage engine team, wrote a blog post about why not: Here’s a good reason not to run SHRINKDATABASE. The bottom line is that shrinking fragments your indexes, which increases reads and decreases performance.

If you are concerned about space usage, you have two choices: add less data or buy more space.

SQL Server database files aren’t like a Word or Excel file; they don’t just allocate space on disk as it’s needed. Well, they do if you have autogrow turned on, but really the files and the server expect to have free space in the data files for data growth, changes to data (and the potential page splits and new extent allocations that come with them), and for maintenance.

If you rebuild indexes regularly, and you ought to if they become fragmented, you need free space in your data files. An index rebuild copies the entire index to a new, un-fragmented set of pages and then drops the old index, so you need extra space roughly equal to the size of the index being rebuilt.

Managing space proactively is something you should do, and that means leaving a pad inside your data files to allow for data growth. If you don’t have enough disk space, buy more. You need the space for data, and for performance.

Transaction Log Files

Now the transaction log files are a slightly different story. You still want to size them correctly, and there are some good reasons from Mr. Randal on this as well. You should set your log file size based on the frequency of your log backups, and the backups are scheduled based on your risk tolerance. Basically, the more frequent the backups, the less transaction log space you need.

Regularly shrinking your log files doesn’t introduce index fragmentation, but it is dumb. Maybe not dumb, but it’s a waste of resources. Your server needs a t-log file of size xx to handle the regular activity on your server. Shrinking it at night and having it grow again the next day to handle the load is silly, and a waste of disk writes.

Set your log file size, manage it as needed, and don’t shrink it.

When to Shrink

So should you never shrink? No, you can shrink, but the feature is there for emergencies or one-time events. If I get a load of 500GB into my 1TB database once a year, I might get crazy log growth. I might plan for that by expanding my log in advance, and then shrinking the log afterward, back to the size I normally use.

The same thing could occur in a database file. Perhaps you move some data to a read-only database and want to get the space down to the data plus the largest index. Then you can shrink, rebuild the indexes, and leave the remaining free space in place. You can’t shrink down to just the data without fragmenting your indexes, so don’t try.

When you shrink, use DBCC SHRINKFILE and target specific files, for a specific reason, not as part of regular maintenance.
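
For example, a one-off shrink of a log file after an unusual load might look something like this (the database, file name, and target size are all made up here):

-- One-time operation, not scheduled maintenance: bring the log file
-- back to its normal working size (target size is in MB)
use SalesDB
dbcc shrinkfile (SalesDB_log, 8192)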

Common SQL Server Mistakes – SELECT *

I’ve been trying to work on some new presentations so that I have a variety, including some spares, when I go to events. One of the topics that I think has some value, especially for .NET and sysadmin groups, is a list of common mistakes, how to fix them, and why they’re bad.
I was going to call this Common Developer Mistakes, but I’m not sure that would go over well at Developer events, and I see DBAs making these mistakes along with Windows admins.
I decided to build a series of blog posts as I work through the presentation to document some of the issues, and help me work through speaking points. Please feel free to comment.

SELECT * Is For Short Term Use Only

The first mistake that I often see in application code is that too often people write things like

SELECT *
FROM Sales.Customer


What does this do? If I run this in my 2008 AdventureWorks database, I get something like this:


You can see that I end up with multiple columns (CustomerID, TerritoryID, CustomerType, rowguid, ModifiedDate). That’s handy, and cool, and allows me to get all the data in the table.

But do I really need it?

In most applications, my guess is that we don’t. Why do we need TerritoryID? That’s a foreign key to the SalesTerritory table, and typically what I want is the SalesTerritory.Name column instead of the ID value.

I could do this:

SELECT *
FROM Sales.Customer c
INNER JOIN Sales.SalesTerritory t
ON c.TerritoryID = t.TerritoryID

but that isn’t any better. Now I’ve returned even more columns, 10 more to be exact, including TerritoryID twice, once from each table. In AdventureWorks this is about 19k rows, and at a minimum this query has returned 19k rows x 4 bytes (the int TerritoryID, returned a second time) more data than it needs. That doesn’t sound like a lot, but what if this runs in your application 500 times a day? That’s a lot of wasted:

  • bandwidth
  • disk access
  • memory from caching
  • CPU work on the server AND client

I would also guess that most of the time when you access a customer, you don’t even want all the rows. Likely you want to filter this somehow, and you will with a WHERE clause, but it’s still wasted time and resources.
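
For contrast, a trimmed-down version of that join might look something like this (just a sketch; the column list and filter are examples, not a prescription):

SELECT c.CustomerID,
       c.AccountNumber,
       t.Name AS TerritoryName
FROM Sales.Customer c
INNER JOIN Sales.SalesTerritory t
    ON c.TerritoryID = t.TerritoryID
WHERE t.Name = N'Northwest'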

We know that the database is often a bottleneck. It’s a shared resource, it’s one machine, and it doesn’t scale as easily as multiple clients or web servers, or even developers, so we should avoid wasting its resources when we don’t have to.

What Do You Do?

Here’s what I recommend:

You can write this, and it’s what I often do:

SELECT TOP 10 *
FROM Sales.Customer

And I get a limited result set:


Why is this better? I do this so I can easily see the column names. I can then include those in my SELECT statement, with a quick rewrite.
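
The rewrite ends up as something like this (the particular columns are just an example of picking what the application actually needs):

SELECT CustomerID,
       AccountNumber,
       CustomerType,
       ModifiedDate
FROM Sales.Customer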


I could also quickly use the Object Explorer to find the columns like this:


And you can right-click the table, choose “Script Table as” and then “SELECT To”


and paste the code into your query window. The results would look something like this:


Alternatively, my employer, Red Gate Software, makes a fantastic product called SQL Prompt that will help you quickly grab columns. For me, I can do an SSF, and get a SELECT * FROM and then choose the table:


Note that I see the columns to the right. I could also just select the table with a tab, and then if I remove my asterisk, I get a list of columns I can easily pick:


SQL Prompt makes this easier, but it isn’t that hard to just do this by hand. You could easily grab the columns you need from SSMS and add them to queries.

The database is a limited resource, even if you have a 256 core server with 1024GB of RAM. You still want to query the data you need and only return what’s necessary. A little more effort when building code will pay off later with much better performing applications.

A few links from other people that see this as an issue as well.