# Hashing Collisions

One of the problems with hashing is that you can have collisions from values that are not very similar. This means that if you are using hashing as a way to identify similar values, you need to make further checks with the original data after the hash matches are gathered.

This post will show a few examples of the collisions that can occur if you use the CHECKSUM() or BINARY_CHECKSUM() functions.

If we examine this code:

-- Checksum declare @i varchar(200) , @j varchar(200); select @i = 'LE'; select @j = 'AAAAAAAAAAAAAAAALE'; select Plaintext = @i , checksum = CHECKSUM(@i) UNION ALL SELECT Plaintext = @j , checksum = CHECKSUM(@j); GO

This returns a result like this:

Note that these two values are the same as far as the checksum hash goes.

If we switch to BINARY_CHECKSUM(), we can get similar results.

-- binary_checksum is no better declare @i varchar(200) , @j varchar(200) , @k varchar(200); select @i = 'LE' select @j = 'Ou' select @k = 'MU' select Plaintext = @i , BINARY_CHECKSUM(@i) UNION ALL SELECT Plaintext = @j , BINARY_CHECKSUM(@j) UNION ALL SELECT Plaintext = @k , BINARY_CHECKSUM(@k) GO

While these two functions can be useful, you do have to be careful with the results. A matching hash from these functions does not mean that the source data is the same.

Comments are closed.

Then which occasions these 2 functions will be useful in realtime?

There are still multiple uses, but since I am mainly focussing on data warehousing I will give some examples there:

For example they are useful when you have a dimension with many attributes where only a few of them have to be a 100% correct and the others are allowed to be an older version of the data in case of a collision. In that case you could do a compare on the first few columns separately and then compare all the other ones with a checksum. This can potentially save a lot of memory in SSIS which increases data throughput.

Another one is when you have incoming data with many rows that you want to spread into multiple streams but you’re lacking a numeric value to use for this (a common practise to split a data stream is by using the modulo operator). Now you could use the checksum/binary_checksum to generate a numeric value which you can use together with the modulo to spread the data into multiple the streams.

There has a lot o research on hashing for databases and it is worth a look in the academic journals. A perfect hash is a function that produces a one-to-one mapping for unique values/ It is always possible to have a perfect hashing function for a fixed set of values. A minimal perfect hashing function is a perfect function that has no gaps in the hash values.

The advantage of a perfect hash over traditional tree indexes is that it is always one probe to fetch data no matter how large and diverse the data set A tree index has to travel down the levels of the tree to fetch data. As the size of the data set increases, so does the depth of the tree and therefore the number of probes.

SQL’s default hashing functions are notoriously simplistic in my opinion, and experience far more collisions than alternatives (such as the Fowler-Noll-Vo 1a implementation). Have you personally experienced and have any suggestions for alternative hashing functions with lower collisions rate and a good data type?

The FNV1a CLR implementation I use can be found at: http://www.sqlservercentral.com/articles/CLR/76962/

It certainly experiences alot less collisions than any numerical SQL hashing algorithm (at least on the data sets I tested it on), and since its a CLR, it can be computed in code, and passed to SQL already computed.

Obviously we could always go for a 64bit (or 128bit) implementation of something like MD5, but clearly thats not going to give you any major performance benefit (due to the required storage type for these hashing algorithms).