Michael J. Swart

February 17, 2009

CHECKSUM_AGG, a very nifty function

Filed under: Technical Articles — Tags: , , — Michael J. Swart @ 7:55 am

So while poking around books on-line, I noticed the function CHECKSUM_AGG. Things I’ve noticed.

It only works on integer columns.
The first thing I noticed is that CHECKSUM_AGG only works on integers. In fact if you try to aggregate any other type of column, you’ll get an error like:

Msg 8117, Level 16, State 1, Line 1
Operand data type varchar is invalid for checksum_agg operator.

My disappointment lasted for four seconds before I realized that combining CHECKSUM_AGG with CHECKSUM works just as well.

CHECKSUM_AGG is just XOR.
So I noticed that during a sample of mine the CHECKSUM_AGG on a certain column would equal 0 when the number of rows was even. This led me to guess that CHECKSUM_AGG is just XOR. And sure enough, the following example demonstrates that it is:

CREATE TABLE #f (a FLOAT )
GO

INSERT #f VALUES (RAND())
GO 200

DECLARE @i INT
SELECT @i = 0
SELECT @i = @i ^ checksum(a) FROM #f
SELECT @i, CHECKSUM_AGG(CHECKSUM(a)) FROM #f
GO

DROP TABLE #f
GO

But is this a problem? Maybe. If there are columns that have a very limited range of values, then there is a greater chance for different row values to generate the same checksum. (In fact for bit columns, the chance is always fifty percent for arbitrary data no matter how many bit values are aggregated.) So if you’re using CHECKSUM_AGG as a quick way to determine whether changes have occurred in the data, then this solution has a chance of leading you to believe that there have been no changes when in fact there may have been.

What it’s useful for: QA
So after discovering a new tool, it would take a HUGE lack of curiosity for someone not to wonder what it can be used for. One thought naturally popped into my head. Software Development QA has a legitimate use for it. Especially for regression testing where we want to verify that an application behaves the same way as it did before:

PROS:

  • It’s quick, efficient and easy to implement.
  • It can be easily incorporated into a test suite or build process that supports db connectivity.

CONS:

  • It only gives yes/no answers as to whether data has changed. It will not give any other clues about where the data discrepancies are.
  • Because it’s based on XOR, it can give false positives when used to report that data in a table has not changed. With some effort I think this risk can be reduced.
  • The technique is also not very useful for columns such as rowversion, timestamp or datetime because those columns tend to vary with repeated tests. The same is true for identity columns. However, this is also true for any other technique of regression testing that compares data.

What it’s useful for: Schema comparisons

It’s also nice for keeping tabs on parts of a schema such as in this script:

DECLARE @check INT
SELECT @check = CHECKSUM_AGG(CHECKSUM(name)) ^ CHECKSUM_AGG(CHECKSUM(OBJECT_ID)) 
FROM sys.indexes

IF @check = 210550619
   PRINT 'schema has not changed'
ELSE
   PRINT 'schema has changed'
-- Note: This technique may not be as effective 
--       against sys.objects
--       because of automatically created statistics.

So now what?
Still with me? Great. So you may not need to use these immediately, but even so, it’s valuable as one trick to put into mental toolbelt.

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by WordPress