Wednesday, February 05, 2014

Actual Emails: More Grief for GUIDs

Wrote this email exchange with some developer colleagues about to embark on performance tuning.
Subject: bag o' tricks from DMV talk
From: A. Developer
Hey William,
We are about to spend around 4 weeks performance tuning an application we built. Much of the performance tuning is going to be the database. Could you send over those dmv queries that you used in your DMV talk? Or do you still have them available for download? I think they will be able to help us out quite a bit.
I know one of the big things you talked about is not using GUIDs, however, we are using them because of replication. Do you have any thoughts on helping ease the fragmentation because of this?

From: William Assaf

Yep, I have all that info here. This is the last time I gave that talk:
Of course, if you have any questions, let me know.

So, as you know, I hate GUIDs because they are 4x as large as an integer yet serve the same purpose, or 2x as large as two integer columns to allow for multiple device synchronization.

But the biggest problem with GUIDs can happen when the first key of the clustered index of a table is a GUID column. With the creation of the new GUIDs, are you doing random GUIDs or sequential GUIDS?

If you’re creating them with a SQL default value (which you’re probably not, but as an example), this would be the difference between the newid() function (bad) and the newsequentialid() function (not as bad).

Using sequential GUIDs can allow you to create a clustered index that actually make some sense when it is ordered, and can have index maintenance performed on it to REBUILD or REORGANIZE, reducing fragmentation. Problem is, when you restart the SQL Service, the sequence also resets. So you won't have one contiguous string of sequentially-inserted GUIDs in a table over its lifetime. 

On random GUIDs, you can REBUILD or REORGANIZE all you want, but the data order still won’t make any sense. Changing from random to sequential GUIDs may be really easy to change in your database or application code. If you’re already using sequential GUIDs, there’s not really much more you can to do mitigate the performance and storage letdowns of GUIDs that you would not also do on tables that use integer IDENTITY columns.

As for basic indexing, run this script first to find any tables that are still heaps. Those should be the first things you target in your performance tuning phase.

If I can help or get in on that performance tuning stuff, let me know! Good luck, keep in touch.

From: A. Developer
Thanks for the info.
One question though. Will sequential GUIDs work with replication? Wouldn't there be collisions if they are sequential?

From: William

So, in theory, is it mathematically "unlikely" that GUIDs will have collisions. That math always troubles me when multiple machines are generating GUIDs, though the network card is used to generate a unique seed for the GUID on each machine. It is even more troublesome when GUIDs are generated sequentially, but we are supposed to trust the math. :/ 
I’m not super knowledgeable about that math but from what I know, your concern is understandable but “unlikely” to be an issue. I know, not comforting. Here is some comforting reference material though. 
Uses words like “practically guaranteed” and “virtually impossible” for sequential GUID collisions.

I welcome comments and additions to this somewhat religious topic. The "oil rig" scenario that is commonly faced in the Gulf South provides a recurring fresh argument for/against GUIDs. In many scenarios, a pair of integer fields can provide the benefits of sequential, unique keys. Assuming one of the pairs is also the rig/device/source of the data, you also have a built-in foreign key constraint in the unique key, something you would have to store anyway in a table keyed on a GUID.


@SQLTuna said...

I don't think it's fair to "hate" GUIDs. They have their uses and their pitfalls the same as anything else. For example, they're not guessable therefore they add a level of security. They can be generated outside the database, meaning potentially fewer database calls. They provide a global uniqueness, giving the ability to merge datasets with virtually no risk of clashes. Conversely, they're larger than an integer so take up more space, and will cause fragmentation if used as the clustering key etc.

They're extremely useful as an ID for large-scale sharded environments, where scaling up and down is a requirement.

Know when to use them, but don't hate them ;)

Anonymous said...

It is fair to hate a technology that you believe, based on experience and knowledge of that technology, is a poor choice yet is regularly chosen for all the wrong reasons with disregard for the consequences. When the same folks that choose the technology against general or even explicit advice, then ask for help in masking their poor choice, all the while touting the benefits to justify their choice, it makes one wonder how the folks making these decisions are allowed to keep their jobs. Something as fundamental as a data type which will underpin an entire system, and the problems GUIDs bring with them, it has GUIDs falling into that category for a lot of people.

Anonymous said...

My background (to add context to my post): I work in dimensional data warehousing in a SQL Server environment (SQL, SSAS, etc). What I have found over time is to listen to something that Joe Celko talks about: no automatic generation of a key field. As you (might) know, he completely dislikes IDENTITY, and I would believe NEWID() as well. The issue is that the field is not tied to the record in any meaningful way.

Once I actually understood the reason why he was right, I started on an approach that I've labeled as a "deterministic key generation". It is deterministic in that the same value is produced no matter when or what instance of SQL it was run on.

Having spent time researching this stuff on line, I found the best of all possible worlds was to surround a HASHBYTES (MD5) function with a CHECKSUM. The "seed" to HASHBYTES is the natural key of the record you are dealing with: in my environment it is typically a "standards" based name of some equipment. The CHECKSUM does its own HASH, but produces an INT that I then use as the surrogate key. I did a lot of testing of that to determine if I would run into any collisions (CHECKSUM isn't exactly the best HASH algorithm). The only situations where I did have collisions was when the seed names were mostly numbers. When it was a unique name with few numbers in it, it worked well. To date, I have not had a collision using that approach.

@SQLTuna said...

Seeing GUIDs used in the wrong way means you should educate those people who implemented them. Everyone makes mistakes throughout their career, and often they are valuable learning experiences and defining moments. Yes, those in senior positions shouldn't make those fundamental errors, but that said, it doesn't give reason to hate GUIDs. If a taxi driver kept turning up to your door on a motorbike, in the rain, would you hate motorbikes or would you educate the driver that a car might be better? Sorry for the poor analogy, but hopefully the message is clear.

It's an interesting approach you've mentioned. Checksum'ing a hash of the natural key does not produce a surrogate key. By definition it's not a surrogate, it's a derived value. You gain a level of data integrity as you can reproduce it, but lose a level of performance through the overhead of deriving the value. You'd still have to find a different clustering key too. But deriving a key is a good mechanism for partitioning data if you have a good distribution across the INT range (check out the Jenkins hash too as that's very effective).

Just to play Devil's Advocate - why does a field need to be tied to a record in any other way than its existence IN the record? All a record is is a collection of fields. A random key plays the same role as a generated one. The trick isn't giving that key meaning, it's finding the correct implementation to suit your requirements.

Anonymous said...

This will be short, I'm at work so I don't have a lot of time.

Regarding "By definition it's not a surrogate, it's a derived value". I tend to use "surrogate" more than I should, but in database lexicon, it is technically an Artificial Key. I pulled this out of a Celko discussion "A surrogate key is created by the database system for its own use and it is never exposed to the user. Artifical keys are created by the users and they are exposed. These two terms get confused." I have also referred to it as a synthetic key.

The first entry in this forum has a fairly decent coverage of why Celko dislikes IDENTITY (a randomly generated key):

The point he keeps making that most people miss is that SQL is used to define a logical model that implements SETs and relational algebra. IDENTITY is an artifical construct thrown on top of the model, and in some ways weakens or breaks the model (depending on how you use it).

The very subject of this blog posting is about dealing with implementing a weakened logical model: a randomly generated field added to the model causes implementation problems (CLUSTER INDEX) or creates issues with moving the model to another instance.

@SQLTuna said...

I admire your efforts in trying to implement a relational "utopia"!! Genuinely. That's the basis of Joe Celko's writing really - sticking with the SQL-92 standard and creating perfect logical models. In practice it's very difficult to do that, and in doing so you also miss out on the massively beneficial vendor-specific features. Yes, they tie you to that platform - but how often do you port solutions between platforms? I personally think the benefits outweigh the cost. And in fact a very good point is made in the comments of that post about Ralph Kimball, one of the leading theorists on Data Warehousing, who advocates the use of meaningless surrogate keys in dimensional modelling.

I guess we've moved away from the basis of this article, but it's certainly an interesting topic. There will always be people who sit in opposed camps on this, as there are arguments for and against. I'm an advocate of using the means at your disposal, and the ones correct for the job. If that means GUIDs, then I use GUIDs. Knowing the pitfalls of them mitigates the risk of using them, so for me the key is arming yourself with knowledge.

Perhaps you should put together an article on your approach to generating artificial keys. Putting a framework out there for dimensional modelling that is more closely aligned to the SQL-92 standard would make good reading.

w said...

Great discussion guys! Thanks for contributing, SQLTuna and "anonymous".

I regret publishing the word "hate" because I... dislike... that word, too strong for any case, much less a tech preference. :) So my apologies there, but it was in my original email to a colleague.

Aside from that regret, I have to reinforce that I prefer sequential numeric surrogate keys (so IDENTITY int or IDENTITY bigint if absolutely necessary) in all tables, in both normalized relational and dimensional designs. I'm not the one bringing him up, but I would disagree with Celko there.

I have heard the argument before that GUIDs provide a shield against incremental "guessing" of IDs, but I would counter that a surrogate primary key or primary key pair should never be exposed, on a screen or in a plain text cookie or URL path or otherwise, and that you may have bigger security problems if someone is able to sequentially scroll your data.

Thanks for keeping your discourse collegial and professional (and informative!), rising above the GUID fisticuffs that I've seen before. :)