Tuesday, September 03, 2013

Actual Emails: TL;DR: Stop using varchar(max)

Wrote this email recently to a crew of developers who were shooting themselves in the foot with a database rich in varchar(max) data types.

Hey folks-
TL;DR: Stop using varchar(max). We’re not storing books.
We need to review and avoid the varchar(max) data type in our tables. Here’s a short treatise as to why.
In SQL Server, varchar(max) is intended to replace the old text data type, which was different from varchar(n) because it was designed to store massive amounts of data.  Massive being greater than 8000 characters, and all the way up to 2gb worth of data. That’s what the varchar(max), varbinary(max), and nvarchar(max) data types are optimized for – HUGE blocks of data in a single cell. We should only use it if we're actually intending to store massive text and use the fulltext indexing engine (a completely separate and specific topic for text blocks).
This is an oversimplification, but varchar(max) is designed to store data differently, and specially for large text blocks. It appears to behave the same as a varchar(n) field, and that’s deceptive when we are throwing 100-200 characters in each row field.
The big drawbacks biting us right now about varchar(max) have to do with indexing, and this is regardless of how much data is actually in a varchar(max) field. A varchar(max) column can’t be the key of a nonclustered index, even if it never stores more than 8000 characters, and can’t have ONLINE index maintenance performed.  As a result, it is generally a giant pain for indexing, a pain you only want to put up with if you absolutely have to.
Furthermore, we’re doing ourselves a disservice for performance, straight up. Unless you’re storing books, (max) hurts performance. Check out this blog post: http://rusanu.com/2010/03/22/performance-comparison-of-varcharmax-vs-varcharn/ 
In short, varchar(max) is burdensome overkill for our datasets.
So here’s the solution… Change varchar(max) to varchar(n), where n is an generous but appropriate number for that column’s data. If Excel creates varchar(max) columns for us when performing a data import wizard, change them to varchar(8000), which is the highest number you can assign to a varchar field.  Or better yet, once the data is in SQL, use this simple syntax to find out the max length of a column and then pad it.
For example: select MAX(LEN([yourcolumn])) from yourtable 
Problem is, our SSIS packages are all very picky about the data types and will break if we just change the data types. So, after making these table changes, you’ll need to open your SSIS package, open the data flow destination or other object, hit OK to apply the new metadata, save and deploy it again. No actual changes necessary.
This all came up because we have data quality issues with the fields Foo and Bar. Both of those columns are varchar(max). I’m dumping the varchar(max) data into temp tables with varchar(200) to get the queries to return in a reasonable amount of time. 
Let me know if you have any questions!

I like to use the word treatise to prepare my audience for verbosity.


Anonymous said...

The author is giving good advice--but misses one critical point.

What do you do when a field is intended to hold comments? Sure, 95% of comments will never even approach 8k in size, but what about the 10% that do (cumulative comments over time that need to be accessed and stored as a single reference and not rows of comments).

I only use varchar(max) for comments and notes, things I don't intend to index and that are rarely if ever the target of a WHERE clause.

Thing is, you never know when you create a comment field if it's going to exceed 8k.

Usually you don't need varchar(max). But when you do nothing else is acceptable.

Anonymous said...

Set a reasonable limit on the comment field that tells them how many characters they have remaining.

Anonymous said...

"(cumulative comments over time that need to be accessed and stored as a single reference and not rows of comments)." What creates this "need"? Why can't comments be in a one to many table that may even include a date stamp field? You may get one or two frustrated users who have to add a continuation record to their novella comments, but that's better than frustrating most of your users with the spinning hula-hoop when they try to do anything with the comment field.

Anonymous said...

Um, math? If I have a table with 105% of the rows being comments, I'm in some kind of alternate universe where three parellel lines join at right angles to form a perfect square and it smells like the color seven.

/close sarcasm

From what I've seen on comment boards of all kinds, a reasonable comment need never exceed 8k characters (not includeing code samples, of course).