Tuesday, September 03, 2013

Actual Emails: TL;DR: Stop using varchar(max)

Wrote this email recently to a crew of developers who were shooting themselves in the foot with a database rich in varchar(max) data types.

Hey folks-
TL;DR: Stop using varchar(max). We’re not storing books.
We need to review and avoid the varchar(max) data type in our tables. Here’s a short treatise as to why.
In SQL Server, varchar(max) is intended to replace the old text data type, which was different from varchar(n) because it was designed to store massive amounts of data.  Massive being greater than 8000 characters, and all the way up to 2gb worth of data. That’s what the varchar(max), varbinary(max), and nvarchar(max) data types are optimized for – HUGE blocks of data in a single cell. We should only use it if we're actually intending to store massive text and use the fulltext indexing engine (a completely separate and specific topic for text blocks).
This is an oversimplification, but varchar(max) is designed to store data differently, and specially for large text blocks. It appears to behave the same as a varchar(n) field, and that’s deceptive when we are throwing 100-200 characters in each row field.
The big drawbacks biting us right now about varchar(max) have to do with indexing, and this is regardless of how much data is actually in a varchar(max) field. A varchar(max) column can’t be the key of a nonclustered index, even if it never stores more than 8000 characters, and can’t have ONLINE index maintenance performed.  As a result, it is generally a giant pain for indexing, a pain you only want to put up with if you absolutely have to.
Furthermore, we’re doing ourselves a disservice for performance, straight up. Unless you’re storing books, (max) hurts performance. Check out this blog post: http://rusanu.com/2010/03/22/performance-comparison-of-varcharmax-vs-varcharn/ 
In short, varchar(max) is burdensome overkill for our datasets.
So here’s the solution… Change varchar(max) to varchar(n), where n is an generous but appropriate number for that column’s data. If Excel creates varchar(max) columns for us when performing a data import wizard, change them to varchar(8000), which is the highest number you can assign to a varchar field.  Or better yet, once the data is in SQL, use this simple syntax to find out the max length of a column and then pad it.
For example: select MAX(LEN([yourcolumn])) from yourtable 
Problem is, our SSIS packages are all very picky about the data types and will break if we just change the data types. So, after making these table changes, you’ll need to open your SSIS package, open the data flow destination or other object, hit OK to apply the new metadata, save and deploy it again. No actual changes necessary.
This all came up because we have data quality issues with the fields Foo and Bar. Both of those columns are varchar(max). I’m dumping the varchar(max) data into temp tables with varchar(200) to get the queries to return in a reasonable amount of time. 
Let me know if you have any questions!

I like to use the word treatise to prepare my audience for verbosity.