
Monday, May 11, 2015

Fresh New Horror: PK on a Float Data Type

Here's a new thing to be scared about in the night, as if you weren't already terrified of floats after my previous two posts about the ills of the approximate-number data types.

Creating a primary key on a float data type can lead to what appear to be duplicate primary key values. Attempting to convert these values from float to an integer data type results in a primary key violation.

In fact, that's how we found it at the client site. Having already discovered that floats were to blame for a randomly-appearing approximation problem that surfaced as business rule issues, we set about altering the column's data type to decimal(19,1). The error we received was "Violation of PRIMARY KEY constraint 'PK_invoicehdr'...".

Here's the grisly proof of good, well-intentioned people doing very bad things to an innocent database.

Parental Guidance: May contain some disturbing images. Viewer discretion advised.

[Exhibits 1 and 2: screenshots of the invoicehdr table design]
In Exhibits 1 and 2 above, we have the primary key declared on the invno column, which is a float data type. (If it wasn't already very clear, this is a really bad idea.)

You ready for it?


There is nothing wrong with your television set. Do not attempt to adjust the picture. This is very real.


Pretty bad, right? Remember, this is their PRIMARY KEY.

Let's recreate the problem by manually inserting data that would conflict, or rather, would not conflict.
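Here's a minimal sketch of the repro; the table and values below are hypothetical stand-ins for the client's data. Two float values that differ only out past the millionths place are distinct to a float primary key, but collapse into the same value when converted to decimal(19,1).

create table dbo.floatpk_demo (invno float not null primary key)
go
--two distinct float bit patterns that both round to 59202533.1
insert into dbo.floatpk_demo (invno) values (59202533.1000001)
insert into dbo.floatpk_demo (invno) values (59202533.1000002)
--both inserts succeed; the float PK sees two different values
go
select invno, invno_decimal = convert(decimal(19,1), invno)
from dbo.floatpk_demo
--both rows convert to 59202533.1
go
create table dbo.invoicehdr_fixed (invno decimal(19,1) not null primary key)
insert into dbo.invoicehdr_fixed (invno)
select convert(decimal(19,1), invno) from dbo.floatpk_demo
--Violation of PRIMARY KEY constraint...
go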


Turns out, in a table with millions of rows, we found only three instances where this was happening. I'm shocked there weren't more - Murphy's Law ensured it would happen at all, but out of millions of rows it should have struck more than three times. Still, pretty bad.


Not surprisingly, this has led to a pretty significant business rules problem.

Still working out the details of the fix to the database to recode those numbers before we try to convert the data types to decimal. And still working on changes to the offending .NET code (also probably using float data types).

Wednesday, April 23, 2014

Real heavyweights: Float vs Decimal, the Thrilla in Precision

This is a followup to a previous post where I likened the SQL Server float data type to Muhammad Ali. Specifically, I wanted to put the phrase "negligible data storage increase" to the test for different numeric data types.

In short, I will dismiss the notion that float provides a realistic advantage in storage of large numbers. (Also, I had a blog title idea too catchy to pass on.)

I'll avoid recapping in total my earlier post about the dangerous consequences of rounding with the float data type. The float data type is still popular with developers and database designers because of its supposed advantages in data storage size. (Also, some code-first modelers use float as a default. Also dangerous. And lazy.)

Too often, developers and database designers ignore the large potential for inconsistent rounding with float (and real) data types, whereas a decimal(p,s) data type is not only consistent in storage, but also communicates transparent logical intent to the development team. No surprises with truncation of numerals to the right of the decimal point with the decimal data type.
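If you want a quick taste of that inconsistency, here's a minimal sketch: the classic binary floating point sum of 0.1 and 0.2 fails an equality test that decimal passes.

select
	float_compare = case when convert(float, 0.1) + convert(float, 0.2) = convert(float, 0.3)
		then 'equal' else 'not equal' end,	--returns 'not equal'
	decimal_compare = case when convert(decimal(9,1), 0.1) + convert(decimal(9,1), 0.2) = convert(decimal(9,1), 0.3)
		then 'equal' else 'not equal' end	--returns 'equal'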

Here's my test lab for a number with 3 digits to the right of the decimal point.

create table realtest (real1 float(24)) --float(24) is a synonym for real
create table floattest (float53 float) --float(53) is the default
create table decimal9test (decimal9 decimal(9,3))
create table decimal19test (decimal19 decimal(19,3))
go
insert into realtest (real1) values (123456.123)
insert into floattest (float53) values (123456.123)
insert into decimal9test (decimal9) values (123456.123)
insert into decimal19test (decimal19) values (123456.123)
go
insert into realtest (real1) select real1 - 1. from realtest
insert into floattest (float53) select float53 - 1. from floattest
insert into decimal9test (decimal9) select decimal9 - 1. from decimal9test
insert into decimal19test (decimal19) select decimal19 - 1. from decimal19test
go 21 --each pass doubles the row count: 2^21 = 2,097,152 rows per table
select tablename, row_count = SUM(row_count), sizemb = SUM(sizemb) from (
select sizemb = (p.in_row_reserved_page_count*8.)/1024.,
row_count = p.row_count,
tablename = o.name, indexname = i.name
from sys.dm_db_partition_stats p
inner join sys.objects o on p.object_id = o.object_id
inner join sys.indexes i on i.object_id = o.object_id and i.index_id = p.index_id
where o.is_ms_shipped = 0
) x
where tablename in ('realtest', 'floattest', 'decimal9test', 'decimal19test')
group by tablename
order by sizemb desc
go
select top 1 * from realtest
select top 1 * from floattest
select top 1 * from decimal9test
select top 1 * from decimal19test
go

First, the resultsets showing the storage of 2+ million rows in a single-column table.

tablename      row_count  sizemb
realtest       2097152    26.382812
floattest      2097152    34.445312
decimal9test   2097152    28.382812
decimal19test  2097152    36.507812

Here are the values that were actually stored; notice the important differences.

real1      123456.1
float53    123456.123
decimal9   123456.123
decimal19  123456.123

You can see that the ~10MB we saved with the real (or float(24)) data type isn't much good because of the catastrophic loss of precision. And even with 2 million rows, float saved us only about 2MB over decimal(19,3); the decimal(9,3) table is actually smaller than the float table.

Now, let's run the same test, this time with the value 123456.1234567. (The decimal9 table has been eliminated from this round because decimal(9,3) tops out at precision 9 and can't hold seven decimal places.)

create table realtest (real1 float(24) )
create table floattest(float53 float) --float(53) is the default
create table decimal19test(decimal19 decimal(19,7) )
...
tablename      row_count  sizemb
realtest       2097152    26.382812
floattest      2097152    34.445312
decimal19test  2097152    36.507812

real1      123456.1
float53    123456.1234567
decimal19  123456.1234567

Note again that real is wholly unacceptable with its glaring loss of precision, while float and a properly-aligned decimal data type store the data appropriately, and without a significant difference in storage (less than 6%).

Now, to slam the door on float.

Let's apply DATA_COMPRESSION = PAGE to this data.
create table realtest (real1 float(24)) with (data_compression = page)
create table floattest (float53 float) with (data_compression = page)
create table decimal19test (decimal19 decimal(19,7)) with (data_compression = page)
...

tablename      row_count  sizemb
floattest      2097152    26.382812
decimal19test  2097152    24.320312

We can clearly see that page compression reverses the small advantage that floattest enjoyed.

Since page data compression is a very popular, very widespread and very useful tool on modern Enterprise edition SQL Servers, we can rule out float as having any space saving advantage. In short, float loses the last ground it stood on.

The real and float data type columns only compressed by 15% and 23% respectively. The fixed decimal data type compressed, in this case, by a third (36.5MB down to 24.3MB).
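You don't have to drop and recreate existing tables to get these savings, either. A minimal sketch, reusing the demo tables above, applies page compression in place with a rebuild:

alter table floattest rebuild with (data_compression = page)
alter table decimal19test rebuild with (data_compression = page)
--then re-run the sys.dm_db_partition_stats size query above to compare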

If we scale the page compressed data comparison up to 134 million rows...

tablename      row_count    sizemb
floattest      134217728    1686.070312
decimal19test  134217728    1556.015625

...we confirm that floattest has lost its so-called space savings advantage, and is more than 8% larger than the same data in a proper decimal data type.

Tuesday, September 03, 2013

Actual Emails: TL;DR: Stop using varchar(max)

Wrote this email recently to a crew of developers who were shooting themselves in the foot with a database rich in varchar(max) data types.

Hey folks-
TL;DR: Stop using varchar(max). We’re not storing books.
We need to review and avoid the varchar(max) data type in our tables. Here’s a short treatise as to why.
In SQL Server, varchar(max) is intended to replace the old text data type, which was different from varchar(n) because it was designed to store massive amounts of data.  Massive being greater than 8000 characters, and all the way up to 2gb worth of data. That’s what the varchar(max), varbinary(max), and nvarchar(max) data types are optimized for – HUGE blocks of data in a single cell. We should only use it if we're actually intending to store massive text and use the fulltext indexing engine (a completely separate and specific topic for text blocks).
This is an oversimplification, but varchar(max) is designed to store data differently, optimized specifically for large text blocks. It appears to behave the same as a varchar(n) field, and that’s deceptive when we are throwing 100-200 characters in each row field.
The big drawbacks biting us right now about varchar(max) have to do with indexing, and this is regardless of how much data is actually in a varchar(max) field. A varchar(max) column can’t be the key of a nonclustered index, even if it never stores more than 8000 characters, and can’t have ONLINE index maintenance performed.  As a result, it is generally a giant pain for indexing, a pain you only want to put up with if you absolutely have to.
Furthermore, we’re doing ourselves a disservice for performance, straight up. Unless you’re storing books, (max) hurts performance. Check out this blog post: http://rusanu.com/2010/03/22/performance-comparison-of-varcharmax-vs-varcharn/ 
In short, varchar(max) is burdensome overkill for our datasets.
So here’s the solution… Change varchar(max) to varchar(n), where n is a generous but appropriate number for that column’s data. If the import wizard creates varchar(max) columns for us when bringing data in from Excel, change them to varchar(8000), which is the highest number you can assign to a varchar field.  Or better yet, once the data is in SQL, use this simple syntax to find out the max length of a column and then pad it.
For example: select MAX(LEN([yourcolumn])) from yourtable 
Problem is, our SSIS packages are all very picky about the data types and will break if we just change the data types. So, after making these table changes, you’ll need to open your SSIS package, open the data flow destination or other object, hit OK to apply the new metadata, save and deploy it again. No actual changes necessary.
This all came up because we have data quality issues with the fields Foo and Bar. Both of those columns are varchar(max). I’m dumping the varchar(max) data into temp tables with varchar(200) to get the queries to return in a reasonable amount of time. 
Let me know if you have any questions!
 William 

I like to use the word treatise to prepare my audience for verbosity.
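Here's a quick demo of the indexing limitation from the email; the table is a hypothetical sketch, but the error is the real one SQL Server raises:

create table dbo.maxdemo (id int identity primary key, notes varchar(max))
go
create nonclustered index IX_maxdemo_notes on dbo.maxdemo (notes)
--Msg 1919: Column 'notes' in table 'dbo.maxdemo' is of a type that is
--invalid for use as a key column in an index.
go
--varchar(max) is allowed as an included column, just never as an index key
create nonclustered index IX_maxdemo_id_notes on dbo.maxdemo (id) include (notes)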

Tuesday, December 04, 2012

On the Advantages of DateTime2(n) over DateTime

Starting with SQL 2008, we database developers became more familiar with datetime2.  Sometimes folks need convincing though, so here goes.

Here's a brief review of how datetime2 converts a varchar time value at each precision, out to one ten-millionth of a second.  Run this script yourself:

declare @datetime varchar(50) = '01/01/2012 11:11:11.1111111'

select        @datetime
select        convert(datetime2(0), @datetime)
select        convert(datetime2(1), @datetime)
select        convert(datetime2(3), @datetime)
select        convert(datetime2(4), @datetime)
select        convert(datetime2(7), @datetime)
select        convert(datetime2, @datetime) --default is 7


Want to do the same conversion with datetime or smalldatetime?  Can't.

Msg 241, Level 16, State 1, Line 10
Conversion failed when converting date and/or time from character string.

The old data types can't handle that much precision.  Gotta dial it down for the old datetime and smalldatetime types.  How quaint.

declare @datetime varchar(50) = '01/01/2012 11:11:11.111'
select        convert(datetime, @datetime)
select        convert(smalldatetime, @datetime)
select        convert(datetime2(0), @datetime)
select        convert(datetime2(1), @datetime)
select        convert(datetime2(3), @datetime)


Note how the old data types are incapable of storing precision out to one one-thousandth of a second: datetime rounds .111 down to .110 (it can only store increments of .000, .003, and .007), and smalldatetime throws away the seconds entirely.

How about date ranges?

datetime: 1753-01-01 through 9999-12-31
smalldatetime: 1900-01-01 through 2079-06-06
datetime2: 0001-01-01 through 9999-12-31


Now the kicker. What's the cost of storing all that extra precision in datetime2? Nothing. You can get more precision than datetime and fewer bytes per row per field by specifying a precision value for columns declared as datetime2(n).

For example, datetime2(2) stores hundredths of a second - realistically the same precision as datetime, which rounds the third digit to the right of the decimal point. And datetime2(2) is two bytes smaller than datetime, making it ideal.

Don't need seconds, just hours and minutes? Stick with smalldatetime, 4 bytes, as opposed to datetime2(0) at 6 bytes.

Storage requirements 

smalldatetime:
4 bytes - precision to the minute (seconds are always :00)

datetime2(n):
6 bytes for precisions less than 3 - precision up to one hundredth of a second
7 bytes for precisions 3 and 4 - precision up to one ten thousandth of a second
8 bytes for precisions > 4 - precision up to one ten millionth of a second (within 100 nanoseconds)

datetime:
8 bytes - stores three digits to the right of the decimal point, but rounded to increments of .000, .003, or .007 seconds
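You can verify these storage sizes yourself with DATALENGTH, which returns the bytes a value occupies. A minimal sketch:

declare @now datetime2(7) = sysdatetime()
select	smalldatetime_bytes = datalength(convert(smalldatetime, @now)),	--4
	datetime_bytes = datalength(convert(datetime, @now)),		--8
	datetime2_0_bytes = datalength(convert(datetime2(0), @now)),	--6
	datetime2_4_bytes = datalength(convert(datetime2(4), @now)),	--7
	datetime2_7_bytes = datalength(@now)				--8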

Clearly, datetime2 is an upgrade in range of values, precision (no rounding!) and storage size over datetime.

And that's only if you need to store date and time info. Since SQL 2008, we've also been able to store mm/dd/yyyy data in the date data type (3 bytes), and discrete hh:mm:ss in the time data type (5 bytes).

Oh yeah, and even though datetime is not deprecated, this friendly yellow box might make you think so.

Note: Use the time, date, datetime2 and datetimeoffset data types for new work. These types align with the SQL Standard. They are more portable. time, datetime2 and datetimeoffset provide more seconds precision. datetimeoffset provides time zone support for globally deployed applications.

.net developers?  Datetime and Datetime2(n) both map to System.DateTime.  No worries there.  More info here: http://msdn.microsoft.com/en-us/library/bb675168.aspx

HOWEVER, you should probably be using datetimeoffset, not just datetime2. Datetimeoffset works similarly to datetime2, but adds 2 bytes to store a ±hh:mm offset. While this doesn't store the time zone itself (CDT, EST, etc.), it does store the offset. I've given a joint presentation on this topic with a colleague that goes into more detail. (update 7/31/2017)
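A minimal sketch of working with the offset (the function names are standard T-SQL; the sample offsets are arbitrary):

select sysdatetimeoffset()				--current date and time plus the server's offset
select todatetimeoffset(sysdatetime(), '-05:00')	--stamp an offset onto an offset-less value
select switchoffset(sysdatetimeoffset(), '+00:00')	--the same instant, expressed at UTC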

Friday, March 18, 2011

Float like an approximation; stings like a bee

I distinctly remember having to tell my CIO and a VP of Finance in 2005 that the reason their reports were off by one cent from the ledger was that the person who designed their reporting database had chosen the float data type instead of decimal.  The implications of that one-cent variation were astounding to me, but I wasn't a business finance guy.
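I no longer have that reporting database, but the drift is trivial to reproduce. A minimal sketch that accumulates ten thousand one-cent values in both data types:

declare @f float = 0, @d decimal(19,4) = 0, @i int = 1
while @i <= 10000
begin
	select @f = @f + 0.01, @d = @d + 0.01, @i = @i + 1
end
select float_total = @f, decimal_total = @d
--decimal_total is exactly 100.0000, while float_total shows binary drift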

Wednesday, December 15, 2010

A word on the decimal data type

Was inspired to write up a few notes on the SQL Server decimal data type.  This is pretty basic stuff, but everyone should know it.
If you declare a column as DECIMAL with no arguments, it will default to precision 18 and scale 0, i.e. decimal(18,0).  Not very useful, is it?  You might as well have used a bigint column to store integer values like that.  So first off, when you declare a decimal column, always give it precision and scale.
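A minimal sketch of the pitfall (the demo table name is mine):

create table dbo.decimal_default_demo (d decimal)	--silently becomes decimal(18,0)
insert into dbo.decimal_default_demo (d) values (123.456)
select d from dbo.decimal_default_demo			--returns 123; the fraction is rounded away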

Monday, April 12, 2010

LINQ doesn't like TINYINT PK IDENTITY fields

So, to save space (and it is a bit archaic, but why not) I sometimes use tinyint and smallint for the PK IDENTITY columns in small lookup tables.

For example, a list of counties or congressional districts for a state government project.  Will never be more than 255 or negative.  Tinyint.  1 byte instead of 4.
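Something along these lines (a hypothetical sketch):

create table dbo.county (
	county_id tinyint identity(1,1) not null primary key,	--1 byte instead of 4
	county_name varchar(64) not null
)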

Yeah yeah, I'm aware that in the long run, that adds up to very little saved disk space.  Very little.  Still, I can't help but be tight with data type constraints.  It's not like I will be allowing every column to be declared varchar(8000), much less varchar(max), even when "disk space is cheap".  It's not like I'll be using bigint instead of int when I know I'll never get more than 2 billion records.

Except that when your developers want to use LINQ, you can't use a TINYINT IDENTITY column. Stupid error.

Changed it, begrudgingly, to smallint.  2 bytes instead of 4.  Grrr...

Read more:
http://linqinaction.net/blogs/roller/archive/2007/11/28/linq-to-sql-doesn-t-support-identity-tinyint.aspx