Monday, August 08, 2011

Info and Usage of new COLUMNSTORE indexes

COLUMNSTORE indexes is one of the new features of SQL Denali which is currently in CTP3 and available for public consumption.

COLUMNSTORE indexes are nonclustered indexes store data in a column by itself in its own pages, a strategy to make certain types of access to this data in an isolated environment more efficient.  This cost benefit scales with massive rowcounts.

COLUMNSTORE indexes are intended for data warehouses or dedicated OLTP reporting systems, NOT traditional OLTP applications.  There is staggering limitation in the use of COLUMNSTORE indexes that force your hand when considering where to use them.

COLUMNSTORE indexes make your table read-only


Msg 35330, Level 15, State 1, Line 1
INSERT statement failed because data cannot be updated in a table with a columnstore index. Consider disabling the columnstore index before issuing the INSERT statement, then rebuilding the columnstore index after INSERT is complete.

That is to eliminate the overhead that would be required to maintain the structure at scale.  You can drop, create, and rebuild a COLUMNSTORE index just like you'd do on a "regular" nonclustered index.

How does this impact SQL Server maintenance plans?  

You cannot perform a COLUMNSTORE  index REBUILD with the ONLINE = ON option, or performa  REORGANIZE on a COLUMNSTORE index.

A "Reorganize Index" step in a SQL maintenance plan will skip a COLUMNSTORE index, so that the reorg won't fail.  Also new in the Denali "Rebuild Index" step is a radio button that allows you to request indexes to be rebuilt online but - and this is the new part - to NOT rebuild the index if it doesn't support an ONLINE rebuild.  COLUMNSTORE isn't the only thing that won't rebuild online - a lot of other index types (such as xml) and data types (such as varchar max) won't rebuild ONLINE.  Again, this is great for maintenance plans - they won't choke on known issues with index maintenance options like WITH( ONLINE = ON).

(The next obvious evolution of maintenance plans is to easily perform the nightly maintenance more intelligently  per MS's own recommendations - anything with a 30% or greater fragmentation level, rebuild online.  Anything that won't rebuild ONLINE, reorg.  Doing an offline REBUILD, even in the middle of the night, just isn't a possibility in high-pressure SQL Servers.  The offline REBUILD will hang and block someone notices in the morning.)

Except that in Denali CTP3, a REBUILD maintenance plan with ONLINE task does NOT skip a COLUMNSTORE index.  It bombs.

This is a reported bug on MS Connect.

Executing the query "ALTER INDEX [IDX_NC_cstest] ON [dbo].[cstest] REBU..." failed with the following error: "alter statement failed because the operation cannot be performed online on a table with a columnstore index. Perform the operation without specifying the ONLINE option or drop (or disable) the columnstore index before performing the operation using the ONLINE option."

Let's dig into the some of the performance differences between traditional nonclustered indexes and COLUMNSTORE indexes, keeping in mind the read-only limitation.


Versus


You can create both a nonclustered index and a nonclustered COLUMNSTORE index on the same column.  The optimizer will choose which one to use based on situations.  Queries in a scenario that present situations typically requiring a seek will continue to perform a seek.

In this scenario, I have a query that selects roughly 1/3 of the values out of a narrow 12 million row table with an integer IDENTITY(1,1) column as the primary key clustered index and a single column [cschar] varchar(10).


CREATE NONCLUSTERED INDEX IDX_NC_cstest ON dbo.cstest (cschar)

go
select  cschar from cstest where cschar ='a'
go

CREATE NONCLUSTERED COLUMNSTORE INDEX IDX_CS_cstest_cschar ON dbo.cstest (cschar)
go
select  cschar from cstest where cschar ='a'



With a cost of 2.83 vs the 11.01 of the traditional nonclustered index, the Columnstore Index clearly won, even though it chose to perform a scan instead of a seek.

Here's where the COLUMNSTORE can really show off though.  Lets add a simple aggregate:

DROP INDEX IDX_CS_cstest_fragtext ON dbo.cstest
go
select avg(len(cschar)) from cstest
go
CREATE COLUMNSTORE INDEX IDX_CS_cstest_fragtext ON dbo.cstest (cschar)
go
select avg(len(cschar)) from cstest
go


(click to enlarge)

The cost of the nonclustered index scan and its following actions: 25.9
The cost of the COLUMNSTORE Index Scan: 5.69

Again, those tests were performed on a dataset of 12.5 million rows, 4.2 million of which had the value of cschar = 'a'.

Lets try a data set where we would expect a SEEK to an obvious choice versus a SCAN.

DROP INDEX IDX_CS_cstest_cschar ON dbo.cstest
go
select top 1 id from cstest where cschar = 'a'
go
CREATE COLUMNSTORE INDEX IDX_CS_cstest_cschar ON dbo.cstest (cschar)
go
select top 1 id from cstest where cschar = 'a'


The optimizer decided to do a scan given the availability of the COLUMNSTORE index, but the result wasn't  better.

Cost of the traditional nonclustered index seek execution plan: .0032832
Cost of the COLUMNSTORE nonclustered index scan execution plan: .0032876

A curious choice by the optimizer but this is showing what the COLUMNSTORE isn't meant for - picking the needle out of the haystack.  Keep this in mind when putting COLUMNSTORE on your tables.

One more example.  Lets make it painfully obvious that we want a SEEK by deleting all but one of the cschar = 'a' records.




DROP INDEX IDX_CS_cstest_cschar ON dbo.cstest
go
delete from cstest where id <> (select min(id) from cstest where cschar = 'a') and cschar = 'a'
go
select cschar from cstest where cschar ='a'
go
CREATE COLUMNSTORE INDEX IDX_CS_cstest_cschar ON dbo.cstest (cschar)
go
select cschar from cstest where cschar ='a'

Talk about needle in a haystack, now we are looking for one record among 8,388,613.


Like we said before - even with the COLUMNSTORE index in place, the optimizer will choose to use an existing traditional nonclustered index if it makes more sense.

How about using the FORCESEEK and FORCESCAN table hints, introduced in SQL 2008 R2 SP1?

Even in a dataset where a query would use the COLUMNSTORE index, specify FORCESEEK and the optimizer will choose to use the traditional nonclustered index.  That makes it pretty obvious what SQL wants to do with a COLUMNSTORE index.


Open to feedback and more thoughts on COLUMNSTORE indexes.


“Consider the momentous event in architecture when the wall parted and the column became.” -Louis Kahn

No comments: