Tuesday, February 21, 2012

Removing Identical Duplicate Rows

Deleting duplicate rows out of a table can be tricky.  A brute force way to do this is with a TOP 1 and a cursor, but clearly there are set-based ways to accomplish this.  Using the ROW_NUMBER function with an appropriate OVER is fast and simple to understand.


This is a simple lab you can run to test it out.  



use sandboxdb
go 
drop table wdupholder
go 
create table wdupholder
( text1 varchar(10) not null
, text2 varchar(10) not null
)  
go 
insert into wdupholder (text1, text2) values ('test1','test1')
,('test1','test1')
,('test2','test2')
,('test2','test2')
,('test3','test3')
,('test3','test3')
,('test3','test3')
,('test3','test3') 
go 
select * from wdupholder
Note that we have a number of rows that are completely identical to eachother - no way to tell them apart.
delete a from 
(select DupRank = row_number() OVER (Partition By text1, text2 order by text1 asc) 
from wdupholder) a
where duprank > 1
go 
select * from wdupholder
go 
drop table wdupholder
go
In that case, we had no way to tell the duplicate records apart.  In this next scenario, our job is easier.  We have duplicate records but we have an integer column that may have been some sort of autoincrement or serialized value.  In your database, this may be a [createddate] datetime field that could be used to differentiate duplicate fields. 


We want to delete all but the lowest value for the column [int1] of any set of duplicate rows.  This is easy to do by using the integer column in the ORDER BY of the OVER.  We could delete all but the most recent values, based on [int1], by changing the ORDER BY to desc.


You should be able to apply this simple lab to your environment.  
create table wdupholder
( text1 varchar(10) not null
, text2 varchar(10) not null
, int1 int null
) 
go 
insert into wdupholder (text1, text2, int1) values ('test1','test1','1')
,('test1','test1','2')
,('test2','test2','3')
,('test2','test2','4')
,('test3','test3','5')
,('test3','test3','6')
,('test3','test3','7')
,('test3','test3',null) 
go 
select * from wdupholder
go 
delete a from 
(select DupRank = row_number() OVER (Partition By text1, text2 order by int1 asc) 
from wdupholder ) a 
where duprank > 1
go 
select * from wdupholder

Note the behavior of the NULL value in the int1 column, and be aware of NULL values in your own dataset.

3 comments:

Anonymous said...

--This can be done in one step with a CTE
create table MyTable(
a int null,
b int null,
c int null,
d int null)

insert MyTable
select 1,2,3,4
insert MyTable
select 1,2,3,5

select * from MyTable

--Ed's magical dedupeme code
;With DeDupeMe As (
Select
a,b,c,
Row_Number() OVER( Partition By a,b,c
Order By d) as RowNumber
From MyTable )
Delete From DeDupeMe
Where RowNumber > 1

w said...

Anonymous- thanks for replying. The delete in my second example and the CTE delete in your example return the same exact execution plan. Six of one, half a dozen of the other...

Anonymous said...

This is ann excelent article. I was looking for that code for a long time.
Thanks a lot.

Alex