pointers, solutions and scripts for the SQL DBA
Not intended to replace common sense

2/21/2012

Removing Identical Duplicate Rows

Deleting duplicate rows out of a table can be tricky.  A brute force way to do this is with a TOP 1 and a cursor, but clearly there are set-based ways to accomplish this.  Using the ROW_NUMBER function with an appropriate OVER is fast and simple to understand.


This is a simple lab you can run to test it out.  



use sandboxdb
go 
drop table wdupholder
go 
create table wdupholder
( text1 varchar(10) not null
, text2 varchar(10) not null
)  
go 
insert into wdupholder (text1, text2) values ('test1','test1')
,('test1','test1')
,('test2','test2')
,('test2','test2')
,('test3','test3')
,('test3','test3')
,('test3','test3')
,('test3','test3') 
go 
select * from wdupholder
Note that we have a number of rows that are completely identical to eachother - no way to tell them apart.
delete a from 
(select DupRank = row_number() OVER (Partition By text1, text2 order by text1 asc) 
from wdupholder) a
where duprank > 1
go 
select * from wdupholder
go 
drop table wdupholder
go
In that case, we had no way to tell the duplicate records apart.  In this next scenario, our job is easier.  We have duplicate records but we have an integer column that may have been some sort of autoincrement or serialized value.  In your database, this may be a [createddate] datetime field that could be used to differentiate duplicate fields. 


We want to delete all but the lowest value for the column [int1] of any set of duplicate rows.  This is easy to do by using the integer column in the ORDER BY of the OVER.  We could delete all but the most recent values, based on [int1], by changing the ORDER BY to desc.


You should be able to apply this simple lab to your environment.  
create table wdupholder
( text1 varchar(10) not null
, text2 varchar(10) not null
, int1 int null
) 
go 
insert into wdupholder (text1, text2, int1) values ('test1','test1','1')
,('test1','test1','2')
,('test2','test2','3')
,('test2','test2','4')
,('test3','test3','5')
,('test3','test3','6')
,('test3','test3','7')
,('test3','test3',null) 
go 
select * from wdupholder
go 
delete a from 
(select DupRank = row_number() OVER (Partition By text1, text2 order by int1 asc) 
from wdupholder ) a 
where duprank > 1
go 
select * from wdupholder

Note the behavior of the NULL value in the int1 column, and be aware of NULL values in your own dataset.

3 comments:

  1. --This can be done in one step with a CTE
    create table MyTable(
    a int null,
    b int null,
    c int null,
    d int null)

    insert MyTable
    select 1,2,3,4
    insert MyTable
    select 1,2,3,5

    select * from MyTable

    --Ed's magical dedupeme code
    ;With DeDupeMe As (
    Select
    a,b,c,
    Row_Number() OVER( Partition By a,b,c
    Order By d) as RowNumber
    From MyTable )
    Delete From DeDupeMe
    Where RowNumber > 1

    ReplyDelete
  2. Anonymous- thanks for replying. The delete in my second example and the CTE delete in your example return the same exact execution plan. Six of one, half a dozen of the other...

    ReplyDelete
  3. This is ann excelent article. I was looking for that code for a long time.
    Thanks a lot.

    Alex

    ReplyDelete

All comments are moderated before publishing. If you find something wrong or disagree, please comment so that this blog can be as accurate and helpful as possible.