Jun 7, 2006

Eliminating duplicate rows in a table - SQL Server 2000

There are quite a few number of ways by which we can delete the duplicate records from a table. I am just discussing 3 methods which I find interesting. I don't take ownership for any of these methods. I had seen these implementations put to use at different places. And I am just compiling it so that you get the different options in one place and chose the best one for yourself..

Actually, most of my posts don't have large explanation, just examples. Thats the fastest way anything gets into my head :) And I feel that, by the time you comprehend the example, you would be good enough to explain the logic to someone.

So, here it goes. I have a table with duplicates :)

create table tdup (a int, b int)
go
insert into tdup
select 1,1
union all
select 1,2
union all
select 1,3
union all
select 1,3
union all
select 1,2
union all
select 1,2
union all
select 2,1
union all
select 3,1



Now I want to eliminate the duplicate rows from tdup. The methods are given in the order of my preference.

Method 1:

Use a temp table - Clean and simple.

-- Select the distinct of the duplicate rows
select * into #tdup from tdup group by a,b having count(*) > 1

-- Delete all the rows which have duplicates
delete from x
from tdup x, #tdup y
where x.a = y.a and x.b= y.b

-- Insert the distinct copy back to the table
insert into tdup select * from #tdup

drop table #tdup

Method 2:

Use a cursor - No temp table

declare @a int,
@b int,
@rowcnt int

-- Create a cursor that will loop through those rows that has duplicates.

declare c1 cursor for
select a,b,count(*)
from tdup
group by a,b
having count(*) > 1
open c1
fetch next from c1 into @a,@b,@rowcnt

while @@fetch_status = 0
begin
--I set the row count to be one less than the total number of duplicates for a row and
--issue a delete statement

set @rowcnt = @rowcnt - 1
set rowcount @rowcnt
delete from tdup
where a=@a and b= @b
fetch next from c1 into @a,@b,@rowcnt
end
close c1
deallocate c1

set rowcount 0


Method 3:

Change the table schema - Do not use it unless you have a compelling reason. I have added this method just for reference.


--Add an Identity column
alter table tdup add id int not null identity(1,1)

--Now each row is distinct. Removing the duplicates is easy
delete from tdup
where tdup.id not in (select min(id) from tdup group by a,b)

-- Remove the Id column
alter table tdup drop column id


If anyone knows of a better way on eliminating duplicates or improving the above three methods, Let me know. I will add it here with credits :)

6 comments:

Anonymous said...

It works fine. I used first method and did it in 1 minute. Thanks for it.

Anonymous said...

Am I misreading the code or does example one remove all records that have duplicate values?

I like the second method better, since it looks like I could just skip deleting the first dupe in the list in order to preserve one copy of the dupe. What do you think?

Omnibuzz said...

Jim,
The first method does remove all the records but inserts the distinct records back. If you look at the comments it might help... but the general flow is this:

1) Take a distinct copy of all the records that has duplicates.
2) Delete all the records that are duplicated.
3) Insert the distinct records back into the table.

I prefer the first method better than the second because I'd love to use set based operations whenever possible instead of cursors. But, that again is a matter of opinion and comfort level.

Archana said...

thanks a lot for all these 3 method, i think it will work much bettet than ever in my interview

Gayathri said...

All 3 method r excellent.
But I dont want to delete the rows I just want to select them
Can u please suggest something for that
thanks in advance

Gayathri said...

All three methods r excellent.
But I want to select them instead of deleting.
Can u please suggest me something for that
Thank you in advance

Post a Comment