Skip links

SQL – Finding and Removing Duplicates

With the use of a CTE and the ROW_NUMBER() ranking function, you can easily find, and delete, any duplicates in a table.

 

First, let’s take a look at the code you’ll need within your CTE:

 


SELECT ROW_NUMBER() OVER (PARTITION BY DuplicateSearch ORDER BY DuplicateSearch) [No of Rows],
ID_Column
FROM YourTable
WHERE DuplicateSearch IN (
SELECT DuplicateSearch
FROM YourTable
GROUP BY DuplicateSearch
HAVING COUNT (*) > 1
)

 

The WHERE clause in the query is used to find and limit your query only to those rows which are considered duplicates. It isn’t necessary, but may help performance for larger datasets.

 

The column ‘DuplicateSearch’ is the column you’d use to manually locate a duplicate. For example, if you have a table containing two rows with a 5 in a column, when you know there should only be one, you’d use that column in place of ‘DuplicateSearch’.

 

The ROW_NUMBER() function will count the number of rows based on thePARTITION BY you give it. Again, use you ‘DuplicateSearch’ column to partition on, as this will segregate each set of duplicates and give them a numeric value starting at 1, and incrementing by 1 for each additional value.

 

Next you’ll want to query your CTE to actually view the result:

 

SELECT * FROM YourTable
WHERE ID_Column IN (SELECT ID_Column FROM YourCTE WHERE [No of Rows] > 1)

 

The above code will query the table with duplicates and limit the dataset to the duplicates found within your CTE. The magic happens in the WHERE clause, a non-correlated subquery is used to return all the IDs of rows that have a ‘No of Rows’ (our ROW_NUMBER() column) greater than 1, indicating it’s a duplicate.

 

The code all as one piece looks like the following:

 

;WITH YourCTE AS ( SELECT ROW_NUMBER() OVER (PARTITION BY DuplicateSearch ORDERBY DuplicateSearch) [No of Rows], ID_Column FROM YourTable WHEREDuplicateSearch IN ( SELECT DuplicateSearch FROM YourTable GROUP BY DuplicateSearch HAVING COUNT (*) > 1 ) )
SELECT * FROM YourTable
WHERE ID_Column IN (SELECT ID_Column FROM YourCTE WHERE [No of Rows] > 1)

 

To actually delete the rows, all you need to do is swap SELECT * with DELETE, but please do check your results thoroughly in the SELECT first, and take a backup if necessary!

 

The above code assumes that there is only one column necessary in defining a duplicate, but if multiple columns are required, you’ll need to expand theROW_NUMBER()‘s PARTITION BY.

 

I’m a big advocate of proper planning and design when undertaking any project, which would help avoid issues such as duplicates in the first place, but sometimes these things are unavoidable.

Leave a comment