About Me

IT Professional with more than 16 years experience in IT especially in the area of full life-cycle of Web/Desktop Applications Development, Database Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP support, etc.

Friday, September 14, 2007

Duplicates When analyzing table structures, it isn't only interesting to see the distinct values from a column or group of columns, but to study if there are any duplicates. In order to do that, I'm identifying which of the columns would make my records unique within the view/table, and then I'm grouping them and selecting only those records for which multiple occurrences were found. In this case there could be two reasons behind duplication: 1. The chosen columns don't identify a unique record. 2. One of the JOINs is creating troubles, maybe I forgot to add one constraint or the JOIN is partially incorrect. --checking for duplicates, <table_name> is placeholder for a table/view, while <column_1> , … , <column_n> for columns from the given table SELECT <column_1>,<column_2> --,<column_n> FROM <table_name> GROUP BY <column_1>,<column_2> --,<column_n> HAVING count(*)>1 To select the records which create the duplication, all we have to do is to JOIN the previous query with the base table on the chosen group of columns: SELECT A.* FROM <table_name> A JOIN ( SELECT <column_1>,<column_2> --,<column_n> FROM <table_name> GROUP BY <column_1>,<column_2> --,<column_n> HAVING count(*)>1 ) B ON A.<column_1> = B.<column_1> AND A.<column_2> = B.<column_2> --AND A.<column_n> = B.<column_n> Of course, the non-duplicate records can be shown with a LEFT JOIN: SELECT A.* FROM <table_name> A LEFT JOIN ( SELECT <column_1>,<column_2> --<column_n> FROM <table_name> GROUP BY <column_1>,<column_2> --<column_n> HAVING count(*)>1 ) B ON A.<column_1> = B.<column_1> AND A.<column_2> = B.<column_2> --AND A.<column_n> = B.<column_n> WHERE B. IS NULL Note: <column_1>, … , <column_n> are supposed to contain no NULLs.

No comments: