|
|
|
||||||||
|
Data Management Articles
Where's the key? A short note on de-duplicating data
To paraphrase a cliche - if I had five cents for every
primary key (PK) free table
that I've come across, I'd be able to buy a lifetime's supply of
Vegemite. This morning
I stumbled on yet another key-free table. The offending object had 30+ columns and
a few million rows. Not a big table by today's standards, but it still took an hour
or so to figure out the candidate keys (candidates for primary keys) and get rid
of duplicate rows (yes- there were more than a few of those). At the end of it I was
a tad irritated that I'd lost an hour of my time dealing with the consequences
of someone else's shoddy design. This was not the first time it's happened, nor will it
be the last. I could have had a chat with the offending developer, but thought it might
come out all wrong because of my out-of-proportion annoyance. Instead, I reckoned it
would be more useful, and cathartic, to put it down in writing. Hence this short note,
(or rant, depending on your point of view) on keyless tables.
Now, everyone and anyone who works with relational databases knows that primary keys are a Good Thing because they ensure uniqueness of rows. What is perhaps not as well appreciated is that table "keylessness", so to speak, comes in two forms: 1) Obvious - where there is simply no PK of any kind and 2) Not so obvious - where there is a PK, but it is not very helpful in ensuring data integrity. The first form is ... well, obvious as there is no key in sight. The second one occurs where there's a key that is autogenerated say by an Oracle sequence or SQL Server identity column. In this case the table could well contain duplicate rows that differ only in the value of the artificially generated key (sometimes known as surrogate keys). This, incidentally, is a really good reason for having unique indexes based on non-primary key columns for tables that have artificial primary keys. OK, so you've encountered one of these keyless wonders. How do you fix it? The two-step answer is simple - first you rid the table of duplicate rows and then you tack on a primary key. What follows are the details of this process. The straightforward, database independent way to de-duplicate data is to use a select distinct ... or select ..group by... idiom with the entire column list in the select and group by clause in the latter case. Typically one would create a temporary table to hold the results of the de-duplication, truncate the original table and then re-insert the de-duplicated data back into the original table. Using group by is my preferred approach because it also enables one to get a count for each record. To do this simply add count(*) to the column list. One can use this approach, coupled with the having clause, to return only the duplicate rows like so:
select
where < Column List > is a comma separated list of columns. < Column List >, count(*) ColumnC, from TableA group by < Column List > having count(*)>1 In Oracle one can delete duplicates in situ via a trick that employs the rowid pseudocolumn. The Oracle rowid is guaranteed to be unique because it is a pointer to the physical location of the row on disk. Here's the SQL:
delete from
where, once again, < Column List > is a comma separated list of columns.TableA where rowid > (select min(rowid) from TableA group by < Column List > having count(*)>1) Now let's look at a variant of the problem, where the key of the table is known, but, because integrity isn't enforced, there are rows that violate the key. The aforementioned techniques can't be used here because the remaining (non-key) columns might not be duplicated. However, it is still easy enough to identify the rows that violate the key using the in clause. Here's an example for a table has a composite key consisting of 3 varchar columns - ColumnA, ColumnB, and ColumnC:
-- T-SQL Syntax
where, yet again,< Column List > is a comma separated list of columns. Note that
if the columns are not varchar, they will need to be cast to varchar before concatenation
in the T-SQL version of the query. The Oracle version of the query works for mixed datatypes
because it does a column by column comparison. Once all duplicates have been identified
one can develop criteria to determine which rows are to be eliminated. These criteria would depend
on specific business rules and may involve non-key rows. select < Column List > from TableA where ColumnA+ColumnB+ColumnC in (select ColumnA+ColumnB+ColumnC from TableA group by ColumnA, ColumnB, ColumnC having count(*)>1) order by ColumnA, ColumnB, ColumnC -- Oracle Syntax select < Column List > from TableA where (ColumnA,ColumnB,ColumnC) in (select ColumnA, ColumnB, ColumnC from TableA group by ColumnA, ColumnB, ColumnC having count(*)>1) order by ColumnA, ColumnB, ColumnC Finally, once you're done eliminating dirty data, don't forget to add a primary key. If you don't, you can be sure that you'll be doing janitorial duties on the table again in the not so distant future. Staying with the above example, the SQL to add a primary key in Oracle is:
alter table TableA add constraint pk_tableA primary key (columnA, columnB, ColumnC)
where pk_tableA is the name of the primary key constraint.
In SQL Server, one needs to first make all key columns non-nullable before creating the
primary key. So, assuming the columns are all varchar(20), the required SQL is:
alter table tableA alter column columnA varchar(20) not null
Once this is done, you can rest assured that you won't be de-duplicating data on this table
ever again.
alter table tableA alter column columnB varchar(20) not null alter table tableA alter column columnC varchar(20) not null alter table TableA add constraint pk_tableA primary key (columnA, columnB, columnC) Back to the top |
||||||||
|
|