Oracle database design, development and administration
Project Management
Archived stuff
Cryptic crossword.
A bit about me.
Read my blog
Recent additions
Home
Data Management Articles

Where's the key? A short note on de-duplicating data

To paraphrase a cliche - if I had five cents for every primary key (PK) free table that I've come across, I'd be able to buy a lifetime's supply of Vegemite. This morning I stumbled on yet another key-free table. The offending object had 30+ columns and a few million rows. Not a big table by today's standards, but it still took an hour or so to figure out the candidate keys (candidates for primary keys) and get rid of duplicate rows (yes- there were more than a few of those). At the end of it I was a tad irritated that I'd lost an hour of my time dealing with the consequences of someone else's shoddy design. This was not the first time it's happened, nor will it be the last. I could have had a chat with the offending developer, but thought it might come out all wrong because of my out-of-proportion annoyance. Instead, I reckoned it would be more useful, and cathartic, to put it down in writing. Hence this short note, (or rant, depending on your point of view) on keyless tables.

Now, everyone and anyone who works with relational databases knows that primary keys are a Good Thing because they ensure uniqueness of rows. What is perhaps not as well appreciated is that table "keylessness", so to speak, comes in two forms: 1) Obvious - where there is simply no PK of any kind and 2) Not so obvious - where there is a PK, but it is not very helpful in ensuring data integrity. The first form is ... well, obvious as there is no key in sight. The second one occurs where there's a key that is autogenerated say by an Oracle sequence or SQL Server identity column. In this case the table could well contain duplicate rows that differ only in the value of the artificially generated key (sometimes known as surrogate keys). This, incidentally, is a really good reason for having unique indexes based on non-primary key columns for tables that have artificial primary keys.

OK, so you've encountered one of these keyless wonders. How do you fix it? The two-step answer is simple - first you rid the table of duplicate rows and then you tack on a primary key. What follows are the details of this process.

The straightforward, database independent way to de-duplicate data is to use a select distinct ... or select ..group by... idiom with the entire column list in the select and group by clause in the latter case. Typically one would create a temporary table to hold the results of the de-duplication, truncate the original table and then re-insert the de-duplicated data back into the original table. Using group by is my preferred approach because it also enables one to get a count for each record. To do this simply add count(*) to the column list. One can use this approach, coupled with the having clause, to return only the duplicate rows like so:

select
  < Column List >,
  count(*)
  ColumnC,
from
  TableA
group by
  < Column List >
having
count(*)>1

where < Column List > is a comma separated list of columns.

In Oracle one can delete duplicates in situ via a trick that employs the rowid pseudocolumn. The Oracle rowid is guaranteed to be unique because it is a pointer to the physical location of the row on disk. Here's the SQL:

delete from
  TableA
where
   rowid >
  (select
    min(rowid)
  from
    TableA
  group by
    < Column List >
  having
    count(*)>1)

where, once again, < Column List > is a comma separated list of columns.

Now let's look at a variant of the problem, where the key of the table is known, but, because integrity isn't enforced, there are rows that violate the key. The aforementioned techniques can't be used here because the remaining (non-key) columns might not be duplicated. However, it is still easy enough to identify the rows that violate the key using the in clause. Here's an example for a table has a composite key consisting of 3 varchar columns - ColumnA, ColumnB, and ColumnC:

-- T-SQL Syntax

select
< Column List >
from
  TableA
where   ColumnA+ColumnB+ColumnC
in
  (select
    ColumnA+ColumnB+ColumnC
  from
    TableA
  group by
    ColumnA,
    ColumnB,
    ColumnC
  having
    count(*)>1)
order by
  ColumnA,
  ColumnB,
  ColumnC

-- Oracle Syntax

select
< Column List >
from
  TableA
where   (ColumnA,ColumnB,ColumnC)
in
  (select
    ColumnA,
    ColumnB,
    ColumnC
  from
    TableA
  group by
    ColumnA,
    ColumnB,
    ColumnC
  having
    count(*)>1)
order by
  ColumnA,
  ColumnB,
  ColumnC

where, yet again,< Column List > is a comma separated list of columns. Note that if the columns are not varchar, they will need to be cast to varchar before concatenation in the T-SQL version of the query. The Oracle version of the query works for mixed datatypes because it does a column by column comparison. Once all duplicates have been identified one can develop criteria to determine which rows are to be eliminated. These criteria would depend on specific business rules and may involve non-key rows.

Finally, once you're done eliminating dirty data, don't forget to add a primary key. If you don't, you can be sure that you'll be doing janitorial duties on the table again in the not so distant future. Staying with the above example, the SQL to add a primary key in Oracle is:

alter table TableA add constraint pk_tableA primary key (columnA, columnB, ColumnC)

where pk_tableA is the name of the primary key constraint. In SQL Server, one needs to first make all key columns non-nullable before creating the primary key. So, assuming the columns are all varchar(20), the required SQL is:

alter table tableA alter column columnA varchar(20) not null

alter table tableA alter column columnB varchar(20) not null

alter table tableA alter column columnC varchar(20) not null

alter table TableA add constraint pk_tableA primary key (columnA, columnB, columnC)

Once this is done, you can rest assured that you won't be de-duplicating data on this table ever again.

Back to the top