Oracle database design, development and administration
Project Management
Archived stuff
Cryptic crossword.
A bit about me.
Read my blog
Recent additions
Home
Data Management Articles

Dips on the road to data quality

Introduction:

An old colleague and friend once said to me, "If there's a way to mess-up data, you can be sure that a user will find it eventually." The friend's name is Shankar - and his surname happens to be Murthy. A fortuitous, near-concurrence of his name and statement with a well-known law governing the occurrence of mess-ups, leads me to dub his observation Murthy's Law.

Over the years I've come across a number of (allegedly well-designed) databases that suffer from the bane of dirty data. In keeping with our profession's penchant for TLAs I've termed such problems DIPs - Data Integrity Problems. Conversations with other data professionals confirm the ubiquity of DIPs, adding up to an empirical "proof" of Murthy's Law. (Before pedants jump on my case - I know it's not a proof; which is precisely why I've double quoted the word.) Now, assuming that Murthy's Law holds, the only thing we data professionals can do is minimise its impact by reducing the number of ways in which data can be corrupted.

Designing for and implementing data integrity is one of our main responsibilities as data professionals. We take this seriously. We enforce integrity through a variety of database features such as datatypes, constraints (primary key, referential and check) and triggers. Despite our best intentions and efforts, however, we find that Murthy's Law is pervasive in our data environments. Why is this so? In this piece (or polemic, if you like) I'd like to present a possible reason why so many of us struggle to control data quality in our environments.I hope some of what I say here has some resonance with your own experiences.

OK, So what's the reason?

Data repositories, according to Murthy, must be actively protected from corrupt data, else corruption is guaranteed. This is a nice segue into my number one reason for corrupt data: Packaged applications rarely implement referential integrity in the database. In my experience, this is one of the main causes of data corruptions sneaking their way into the enterprise. What follows is a true story that some of you may find familiar...

Some years ago I implemented a reporting system which sourced most of its data from a well-known CRM application (which shall remain unnamed). The CRM application did not (and still doesn't) implement referential integrity in the database. The application database had (has) a few thousand tables with nary a foreign key in sight. In contrast, the reporting database, designed by yours truly, has approximately 60 tables and 120 foreign key constraints. Now, when we ran the data conversion programs for the first time, we discovered more than ten different cases of referential integrity violations in the source database. Each of these DIPs affected several records (some as many as hundreds).

These DIPs were subsequently reported to the local CRM administrators, who, after initial disbelief, agreed to implement interfaces to find and delete the offenders on a nightly basis. As far as I know, these interfaces still run, and indeed find, new DIPs every now and then. This begs the following questions (which were all asked; answers in parentheses): How do these DIPs sneak their way into the database (they didn't know)? Isn't the CRM front-end supposed to implement data validation checks (yes!)? Aren't these checks supposed to obviate the need for referential integrity in the database (yes!)? From their answers, it was clear that application was not doing a very good job of enforcing data integrity.

This is just one instance of a problem, which by itself it doesn't indicate an issue of the magnitude I claim. However, given the popularity of this particular CRM offering, I suspect that others may (nay must!) be suffering from similar data indignities. Further, this application isn't the only packaged product guilty of not enforcing referential integrity. Others, several others, are offenders too. Ergo, their users too must suffer similar consequences.

My point: an enterprise's data is too important to be messed around this way. Application vendors need to take the responsibility to enforce data integrity correctly.

Back to the top