Conquering data chaos - a first step to data governance
Data chaos - often the status quo:
Most of the companies I've consulted for or worked at have suffered, to a greater or lesser extent,
from what I call data chaos. The term has been used in various contexts by others.
In the present article I take it to mean - very literally - data-related confusion.
My meaning is best illustrated by listing some examples that are typical of this affliction. Here are a few
that will no doubt be familiar to some of my fellow data professionals:
-
Existence of several desktop data repositories that hold business-critical information.
In most such cases the term "data repository" is a gross misnomer, as the said repository is a
spreadsheet or text document. The IT department is of course blissfully unaware of these data
sources, until the fateful moment when the user's hard disk dies, taking with it the data. Further,
quite apart from the recoverability of the data, there are also issues of data consistency, quality and
security. Rory Blyth's classic cartoon on using Excel as a database illustrates some of the perils
of "spreadsheet as database" rather nicely.
-
Uncontrolled, ad-hoc expansion of existing databases. This happens more often in smaller corporate
environments, where the IT department does not have the political clout to enforce proper development
procedures. Here's an example (variations of which I have seen innumerable times, in a
variety of flavours):
a business user requests < insert unreasonable request here > to be done within
< insert unreasonable time here >. The request is passed on to The Long-Suffering Corporate
Programmer who, having no choice but to shoot from the hip, then proceeds to hack out a quick and
dirty solution. The solution entails, among other things, creating a bunch of database tables
without proper regard for the current database structure. Niceties such as data integrity, consistent
nomenclature etc. are of course lost in the ensuing scramble to meet the deadline.
-
Unreliable or contradictory reports. Example: two departments in the same company come up with different
numbers for the same quantity. This situation is usually characterised by multiple data sources
holding inconsistent information. Occasionally it can even be that a single data source holds
contradictory data (in two different fields). Honest, I have seen this happen more than once.
In a nutshell: data chaos is characterised by uncontrolled proliferation of
data without any regard to security, quality, integrity, structure, or indeed any overall strategy.
Obviously such a situation stinks from a data governance perspective.
Something needs to be done to control data entropy, even if for no other reason than
regulatory compliance. The last is a pretty good reason, for upper management anyway, as non-compliance
could get one into trouble.
In the following section I'm going to discuss first steps towards conquering data chaos. The discussion
may be particularly relevant to medium-sized and "smaller" large organisations, which tend to suffer
more from data chaos because their chronically underfunded IT departments have no resources to tame the monster.
Conquering data chaos:
So what can be done to tackle this disarray in our data world? In a nutshell, the steps are as follows:
-
Map out all existing data repositories.
-
Fix existing problems.
-
Establish information system development and change control processes to prevent future problems.
A word of caution: note that the present article discusses the above from a data perspective
only. Some organisations might incorporate the above in an integrated strategy that encompasses
all IT functions (including, for example, development and operations). That said, I'd like to elaborate
on the above steps just a bit. It isn't possible to go into much depth in an article this size, but
I hope to give a sense of what is needed and the effort involved.
1. Map out all existing data repositories
The first step is to obtain a comprehensive list of all data repositories in the organisation,
including all databases, spreadsheets, text documents, scraps of paper etc. that hold information that
is important to the business. At the very least, this list will include the following information for
each data source:
- Repository name and brief description of functionality
What data does the repository hold? What is the data used for? How is the data
accessed and manipulated?
- Owner - Who owns the data?
- Description - What is the repository format (spreadsheet, file database, text
document etc.)? Be sure to include software brand name, manufacturer and version if relevant. You may be
surprised at the number of obscure and/or obsolete products that emerge from this exercise.
- Location - Where does the repository sit (which server, desktop, laptop etc)?
- Access mode - How is it accessed (web, direct etc)?
- Interfaces - Are there any data transfers to and from other systems? These should be listed
and documented in detail.
- Documentation - unfortunately this rarely exists!
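One way to keep the survey answers consistent is to capture each repository as a structured record. The following is only a minimal Python sketch, not a prescribed schema: the class name `RepositoryRecord`, its field names, the `risk_flags` rules and the sample entry are all assumptions made for illustration, loosely mirroring the items listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RepositoryRecord:
    """One entry in the data repository inventory (hypothetical schema)."""
    name: str
    purpose: str        # what data it holds and what the data is used for
    owner: str          # business owner of the data
    fmt: str            # spreadsheet, file database, text document, ...
    location: str       # server, desktop, laptop, ...
    access_mode: str    # web, direct, ...
    interfaces: List[str] = field(default_factory=list)  # systems it exchanges data with
    documented: bool = False

    def risk_flags(self) -> List[str]:
        """Return simple, illustrative warning flags for this repository."""
        flags = []
        if self.location.lower() in {"desktop", "laptop"}:
            flags.append("not on a server")
        if not self.documented:
            flags.append("undocumented")
        if not self.owner.strip():
            flags.append("no owner")
        return flags


# Made-up example entry, for illustration only
sales_sheet = RepositoryRecord(
    name="Regional sales tracker",
    purpose="Monthly sales figures per region",
    owner="",
    fmt="spreadsheet",
    location="Laptop",
    access_mode="direct",
)
print(sales_sheet.risk_flags())
```

Even a crude set of flags like this makes it easier to sort the inventory by urgency when you later write up weaknesses; the actual rules will depend on your environment.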
Obviously, this information needs to be collected from the business owners of the data. You are
going to need their cooperation. How might you get that? I'll discuss that towards the end of the
article. For the moment let's move on.
Once you've gathered the data you need to compile it
into a coherent document. You will want to make mini dataflow
diagrams to depict interfaces, particularly for systems that have a large number of interconnections.
You'll also need to highlight weaknesses (and there will be many of these) from the point of
view of data security and quality and make recommendations to fix these. These recommendations
should also be summarised in a nice digestible form for upper management. Specifics of
recommendations depend on your environment, but typically they would come under the ambit of one-off
projects or tasks that fix existing problems.
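To decide which systems deserve their own dataflow diagram, it can help to treat the surveyed interfaces as a list of source-to-target pairs and count how many each system takes part in. The snippet below is a rough Python sketch under that assumption; the system names, the `threshold` value and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical interfaces gathered from the survey: (source, target) pairs
interfaces: List[Tuple[str, str]] = [
    ("CRM", "Billing"),
    ("CRM", "Data warehouse"),
    ("Billing", "Data warehouse"),
    ("Sales spreadsheet", "CRM"),
]


def connection_counts(edges: List[Tuple[str, str]]) -> Dict[str, int]:
    """Count how many interfaces each system takes part in (inbound plus outbound)."""
    counts: Counter = Counter()
    for source, target in edges:
        counts[source] += 1
        counts[target] += 1
    return dict(counts)


def heavily_connected(edges: List[Tuple[str, str]], threshold: int = 3) -> List[str]:
    """Systems with at least `threshold` interfaces: candidates for a diagram."""
    counts = connection_counts(edges)
    return sorted(name for name, n in counts.items() if n >= threshold)


print(heavily_connected(interfaces))
```

With the made-up data above, only "CRM" reaches the threshold of three interfaces.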
2. Fix existing problems
So, what might your executive summary contain? It's a pretty safe bet that many organisations will
have similar problems. From experience, I would guess that the following action items may make your
shortlist:
- Consolidate desktop databases, spreadsheets and unstructured data (text files etc.). This might include:
-
Migrate desktop databases to server-based
databases. This will also ensure that the databases are properly backed up (I'm assuming here that
your server-based systems are backed up nightly. They are, right?).
The migration will entail a major clean-up of the data as it is likely to be riddled with
corruptions and inconsistencies. You will also want to take this opportunity to redesign some, if
not all, of the databases, particularly the ones that sit in spreadsheets or (horror!) text documents.
Some databases may well turn out to be non-critical. Good, these go out of the window.
-
Merge databases that contain similar information. There are often several databases that hold
unique transactional information along with common lookup data. These are candidates for consolidation.
However, be warned, this can be a tricky business as merging common data is often far from
easy because of inconsistencies (differences in nomenclature, datatypes, domains, for example).
-
Use a content management system for unstructured data.
- Document undocumented data repositories and associated processes to guard against the
bus factor - "bus" as in when the only person who knows anything about the system gets run
over by a mass transit vehicle. Note that both business and technical aspects need to be
recorded.
-
Streamline inefficient data flow processes. Very often, organisations have batch jobs that move
masses of data from A to B in the most convoluted fashion (see my
article on Rube Goldberg Interfaces for more on this). These are usually legacy jobs that
were set up years ago, and have been running ever since. There may be some obvious improvements
that can be applied to some of these. It may even turn out that some of these processes can
be done away with, as nobody needs the data any more.
-
Opportunities for upgrading systems. This may also be a good time to consider upgrading to a newer
version of database / application software. You will be surprised at how many allegedly
critical systems run on software versions that are on the brink of being desupported.
-
Formalise ownership of data and databases. This one is important: there should be clear accountability
as to who is responsible for the data from the business (data) and technical (database) side.
The former is usually a key business user (a senior business analyst, say) and the latter
a technical data professional.
The above are some common issues that I've seen in various organisations. I would love to be able
to predict all that will turn up in your environment; it would afford me an excellent living as a data consultant
- but alas, I have no crystal ball.
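As one concrete illustration of the merge item in the list above, here is a small Python sketch of comparing the "same" lookup list held in two databases before attempting to consolidate them. The normalisation rule (trim whitespace, ignore case), the function names and the sample department names are assumptions for illustration; real data will need rules of its own.

```python
from typing import Dict, Set


def normalise(value: str) -> str:
    """Collapse whitespace and case differences (an illustrative rule only)."""
    return " ".join(value.split()).casefold()


def compare_lookups(a: Set[str], b: Set[str]) -> Dict[str, Set[str]]:
    """Compare two lookup lists after normalisation."""
    norm_a = {normalise(v) for v in a}
    norm_b = {normalise(v) for v in b}
    return {
        "common": norm_a & norm_b,
        "only_in_a": norm_a - norm_b,
        "only_in_b": norm_b - norm_a,
    }


# Made-up department lookups from two hypothetical databases
db_a = {"Finance", "Human Resources", "IT"}
db_b = {"finance ", "HR", "it"}

result = compare_lookups(db_a, db_b)
print(sorted(result["common"]))
print(sorted(result["only_in_a"]))
print(sorted(result["only_in_b"]))
```

Simple normalisation reconciles "Finance" with "finance " and "IT" with "it", but "Human Resources" and "HR" still show up as unmatched. Differences in nomenclature like that need a mapping agreed with the business owners, which is part of what makes merging tricky.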
3. Establish information system development and change control processes
Fine, so now you've righted all the data wrongs in your environment. Next you have to ensure that
entropy remains under control. I use the word entropy in its physics sense deliberately, as the
analogy with the second law of thermodynamics is really quite apt. Maintaining order requires effort
because, if left untended, your data will end up in a state of high entropy (or disorder) again.
You need to establish IT processes to control the natural tendency of your data environment to move
from order to chaos. This is a huge effort, because it is about establishing and enforcing
a change in the way the IT department thinks and works, right across the board. Although the primary
concern in this article is data management, you also need to look at other IT processes, including
software development processes, project management and maybe even aspects of service management
(whew, thank you Wikipedia!). Obviously, I can't even begin to go into any of these areas here, as
I've rambled on quite a bit already. However, I do intend to write more about them in
the future, as they are all relevant to data professionals.
...And Finally:
This article has discussed what needs to be done to address the problem of data chaos.
I'd like to conclude the piece by addressing a point that I have studiously avoided thus far -
how does one get started? Obviously, this needs to be an enterprise initiative - it isn't enough
for IT to unilaterally declare the onset of hostilities against data chaos. Here are some
pointers on how to get the organisation involved:
Question: Who should sponsor the exercise?
Answer: In my experience, it is best to get upper management to sponsor the exercise.
Question: How do you do that?
Answer: The CIO / IT manager needs to make a case to his colleagues, highlighting the
dangers of data chaos. Given the prominence that corporate governance has gained lately,
the CIO shouldn't have too much trouble making the point.
Question: What needs to be done once management approves the initiative?
Answer: Frame a plan for how data will be gathered. A good bit of time will be spent on
creating a data repository survey questionnaire that covers the items discussed above.
When this is done, you will need end-users to participate
in the survey, so a communication needs to be sent out to key end-users informing them about
what's happening. The IT manager may need to speak to department heads before the communication
is dispatched, so that everyone knows what's coming. This will also help expedite
completion of the survey forms. You may (will!) also need to schedule times to speak to key
users, as the completed survey forms could raise further questions.
I could go on filling in more detail, but that would be elaborating on the broad steps
that I have already covered. As in many mathematical texts, messy details are best
left as an exercise for the reader. So I'll leave it here, wishing you all the very best in your
efforts to conquer data chaos.