|
|
|
|
|
|
Data Management Articles
Rube Goldberg interfaces
ETL (Extract, Transform, Load) processes are an important element of data
warehouse environments. These processes, which are basically responsible for loading or
updating the data warehouse, pull data from various source systems (Extract), perform
some cleansing actions (Transform) and push the transformed data (Load) into the target data warehouse.
Since ETL processes serve as interfaces between the data warehouse and source systems they are often
referred to as data transfer interfaces, or simply interfaces.
These interfaces, which run periodically (daily, in most cases), are the only route through which the data warehouse is
updated. Considering the importance of ETL, it is paramount that these data interfaces be designed for
efficiency. By this I mean that they should be designed for maximum speed within technology
and budget constraints (the interfaces need to be technically correct too, but that, I trust,
goes without saying!). A direct consequence of this requirement is that data transfer interfaces
should be as simple as possible. Any unnecessary complexity would add CPU cycles (and possibly
increase storage and bandwidth requirements) thereby increasing process execution times.
So what does this have to do with Rube Goldberg
. Well, a Rube Goldberg machine.
is a device that performs a simple task in a needlessly complicated manner. By extension, I propose
the name Rube Goldberg Interface (RGI) for any data transfer interface that performs a data load or update
in an overly complicated or convoluted fashion. These beasts are more ubiquitous
than one might expect, however recognising them is not always straightforward. The remainder of this note
is devoted to tips on identifying potential offenders in your data environments. Characteristics
of Rube Goldberg interfaces include:
-
Excessive reliance on procedural code: Developers familiar with procedural
languages often write interfaces that employ procedural code (like PL/SQL)
instead of SQL. A particular nasty programming
practice is to loop through a cursor, record-by-record, transforming each record as one
goes along. This is a very inefficient way to process a large number of records.
Generally, the most efficient way to manipulate data stored in a
relational database is through SQL, because SQL processes the entire data set in one go.
For this reason SQL will perform much better than the corresponding procedural code, with the
performance differential improving as the number of records in the data set increases.
As such, SQL should always be the language of choice for an ETL process.
See my article on
designing efficient ETL processes for some Oracle specific tips
on converting procedural code to SQL-based code.
-
Superfluous steps: I've seen several interfaces that extract data from a source system,
write it out to a flat file, transfer the file over the network to the target server and finally load
the flat file into target database. It is often possible to eliminate the intermediate, flat file
steps through the use of database links (Oracle) or linked servers (SQL Server). This is particularly
useful for smaller data volumes (Note that these days "small" can mean millions of records!)
-
Conventional data inserts: Most databases offer features that allow one to load large
quantities of data using bulk loading (direct path inserts in Oracle and bulk inserts in SQL Server).
Bulk loading techniques should be preferred over standard SQL insert statements for large data loads. My
ETL article describes how direct path inserts
are done in Oracle.
-
Logged transactions: One of the main features of databases is the ability to fold
a number of data manipulation operations into a single logical
transaction. Transactions are central to databases because they
ensure that data integrity is maintained when several concurrent users access the same data.
One of the features of transactions is that they can be
rolled back, or reversed, in case of failure. To enable rollbacks, all operations occurring within a transaction must be
written to a transaction log.
Obviously writing to the log takes resources (and time). In the case of ETL operations, however,
logging is unnecessary because the data can generally be re-extracted from the source systems if need be.
For this reason, data inserts should be carried via a non-logging mode. Details of this are database
specific - my ETL article describes how this can
be achieved in Oracle.
-
Inappropriate use of the technology du jour (can you spell XML?): Much of the hype surrounding XML
as a "silver bullet" solution for data transfer interfaces is just that - hype.
For high volume interfaces the transfer is much more efficiently achieved via one
of the following:
- Database to database transfers wherever possible (see point 2 above) or,
- Using a simple flat file format (comma-separated, for example) which is subsequently
loaded via a bulk insert (see point 3 above), when a direct database to database
transfer is not possible.
For transfers of large quantities of data it is almost always inappropriate to use XML. The
usual reason for using XML is that it facilitates the
standardisation
of interfaces. Yup, you might be able to standardise interfaces for whatever that's worth,
but performance will be such a dog that you may as well key in the data manually. OK I exaggerate,
but not by much...and only to make a well-founded point. XML has its place as an data exchange
format between transactional systems, or even for trickle data feeds into data warehouses. However,
it is almost never the solution for high volume data transfer interfaces.
Before concluding I should mention that sometimes there is no option but to go the Rube Goldberg way.
Here's an example from experience: A large multinational organisation had a
misguided information technology policy that disallowed
direct database access to source systems. In this situation the data warehouse architect had no
option but
to design a Rube Goldberg interface.
That said, most complex interfaces I've seen have no such constraints - they achieved
their Rube Goldberg-ness solely through bad design. I hope this short piece helps you identify
these beasts in your data environments.
Back to the top
|
| |
|