Oracle database design, development and administration
Project Management
Archived stuff
Cryptic crossword.
A bit about me.
Read my blog
Recent additions
Home
Data Management Articles

Rube Goldberg interfaces

ETL (Extract, Transform, Load) processes are an important element of data warehouse environments. These processes, which are basically responsible for loading or updating the data warehouse, pull data from various source systems (Extract), perform some cleansing actions (Transform) and push the transformed data (Load) into the target data warehouse. Since ETL processes serve as interfaces between the data warehouse and source systems they are often referred to as data transfer interfaces, or simply interfaces. These interfaces, which run periodically (daily, in most cases), are the only route through which the data warehouse is updated. Considering the importance of ETL, it is paramount that these data interfaces be designed for efficiency. By this I mean that they should be designed for maximum speed within technology and budget constraints (the interfaces need to be technically correct too, but that, I trust, goes without saying!). A direct consequence of this requirement is that data transfer interfaces should be as simple as possible. Any unnecessary complexity would add CPU cycles (and possibly increase storage and bandwidth requirements) thereby increasing process execution times.

So what does this have to do with Rube Goldberg . Well, a Rube Goldberg machine. is a device that performs a simple task in a needlessly complicated manner. By extension, I propose the name Rube Goldberg Interface (RGI) for any data transfer interface that performs a data load or update in an overly complicated or convoluted fashion. These beasts are more ubiquitous than one might expect, however recognising them is not always straightforward. The remainder of this note is devoted to tips on identifying potential offenders in your data environments. Characteristics of Rube Goldberg interfaces include:

  1. Excessive reliance on procedural code: Developers familiar with procedural languages often write interfaces that employ procedural code (like PL/SQL) instead of SQL. A particular nasty programming practice is to loop through a cursor, record-by-record, transforming each record as one goes along. This is a very inefficient way to process a large number of records. Generally, the most efficient way to manipulate data stored in a relational database is through SQL, because SQL processes the entire data set in one go. For this reason SQL will perform much better than the corresponding procedural code, with the performance differential improving as the number of records in the data set increases. As such, SQL should always be the language of choice for an ETL process. See my article on designing efficient ETL processes for some Oracle specific tips on converting procedural code to SQL-based code.


  2. Superfluous steps: I've seen several interfaces that extract data from a source system, write it out to a flat file, transfer the file over the network to the target server and finally load the flat file into target database. It is often possible to eliminate the intermediate, flat file steps through the use of database links (Oracle) or linked servers (SQL Server). This is particularly useful for smaller data volumes (Note that these days "small" can mean millions of records!)


  3. Conventional data inserts: Most databases offer features that allow one to load large quantities of data using bulk loading (direct path inserts in Oracle and bulk inserts in SQL Server). Bulk loading techniques should be preferred over standard SQL insert statements for large data loads. My ETL article describes how direct path inserts are done in Oracle.


  4. Logged transactions: One of the main features of databases is the ability to fold a number of data manipulation operations into a single logical transaction. Transactions are central to databases because they ensure that data integrity is maintained when several concurrent users access the same data. One of the features of transactions is that they can be rolled back, or reversed, in case of failure. To enable rollbacks, all operations occurring within a transaction must be written to a transaction log. Obviously writing to the log takes resources (and time). In the case of ETL operations, however, logging is unnecessary because the data can generally be re-extracted from the source systems if need be. For this reason, data inserts should be carried via a non-logging mode. Details of this are database specific - my ETL article describes how this can be achieved in Oracle.


  5. Inappropriate use of the technology du jour (can you spell XML?): Much of the hype surrounding XML as a "silver bullet" solution for data transfer interfaces is just that - hype. For high volume interfaces the transfer is much more efficiently achieved via one of the following:
    • Database to database transfers wherever possible (see point 2 above) or,
    • Using a simple flat file format (comma-separated, for example) which is subsequently loaded via a bulk insert (see point 3 above), when a direct database to database transfer is not possible.
    For transfers of large quantities of data it is almost always inappropriate to use XML. The usual reason for using XML is that it facilitates the standardisation of interfaces. Yup, you might be able to standardise interfaces for whatever that's worth, but performance will be such a dog that you may as well key in the data manually. OK I exaggerate, but not by much...and only to make a well-founded point. XML has its place as an data exchange format between transactional systems, or even for trickle data feeds into data warehouses. However, it is almost never the solution for high volume data transfer interfaces.
Before concluding I should mention that sometimes there is no option but to go the Rube Goldberg way. Here's an example from experience: A large multinational organisation had a misguided information technology policy that disallowed direct database access to source systems. In this situation the data warehouse architect had no option but to design a Rube Goldberg interface. That said, most complex interfaces I've seen have no such constraints - they achieved their Rube Goldberg-ness solely through bad design. I hope this short piece helps you identify these beasts in your data environments.

Back to the top