Data Management
Project Management
Archived stuff
Cryptic crossword.
A bit about me.
Read my blog
Recent additions
Home
Oracle Articles

An introduction to the SQL MODEL clause


Note: A script to create the example schema can be found here

Introduction:

The SQL model clause, first introduced in Oracle 10g, is a powerful addition to the repertoire of SQL programmers. With it, the canny analyst can address individual "cells" within a table ( this is very "non-relational", but as SQL programmers we're called upon to do all sorts of non-relational things). In effect, one can now do spreadsheet calculations using SQL. Vey powerful stuff indeed. This article is an introduction to the model clause. My aim is to provide enough background material and examples for readers to get comfortable with using this new feature offered in Oracle 10g.

The examples in the article are based on a table called MODEL_SALE, which contains monthly sales for 2004-2005 for regions (NORTH and SOUTH) by division (HERE and THERE) . A script to create and populate this table can be accessed here.

Terminology

There's some new terminology associated with the model clause. Well, most of it is not really new if you are familiar with analytic SQL and data warehouse terminology, but I'll run through a quick explanation for completeness:

Cell: This is an intersection between a specific row and column. Essentially, just envision the relational result set as a spreadsheet.

Rule: These are calculations applied at the level of a cell. There is new syntax associated with defining rules. Much of this will become clear as we go through a few examples below.

Partition: A partition defines a subset of data within which rules are applied. A rule is applied independently to each partition. This term is used in the same sense as in analytic SQL. In our examples the data is partitioned by region and division.

Dimension: These are keys to specific cells within a partition. A complete specification of partition and dimension uniquely defines a cell. In our examples the data is dimensioned by year and month.

Measure: These are (generally) numeric values to which rules are applied. In our examples sale amount is the measure.

A First Example:

The above terminology is best understood through an example. So here goes:

--query (1) - Generate sales amount for Jan 2005 based on previous year's data

select
  region,
  division,
  sale_year,
  sale_month,
  sale_amount
from
  model_sales
model
partition by (region, division)
dimension by (sale_year,sale_month)
measures (sale_amount)
rules
  (sale_amount[2005,1]=sale_amount[2004,1]*4)

This calculates the sales for Jan 2005 in all regions and divisions (partitions) based on the following rule: sales for Jan 2005 = 4*(Sales for Jan 2004). The rule variables (dimensions) are year and month. Note that it is possible to specify multiple rules within one statement. See (3) below for an example. The above SQL returns all rows, including unmodified ones. To return only newly generated rows, we modify the above as follows:

--query (2) -same as query (1), returning updated rows only

select
  region,
  division,
  sale_year,
  sale_month,
  sale_amount
from
  model_sales
model return updated rows --return only updated rows
partition by (region, division)
dimension by (sale_year,sale_month)
measures (sale_amount)
rules
  (sale_amount[2005,1]=sale_amount[2004,1]*4)

This returns only the newly generated rows - i.e Jan 2005 only. For clarity we'll use this option in many of the examples below.

As an aside it is worth mentioning that, despite the use of the word "update", the SQL model clause does not update or insert any data in MODEL_SALES. To modify or insert data one would have to feed the results of a MODEL SQL statement to an update or insert statement.

Upsert and Update semantics:

Rules can operate with either upsert or update semantics. The distinction between the two is best explained through an example. Consider the following statement:

--query (3) - Generate sales amount for Jan 2004 and Jan 2005

select
  region,
  division,
  sale_year,
  sale_month,
  sale_amount
from
  model_sales
model return updated rows
partition by (region, division)
dimension by (sale_year,sale_month)
measures (sale_amount)
rules (
sale_amount[2004,1]=sale_amount[2004,1]*2, -- rule 1
sale_amount[2005,1]=sale_amount[2004,1]*4) -- rule 2

This calculates sales amounts for Jan 2004 (twice the preexisting value for Jan 2004) and 2005 (four times the preexisting value for Jan 2004). The former is an update of an existing value whereas the latter is a new value. By default, the model clause returns both updated and new values. This is called rules upsert. We can alter the SQL so that only updated values are returned. Here's how:

--query (4) - variant of (3); illustrating rules update

select
  region,
  division,
  sale_year,
  sale_month,
  sale_amount
from
  model_sales
model return updated rows
partition by (region, division)
dimension by (sale_year,sale_month)
measures (sale_amount)
rules update (
sale_amount[2004,1]=sale_amount[2004,1]*2, -- rule 1
sale_amount[2005,1]=sale_amount[2004,1]*4) -- rule 2

In contrast to (3), this returns only updated values - the cells for Jan 2004 are returned whereas those for Jan 2005 aren't. This is rules update in action.

Update semantics or rules update have to be specified explicitly as shown above whereas rules upsert doesn't have to be specified explicitly, as it is the default.

Positional and Symbolic references:

Dimensions within a rule can be specified in one of two ways - by position or symbolically. In a positional reference the dimension is specified as a constant, as in queries 1 through 4 above. Here's another example of positional referencing:

--query (5) - positional reference

select
  region,
  division,
  sale_year,
  sale_month,
  sale_amount
from
  model_sales
model return updated rows
partition by (region, division)
dimension by (sale_year,sale_month)
measures (sale_amount)
rules (
  sale_amount[2004,2]=sale_amount[2004,1]*4,
  sale_amount[2004,3]=sale_amount[2004,1]*4)

In the above query, dimensions are referenced by constants; 2004 for the year and 2 and 3 for the month .

Symbolic referencing uses boolean expressions instead of constants. Query (5) can be recast using symbolic references as follows:

--query (6) - symbolic reference

select
  region,
  division,
  sale_year,
  sale_month,
  sale_amount
from
  model_sales
model return updated rows
partition by (region, division)
dimension by (sale_year,sale_month)
measures (sale_amount)
rules (
  sale_amount[sale_year=2004,sale_month between 2 and 3]=sale_amount[2004,1]*4)

In (6) only those dimension values for which the expressions evaluate to true will be used in computations. Note that, as in the case above above, symbolic references often result in compact rule expressions compared to positional references.

It is important to note that rules that use symbolic references will not generate new cells, irrespective of whether or not rules upsert is specified. To see this, run query (6) with sale_year=2005 on the left hand side of the rule. I'll discuss a way to get around this
later in the article. At the risk of belabouring the point, there is no such limitation for positional references. You can test this by modifying query(5) appropriately.

Wildcard dimensions and the cv() function:

One can specify all values of a dimension using the any keyword as follows:

--query (7) - wildcard dimension

select
  region,
  division,
  sale_year,
  sale_month,
  sale_amount
from
  model_sales
model return updated rows
partition by (region, division)
dimension by (sale_year,sale_month)
measures (sale_amount)
rules
  (sale_amount[2004,any] =sale_amount[2004,cv()]*6) --cv():current value on lhs

The cv() function used above provides a way to reference the current value of the relevant dimension on the right hand side. cv() can also be used with arguments referring to specific dimensions, but I won't go into this here. Check the Oracle docs for details.

Wildcarding is obviously quite useful. However, there's a trap for the unwary: let's try the same statement with the year on the left hand side of the rule changed to 2005.

--query (8) - wildcard dimension 2

select
  region,
  division,
  sale_year,
  sale_month,
  sale_amount
from
  model_sales
model return updated rows
partition by (region, division)
dimension by (sale_year,sale_month)
measures (sale_amount)
rules
  (sale_amount[2005,any] =sale_amount[2004,cv()]*6)

With rules upsert operating (reminder: this is the default option) we expect this statement to return rows for all regions and divisions for 2005. However, the statement returns no data. This is because "any" operates as a symbolic reference, and will therefore not generate any new cells (see section on positional and symbolic references above).

Using aggregate functions in rules:

One can use aggregate functions such as max() and avg() on the right hand side of rules. Here's an example:

--query (9) - aggregate functions

select
  region,
  division,
  sale_year,
  sale_month,
  sale_amount
from
  model_sales
model return updated rows
partition by (region, division)
dimension by (sale_year,sale_month)
measures (sale_amount)
rules
  (sale_amount[2005,1] =avg(sale_amount) [2004,sale_month between 1 and 12]*2)

This assigns twice the average sale amount for 2004 to the Jan 2005 sale value. (Reminder: rules are applied independently to each partition.)

Looping:

The FOR loop provides a means to encapsulate multiple rules. It also provides a method to work around the limitations of symbolic referencing which I mentioned earlier. The syntax of the FOR construct is best explained through a quick example, where we generate sales values for 2005 based on data for the corresponding month in 2004:

--query (10) - for loop

select
  region,
  division,
  sale_year,
  sale_month,
  sale_amount
from
  model_sales
model return updated rows
partition by (region, division)
dimension by (sale_year,sale_month)
measures (sale_amount)
rules
  (sale_amount[2005,for sale_month from 1 to 12 increment 1] =sale_amount[2004,cv()]*6)

Contrast the results returned by query (10) with our attempt to generate sales values for 2005 using wildcard dimensions. I'll leave it to the reader to experiment on variations of the FOR the loop. Further details can be found in the
Oracle docs

Summary

This article introduced the MODEL clause, a new Oracle SQL extension that enables analysts to perform spreadsheet-like computations using SQL. I have attempted to cover just enough of the basics to enable readers to start experimenting with this powerful new capability. The topics covered include terminology (partitions, dimensions, measures and rules), rule specification and semantics, dimension referencing, wildcard dimensions, aggregation functions and looping. I hope this article has piqued your interest in the MODEL clause. As always, the Oracle documentation provides much more information and detail.

Back to the top