|
|
| ||||||||
|
Oracle Articles
Oracle SQL Aggregation Extensions: This article is introduces Oracle's CUBE and ROLLUP extensions to the GROUP BY clause. The schema and sample data population scripts that accompany the discussion can be accessed here. The usual way to aggregate data in relational databases is via the SQL group by clause. A limitation of this method is that one cannot get higher level aggregations within a single statement. For example, consider the following SQL statements:
--statement 1 select t1.region_name, t2.year, sum(t2.amount) from region t1, sales t2 where t1.region_id=t2.region_id group by t1.region_name, t2.year order by 1, 2 and
--statement 2 select t1.region_name, t2.year, t2.month, sum(t2.amount) from region t1, sales t2 where t1.region_id=t2.region_id group by t1.region_name, t2.year, t2.month order by 1, 2, 3 The first statement yields aggregated sales by region and year and the second gets the sales by region, year and month. Note that the "order by" is not required, but has been put in to ensure that the results appear in the specified order ("group by", by itself, does not guarantee ordering). Very often though, one may additionally require the higher level aggregates, i.e. sales by region (over all years), sales by year (over all regions) and a grand total (over all years and regions). Normally one would have to write separate SQL statements to get these. Oracle now (as of version 8i) offers the "Cube" and "Rollup" extensions to the "Group by" clause, which enables us to do higher level aggregations within the same statement. The following paragraphs explore these new aggregation extensions to SQL. The CUBE extension, applied to statement 1, gets us the sales by region, sales by year and total sales, along with the other rows returned by the standard grouping. Here's the SQL:
--statement 3 select t1.region_name, t2.year, sum(t2.amount) from region t1, sales t2 where t1.region_id=t2.region_id group by cube(t1.region_name, t2.year) order by 1, 2 This produces what's often called a cross-tabular report - i.e. all possible higher level aggregates are calculated. Such a report is suitable when sums are being done across independent variables or dimensions (in data warehouse terminology). In many cases, however, dimensions are not independent. A simple example is a month/year hierarchy - the two are dependent - but often appear as separate grouping columns. In such a situation one may want to get aggregate sales by year (over all months), but the sums by month over all years would generally not be of interest. The "Rollup" extension offers a way to specify which columns are to be summed over. Here's statement 2 with the rollup clause applied to all the grouping columns:
--statement 4 select t1.region_name, t2.year, t2.month, sum(t2.amount) from region t1, sales t2 where t1.region_id=t2.region_id group by rollup(t1.region_name, t2.year, t2.month) order by 1, 2, 3 This returns the sum over months by year and region, sum over year and month by region, and a grand total over year, month and region, and of course all rows returned by statement 2. From an inspection of the higher level aggregates it is seen that rollups are done from right to left within the rollup clause. Rollups can be done over any dimensions - not only hierarchically related ones. For instance, we can modify the second last line of statement 4 to "rollup(t2.month, t1.region, t2.year)", although you may be hard pressed to attach a sensible business meaning to some of the rows returned in the result. It should also be noted that the both cube and rollup can be applied to a part of the grouped columns - for instance, the statement:
--statement 5 select t1.region_name, t2.year, t2.month, sum(t2.amount) from region t1, sales t2 where t1.region_id=t2.region_id group by t1.region_name, rollup(t2.year, t2.month) order by 1, 2, 3 is perfectly valid. It returns all rows returned by statement 4 except the grand total. You are encouraged to check out what happens if you apply the cube clause to a subset of the grouping columns in statement 1. If you've run some of the above statements using the example data, you would have noticed that the summed over columns are populated by nulls in the higher level aggregate rows. So how do we distinguish these rows from genuine nulls? One way is to use the grouping() function, which returns 1 if the argument column is aggregated and 0 if it isn't. Here's an example based on statement 3:
--statement 6 select t1.region_name, t2.year, sum(t2.amount), grouping(t1.region_name) region_summed, grouping(t2.year) year_summed from region t1, sales t2 where t1.region_id=t2.region_id group by cube(t1.region_name, t2.year) order by 1, 2 What's more, one can return only the higher level aggregates by using the grouping() function in a "having" clause, like so:
--statement 6 select t1.region_name, t2.year, sum(t2.amount), grouping(t1.region_name) region_summed, grouping(t2.year) year_summed from region t1, sales t2 where t1.region_id=t2.region_id group by cube(t1.region_name, t2.year) having grouping(t1.region_name)=1 or grouping(t2.year)=1 order by 1, 2 Oracle 9i offers a few more ways to filter data and specify groupings, but I won't go into these here. Please check the Oracle docs for details. The relevant reference is chapter 18 of the "Oracle 9i Data Warehousing Guide". In closing: aggregation extensions to SQL can ease the creation and maintenance of summary tables in data warehouses. Further, they enable you to save your application server some stress, and your end-users some time, by offering a way to do rollups and crosstabs directly in the database. Back to the top | ||||||||
|
|