|
|
| ||||||||
|
Oracle Articles
Oracle Analytic Functions: An Introduction Note: A script to create and populate the sample schema can be found here Introduction: SQL is a compact yet versatile language - a surprisingly large variety of questions can be answered using standard SQL. However, certain classes of questions are extremely hard to answer using standard SQL. For example consider the following:
Ranking functions: These functions rank records within a set or subset of data according to specified criteria. Their utility is best explained by working through a set of examples. Consider the following: we want to rank divisions and regions by sales for 2004. The traditional SQL solution to the problem would be as follows:
--query (1) - region / division ranked by sales for 2004 select region_name, division_name, sum_amount, rownum from (select t1.region_name, t2.division_name, sum(t3.amount) sum_amount from region t1, division t2, sales t3 where t1.region_id=t3.region_id and t2.division_id=t3.division_id and t3.year=2004 group by t1.region_name, t2.division_name order by 3 desc) A subquery is required because Oracle assigns row numbers before the ordering is done. The inner query does the ordering and the outer query does the ranking using rownum. Now compare the above query to the following, much more compact, one which uses an analytical ranking function:
--query (2) - region / division ranked by sales for 2004 select t1.region_name, t2.division_name, sum(t3.amount), rank() over (order by sum(t3.amount) desc) sales_rank from region t1, division t2, sales t3 where t1.region_id=t3.region_id and t2.division_id=t3.division_id and t3.year=2004 group by t1.region_name, t2.division_name As you'll probably agree, query (2) is far more compact than query (1). Now to deconstruct the unfamiliar bits in query (2):
--query (3) - division ranked by 2004 sales for each region
The only difference between query (2) and (3) is the partition by clause, which tells Oracle
that the data is to be ranked within each region. In general, the partition by clause defines
subgroups within which the analytic operation (aggregation, ranking etc.) is to be performed.select t1.region_name, t2.division_name, sum(t3.amount), rank() over (partition by t1.region_name order by sum(t3.amount) desc) sales_rank_by_region from region t1, division t2, sales t3 where t1.region_id=t3.region_id and t2.division_id=t3.division_id and t3.year=2004 group by t1.region_name, t2.division_name Other functions with usage similar to rank() are: dense_rank(), percent_rank() cume_dist() and ntile(). Check the Oracle documentation for details on these. The relevant book is the Oracle Database Data Warehousing Guide, available from the Oracle documentation site. Windowing aggregate functions: These functions can be used to perform aggregate operations - sums, counts, averages etc - on subgroups of data. Further, these aggregates can be performed over user specified windows or ranges. Again, the syntax is best illustrated through some examples, so here we go. The SQL below returns the monthly and year-to-date sales for 2004 by region and division:
--query (4) - YTD sales by region and division
The syntax, as you'll notice, is very similar to that used for ranking functions. The various
analytic clauses
are explained below:
select t1.region_name, t2.division_name, t3.month, t3.amount month_sales, sum(t3.amount) over (partition by t1.region_name, t2.division_name order by t3.month rows unbounded preceding) YTD_sales from region t1, division t2, sales t3 where t1.region_id=t3.region_id and t2.division_id=t3.division_id and t3.year=2004
--query (5) - 3 month rolling sum
In contrast, the following SQL calculates a centered three month rolling sum:select t1.region_name, t2.division_name, t3.month, t3.amount month_sales, sum(t3.amount) over (partition by t1.region_name, t2.division_name order by t3.month rows 2 preceding) three_mth_sum from region t1, division t2, sales t3 where t1.region_id=t3.region_id and t2.division_id=t3.division_id and t3.year=2004
--query (6) - centered 3 month rolling sum
The windowing clause has other variations based on ranges instead of number of rows. Further, it is
possible to vary the window size for each row. Check the Oracle documentation for details.select t1.region_name, t2.division_name, t3.month, t3.amount month_sales, sum(t3.amount) over (partition by t1.region_name, t2.division_name order by t3.month rows between 1 preceding and 1 following) ctrd_three_mth_sum from region t1, division t2, sales t3 where t1.region_id=t3.region_id and t2.division_id=t3.division_id and t3.year=2004 With the above syntax one can write single SQL statements to answer more complicated questions such as quarterly sales for 2004 by region and division. Here's the SQL:
--query (7) - quarterly sales for 2004 by region and division
Here we used analytical SQL to calculate a three monthly sum in the subquery, and then picked out
the quarter end months in the outer query.select region_name, division_name, month/3 quarter, sum_3_months quarterly_sales from (select t1.region_name, t2.division_name, t3.month, sum(t3.amount) over (partition by t1.region_name, t2.division_name order by t3.month rows 2 preceding) sum_3_months from region t1, division t2, sales t3 where t1.region_id=t3.region_id and t2.division_id=t3.division_id and t3.year=2004) where month in (3,6,9,12) We have dealt exclusively with sum in this discussion. There are quite a few other aggregating functions. Some commonly used ones include: avg, max, min and count. Check the Oracle documentation for more details. Reporting aggregate functions: Reporting aggregate functions return an aggregated value for the entire partition. This value, which is the same for each row in the partition, is returned with each row in the partition. Available functions include: sum, avg, max, min and count. The essential difference between reporting and windowing aggregations is the absence of a windowing clause (order by) in the former. Here's an example that calculates the maximum monthly sales by region and division:
--query (8) - maximum monthly sales for 2004 by region and division (all months)
This returns the maximum sales by region and division with every row - the same value is returned
with each row.
We can now use the above query to get just the maximum sales months like so:select t1.region_name, t2.division_name, t3.month, t3.amount mthly_sales, max(t3.amount) over (partition by t1.region_name, t2.division_name) max_mthly_sales from region t1, division t2, sales t3 where t1.region_id=t3.region_id and t2.division_id=t3.division_id and t3.year=2004
--query (9) - maximum monthly sales for 2004 by region / division (max mths only)
Here analytical SQL calculates the maximum in the subquery (same as query (8)), and we pick
out the rows for which the maximum equals monthly sales in the outer query.select region_name, division_name, month max_sales_month, max_mthly_sales from (select t1.region_name, t2.division_name, t3.month, t3.amount mthly_sales, max(t3.amount) over (partition by t1.region_name, t2.division_name) max_mthly_sales from region t1, division t2, sales t3 where t1.region_id=t3.region_id and t2.division_id=t3.division_id and t3.year=2004) where mthly_sales=max_mthly_sales Oracle also offers the ratio_to_report function, which calculates ratios based on partition-wide aggregates. Here's an example that reports the ratio of sales by region and division to total regional sales:
--query (10) - sales by region/div to total region sales
A couple of points to note in query (10):
select t1.region_name, t2.division_name, sum(t3.amount), sum(sum(t3.amount)) over (partition by t1.region_name) region_total, ratio_to_report(sum(t3.amount)) over (partition by t1.region_name) region_ratio from region t1, division t2, sales t3 where t1.region_id=t3.region_id and t2.division_id=t3.division_id and t3.year=2004 group by t1.region_name, t2.division_name
--query (11) - sales by region/div to total sales
Here we've simply omitted the partition by clause, thereby instructing Oracle to treat the
entire data set as a single partition.select t1.region_name, t2.division_name, sum(t3.amount), sum(sum(t3.amount)) over () region_total, ratio_to_report(sum(t3.amount)) over () region_ratio from region t1, division t2, sales t3 where t1.region_id=t3.region_id and t2.division_id=t3.division_id and t3.year=2004 group by t1.region_name, t2.division_name There is, of course, much more to reporting aggregates. Check the documentation for lots more. Lag/Lead functions: Comparing data across rows is hard using standard SQL. The lag and lead analytical functions are designed for just this type of problem. Here's an example:
--query (12) - yearly sales by region/division and comparison to previous year
Here the lag function compares sales for 2004 to that for 2003. The usage for
lead is very similar - see the documentation for some examples.select t1.region_name, t2.division_name, t3.year, sum(t3.amount) yearly_sales, sum(t3.amount)-lag(sum(t3.amount),1) over (partition by t1.region_name, t2.division_name order by t3.year) diff_prev_year from region t1, division t2, sales t3 where t1.region_id=t3.region_id and t2.division_id=t3.division_id group by t1.region_name, t2.division_name, t3.year End Note: Analytic functions are a very powerful extension to standard SQL. With it we can now answer questions that could previously be done only through procedural methods or via convoluted, poorly performing SQL. If you are a developer writing complex queries, always check to see if analytic functions can be used. I have often been surprised at how easily complex questions can be answered using analytics. If you are a DBA, do yourself a favour and ensure developers working on your databases are aware of analytic functions. Back to the top | ||||||||
|
|