Data Management
Project Management
Archived stuff
Cryptic crossword.
A bit about me.
Read my blog
Recent additions
Home
Oracle Articles

Oracle Analytic Functions: An Introduction


Note: A script to create and populate the sample schema can be found here

Introduction:

SQL is a compact yet versatile language - a surprisingly large variety of questions can be answered using standard SQL. However, certain classes of questions are extremely hard to answer using standard SQL. For example consider the following:
  • Ranking data within subsets of a data set.
  • Aggregating data within subsets of a data set.
  • Performing aggregations over moving windows.
  • Displaying and comparing aggregates to individual entries within a single query.
  • Comparing two or more rows within a given data set.
Typically one would have to tackle such calculations using complicated (and hence poorly performing) SQL or via procedural methods. In version 8i Oracle introduced analytic functions - extensions to standard SQL - which are designed to tackle such problems . The present article introduces analytic functions, and shows how they can be used to answer such questions. A schema creation script to accompany the discussion can be found here. You may want to create and populate the tables in a test schema so that you can run and explore variations of the SQL discussed in this article.

Ranking functions:

These functions rank records within a set or subset of data according to specified criteria. Their utility is best explained by working through a set of examples. Consider the following: we want to rank divisions and regions by sales for 2004. The traditional SQL solution to the problem would be as follows:

--query (1) - region / division ranked by sales for 2004

select
  region_name,
  division_name,
  sum_amount,
  rownum
from
  (select
    t1.region_name,
    t2.division_name,
    sum(t3.amount) sum_amount
  from
    region t1,
    division t2,
    sales t3
  where
    t1.region_id=t3.region_id
  and
    t2.division_id=t3.division_id
  and
    t3.year=2004
  group by
    t1.region_name,
    t2.division_name
  order by 3 desc)


A subquery is required because Oracle assigns row numbers before the ordering is done. The inner query does the ordering and the outer query does the ranking using rownum. Now compare the above query to the following, much more compact, one which uses an analytical ranking function:

--query (2) - region / division ranked by sales for 2004

select
  t1.region_name,
  t2.division_name,
  sum(t3.amount),
  rank() over (order by sum(t3.amount) desc) sales_rank
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004
group by
  t1.region_name,
  t2.division_name


As you'll probably agree, query (2) is far more compact than query (1).

Now to deconstruct the unfamiliar bits in query (2):

  1. As its name suggests, rank() is the ranking function. The over .. syntax associated with rank() indicates that it is an analytic function. This syntax is common to all analytic functions as we'll see in the course of this discussion.
  2. The order by sum(t3.amount) desc tells Oracle that the data is to be ranked by total sales for the year.
So far so good. However, ranking functions really come into their own when there's a need to rank data within subgroups of a result set. Here's the SQL to rank sales by division within each region:

--query (3) - division ranked by 2004 sales for each region

select
  t1.region_name,
  t2.division_name,
  sum(t3.amount),
  rank() over (partition by t1.region_name order by sum(t3.amount) desc) sales_rank_by_region
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004
group by
  t1.region_name,
  t2.division_name

The only difference between query (2) and (3) is the partition by clause, which tells Oracle that the data is to be ranked within each region. In general, the partition by clause defines subgroups within which the analytic operation (aggregation, ranking etc.) is to be performed.

Other functions with usage similar to rank() are: dense_rank(), percent_rank() cume_dist() and ntile(). Check the Oracle documentation for details on these. The relevant book is the Oracle Database Data Warehousing Guide, available from the Oracle documentation site.

Windowing aggregate functions:

These functions can be used to perform aggregate operations - sums, counts, averages etc - on subgroups of data. Further, these aggregates can be performed over user specified windows or ranges. Again, the syntax is best illustrated through some examples, so here we go. The SQL below returns the monthly and year-to-date sales for 2004 by region and division:

--query (4) - YTD sales by region and division

select
  t1.region_name,
  t2.division_name,
  t3.month,
  t3.amount month_sales,
  sum(t3.amount) over (partition by t1.region_name, t2.division_name
    order by t3.month rows unbounded preceding) YTD_sales

from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004

The syntax, as you'll notice, is very similar to that used for ranking functions. The various analytic clauses are explained below:
  1. The partition by clause defines subgroups within which the aggregation is to be done. In the above example, the sum is to be calculated for each region/division combination.
  2. The order by defines the order in which data is to be cumulated within each subgroup. In the above, data is to be summed year-to-date (also see next item).
  3. The new element, rows unbounded preceding, is a windowing clause. The windowing clause defines the aggregation window -i.e. how many rows are to be included in the aggregation. In the above example, rows unbounded preceding tells Oracle that the data is to be aggregated from the start of the subgroup to the current row. As it happens this is the default, so we could have omitted the rows unbounded preceding clause.
Moving on, the next statement calculates a rolling three monthly sum based on the current and previous two months:

--query (5) - 3 month rolling sum

select
  t1.region_name,
  t2.division_name,
  t3.month,
  t3.amount month_sales,
  sum(t3.amount) over (partition by t1.region_name, t2.division_name
    order by t3.month rows 2 preceding)
three_mth_sum
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004

In contrast, the following SQL calculates a centered three month rolling sum:

--query (6) - centered 3 month rolling sum

select
  t1.region_name,
  t2.division_name,
  t3.month,
  t3.amount month_sales,
  sum(t3.amount) over (partition by t1.region_name, t2.division_name
    order by t3.month rows between 1 preceding and 1 following) ctrd_three_mth_sum
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004

The windowing clause has other variations based on ranges instead of number of rows. Further, it is possible to vary the window size for each row. Check the Oracle documentation for details.

With the above syntax one can write single SQL statements to answer more complicated questions such as quarterly sales for 2004 by region and division. Here's the SQL:

--query (7) - quarterly sales for 2004 by region and division

select
  region_name,
  division_name,
  month/3 quarter,
  sum_3_months quarterly_sales
from
  (select
    t1.region_name,
    t2.division_name,
    t3.month,
    sum(t3.amount) over (partition by t1.region_name, t2.division_name
      order by t3.month rows 2 preceding) sum_3_months
  from
    region t1,
    division t2,
    sales t3
  where
    t1.region_id=t3.region_id
  and
    t2.division_id=t3.division_id
  and
    t3.year=2004)
where
  month in (3,6,9,12)

Here we used analytical SQL to calculate a three monthly sum in the subquery, and then picked out the quarter end months in the outer query.

We have dealt exclusively with sum in this discussion. There are quite a few other aggregating functions. Some commonly used ones include: avg, max, min and count. Check the Oracle documentation for more details.

Reporting aggregate functions:

Reporting aggregate functions return an aggregated value for the entire partition. This value, which is the same for each row in the partition, is returned with each row in the partition. Available functions include: sum, avg, max, min and count. The essential difference between reporting and windowing aggregations is the absence of a windowing clause (order by) in the former. Here's an example that calculates the maximum monthly sales by region and division:

--query (8) - maximum monthly sales for 2004 by region and division (all months)

select
  t1.region_name,
  t2.division_name,
  t3.month,
  t3.amount mthly_sales,
  max(t3.amount) over (partition by t1.region_name, t2.division_name)
    max_mthly_sales
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004

This returns the maximum sales by region and division with every row - the same value is returned with each row. We can now use the above query to get just the maximum sales months like so:

--query (9) - maximum monthly sales for 2004 by region / division (max mths only)

select   region_name,
  division_name,
  month max_sales_month,
  max_mthly_sales
from
  (select
    t1.region_name,
    t2.division_name,
    t3.month,
    t3.amount mthly_sales,
    max(t3.amount) over (partition by t1.region_name, t2.division_name)
      max_mthly_sales
  from
    region t1,
    division t2,
    sales t3
  where
    t1.region_id=t3.region_id
  and
    t2.division_id=t3.division_id
  and
    t3.year=2004)
where
    mthly_sales=max_mthly_sales

Here analytical SQL calculates the maximum in the subquery (same as query (8)), and we pick out the rows for which the maximum equals monthly sales in the outer query.

Oracle also offers the ratio_to_report function, which calculates ratios based on partition-wide aggregates. Here's an example that reports the ratio of sales by region and division to total regional sales:

--query (10) - sales by region/div to total region sales

select
  t1.region_name,
  t2.division_name,
  sum(t3.amount),
  sum(sum(t3.amount)) over (partition by t1.region_name) region_total,
  ratio_to_report(sum(t3.amount)) over (partition by t1.region_name) region_ratio
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004
group by
  t1.region_name,
  t2.division_name

A couple of points to note in query (10):
  1. Here the ratio_to_report computes the ratio of region/division sum to the regional sum. the partition by clause determines the denominator of the ratio.
  2. Although not required for calculating the ratio, I have also shown how the regional sum can be obtained via summing the region/division sum - this is also an example of a reporting aggregate since there is no windowing clause present.
Finally in case you want to treat the entire data set as a one partition for reporting aggregates and ratios:

--query (11) - sales by region/div to total sales

select
  t1.region_name,
  t2.division_name,
  sum(t3.amount),
  sum(sum(t3.amount)) over () region_total,
  ratio_to_report(sum(t3.amount)) over () region_ratio
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
and
  t3.year=2004
group by
  t1.region_name,
  t2.division_name

Here we've simply omitted the partition by clause, thereby instructing Oracle to treat the entire data set as a single partition.

There is, of course, much more to reporting aggregates. Check the documentation for lots more.

Lag/Lead functions:

Comparing data across rows is hard using standard SQL. The lag and lead analytical functions are designed for just this type of problem. Here's an example:

--query (12) - yearly sales by region/division and comparison to previous year

select
  t1.region_name,
  t2.division_name,
  t3.year,
  sum(t3.amount) yearly_sales,
  sum(t3.amount)-lag(sum(t3.amount),1) over (partition by t1.region_name, t2.division_name
    order by t3.year) diff_prev_year
from
  region t1,
  division t2,
  sales t3
where
  t1.region_id=t3.region_id
and
  t2.division_id=t3.division_id
group by
  t1.region_name,
  t2.division_name,
  t3.year

Here the lag function compares sales for 2004 to that for 2003. The usage for lead is very similar - see the documentation for some examples.

End Note:

Analytic functions are a very powerful extension to standard SQL. With it we can now answer questions that could previously be done only through procedural methods or via convoluted, poorly performing SQL. If you are a developer writing complex queries, always check to see if analytic functions can be used. I have often been surprised at how easily complex questions can be answered using analytics. If you are a DBA, do yourself a favour and ensure developers working on your databases are aware of analytic functions.

Back to the top