Analytic functions in SQL Server 2012–Part II –

For the rest of this article, suppose you have a table describing events, with the following columns:

  • EmployeeID: the ID of the employee who triggered the event
  • ItemID: the ID of the item for which the event was triggered
  • EventDateTime: when the triggered occurred.

FIRST_VALUE / LAST_VALUE

These functions will get the first/ last value within the specified PARTITION. “First” and “last”  will be defined according to the ORDER specified in the OVER clause.

To get the ID of the employee who created an item (first event), you could use the following expression:

FIRST_VALUE(EmployeeID) OVER (PARTITION BY ItemID ORDER BY EventDateTime ASC)

The expression will return the value of EmployeeID for the row that has the lowest EventDateTime, amongst every rows having the same value for ItemID.

LAST_VALUE fits a similar role, but beware of the default window frame:

When no ROWS or RANGE clause has been specified in the OVER clause, RANGE UNBOUNDED PRECEDING AND CURRENT ROW is used.

Does anybody see any use for using LAST_VALUE with this window frame?

LAG / LEAD

LAG and LEAD allow you to get information from the n-th row before or after the current “row”.

The following expression returns the date time of the next event for the current ItemID:

LEAD( EventDateTime, 1, GETDATE() )  OVER (PARTITION BY ItemID ORDER BY EventDateTime)

LAG and LEAD accept no window frame (no implicit ROWS or RANGE clause).

Are these functions deterministic?

When you use these functions, remember the following: if the value you wish to return is not included in the ORDER BY clause, rows with different values may have the same order. In that case, you cannot guarantee which value will be returned.

Consider the following example:

WITH DATA (partitionID, orderingValue, Value) AS (
SELECT 1, 100, ‘VAL1’ UNION ALL
SELECT 1, 100, ‘VAL2’ UNION ALL
SELECT 2, 100, ‘VAL2’ UNION ALL
SELECT 2, 100, ‘VAL1’
)
SELECT
*
INTO #TMP
FROM DATA;

SELECT *
, FIRST_VALUE(Value) OVER (PARTITION BY partitionID ORDER BY orderingValue)
FROM #TMP;

CREATE CLUSTERED INDEX myTempIndex
ON #TMP (partitionID, orderingValue, Value);

SELECT *
, FIRST_VALUE(Value) OVER (PARTITION BY partitionID ORDER BY orderingValue)
FROM #TMP;

DROP INDEX myTempIndex ON #TMP;
DROP TABLE #TMP;

Creating a clustered INDEX on the table changed the outcome of the query.

Even without changes to an INDEX, I suspect a change in an execution plan may also lead to such changes.

Advertisements

Analytic functions in SQL Server 2012 – Part I – Distributions

In this article, I take a look at the analytic functions in SQL Server 2012 to handle frequency distributions.

CUME_DIST

The CUME_DIST function returns the percentage of records that are of same or lower order than the current record.

The expression:
CUME_DIST() OVER(ORDER BY MyValue)
Is equivalent (neglecting precision) to :
1.0 * COUNT(*)
OVER (ORDER BY MyValue
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
/ COUNT(*) OVER ()

Possible scenario: calculate the percentage of households whose income is not greater than the current one.

PERCENT_RANK

The PERCENT_RANK rank is similar to the CUME_DIST function.

The expression:
PERCENT_RANK() OVER(ORDER BY MyValue)

Is equivalent to (neglecting integer division) :

(  RANK() OVER (ORDER BY MyValue) –1      )
/ ( COUNT(*) OVER ()         –1      )

Possible scenario: for each household, calculate the percentage of the other households that earn less than the current one.

PERCENTILE_DISC

Returns the smallest value such that CUME_DIST is equal to or greater than the provided probability.

PERCENTILE_DISC (0.4)
WITHIN GROUP ( ORDER BY MyValue ASC )
OVER()

A few remarks:

  • NULL values are ignored, although this is not the case for the CUME_DIST function;
  • The OVER clause is mandatory, although it may be empty;
  • No ORDER BY is allowed in the OVER clause; there is a specific WITHIN GROUP clause to specify the ordering of the partition.

Possible scenario: what is the income under which 10% of households fall?

PERCENTILE_CONT

This is an interpolated version of PERCENTILE_DISC. It shares the same syntax. The same remarks as above apply.

Fun facts about Transact-SQL syntax

Magic COUNT

The following query is valid:

SELECT COUNT(*)

It returns 1. Can someone see why?

GROUP BY nothing

The following group by clause is valid:

GROUP BY ()

It groups by “nothing”. A query with such a GROUP BY clause will only compile if it only contains aggregate expressions. As a result, there barely is a point in using it, except maybe if you dynamically build your SQL.

OVER the whole set

If you need aggregates over a complete set, you can use one of the following OVER clauses:

OVER (PARTITION BY 1)

OVER ()

ALL and ANY

ALL and ANY cannot be used on the left-hand side of a scalar comparison.

This query is valid:

1 = ALL( SELECT 1)

This query is not :

ALL( SELECT 1) = 1

%d bloggers like this: