Analytic functions in SQL Server 2012 – Part II

For the rest of this article, suppose you have a table describing events, with the following columns:

  • EmployeeID: the ID of the employee who triggered the event
  • ItemID: the ID of the item for which the event was triggered
  • EventDateTime: when the event occurred.

FIRST_VALUE / LAST_VALUE

These functions return the first/last value within the specified partition. “First” and “last” are defined by the ORDER BY specified in the OVER clause.

To get the ID of the employee who created an item (first event), you could use the following expression:

FIRST_VALUE(EmployeeID) OVER (PARTITION BY ItemID ORDER BY EventDateTime ASC)

The expression returns the value of EmployeeID for the row that has the lowest EventDateTime, among all rows sharing the same value for ItemID.
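
As a minimal sketch (the table name EventLog is mine; the columns are the ones listed at the top of the article), the following query tags every event with the creator of its item:

-- Show, next to each event, the employee who created the item.
SELECT
    ItemID
    , EmployeeID
    , EventDateTime
    , FIRST_VALUE(EmployeeID)
        OVER (PARTITION BY ItemID ORDER BY EventDateTime ASC) AS CreatorID
FROM EventLog;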

LAST_VALUE fits a similar role, but beware of the default window frame:

When no ROWS or RANGE clause has been specified in the OVER clause, RANGE UNBOUNDED PRECEDING AND CURRENT ROW is used.

Does anybody see any use for LAST_VALUE with this window frame?
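
For what it is worth, here is a sketch of how to get a genuine “last value” by overriding the default frame (again using the hypothetical EventLog table):

-- Without the explicit ROWS clause, LAST_VALUE would return the current
-- row's value (or that of its last peer), not the partition's last value.
SELECT
    ItemID
    , EventDateTime
    , LAST_VALUE(EmployeeID)
        OVER (PARTITION BY ItemID ORDER BY EventDateTime
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS LastEmployeeID
FROM EventLog;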

LAG / LEAD

LAG and LEAD allow you to get information from the n-th row before or after the current row.

The following expression returns the date and time of the next event for the current ItemID; the third argument is the default returned when there is no next event:

LEAD(EventDateTime, 1, GETDATE()) OVER (PARTITION BY ItemID ORDER BY EventDateTime)

LAG and LEAD accept no window frame (ROWS and RANGE clauses are not allowed).
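
A small sketch with LAG, again against the hypothetical EventLog table: compute the time elapsed since the previous event of the same item.

-- LAG returns NULL for the first event of each item (no default supplied),
-- so MinutesSincePrevious is NULL there as well.
SELECT
    ItemID
    , EventDateTime
    , DATEDIFF(MINUTE,
        LAG(EventDateTime, 1) OVER (PARTITION BY ItemID ORDER BY EventDateTime),
        EventDateTime) AS MinutesSincePrevious
FROM EventLog;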

Are these functions deterministic?

When you use these functions, remember the following: if the ORDER BY clause does not order rows uniquely, rows with different values may tie for the same position. In that case, you cannot guarantee which value will be returned.

Consider the following example:

WITH DATA (partitionID, orderingValue, Value) AS (
    SELECT 1, 100, 'VAL1' UNION ALL
    SELECT 1, 100, 'VAL2' UNION ALL
    SELECT 2, 100, 'VAL2' UNION ALL
    SELECT 2, 100, 'VAL1'
)
SELECT *
INTO #TMP
FROM DATA;

SELECT *
, FIRST_VALUE(Value) OVER (PARTITION BY partitionID ORDER BY orderingValue)
FROM #TMP;

CREATE CLUSTERED INDEX myTempIndex
ON #TMP (partitionID, orderingValue, Value);

SELECT *
, FIRST_VALUE(Value) OVER (PARTITION BY partitionID ORDER BY orderingValue)
FROM #TMP;

DROP INDEX myTempIndex ON #TMP;
DROP TABLE #TMP;

Creating a clustered index on the table changed the outcome of the query.

Even without index changes, I suspect a change in the execution plan could also alter the results.

Analytic functions in SQL Server 2012 – Part I – Distributions

In this article, I take a look at the analytic functions in SQL Server 2012 to handle frequency distributions.

CUME_DIST

The CUME_DIST function returns the fraction of rows that rank the same as or lower than the current row.

The expression:

CUME_DIST() OVER (ORDER BY MyValue)

is equivalent (neglecting precision) to:

1.0 * COUNT(*)
    OVER (ORDER BY MyValue
          RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
/ COUNT(*) OVER ()

Note the RANGE frame rather than ROWS: CUME_DIST gives all tied rows the same value, whereas a ROWS frame would give them different running counts.

Possible scenario: calculate the percentage of households whose income is not greater than the current one.
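
A sketch of that scenario, assuming a hypothetical Households table with HouseholdID and Income columns:

SELECT
    HouseholdID
    , Income
    -- Fraction of households earning no more than this one.
    , CUME_DIST() OVER (ORDER BY Income) AS ShareNotGreater
FROM Households;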

PERCENT_RANK

The PERCENT_RANK function is similar to CUME_DIST.

The expression:

PERCENT_RANK() OVER (ORDER BY MyValue)

is equivalent (barring integer division; multiply by 1.0 to get a true ratio) to:

( RANK() OVER (ORDER BY MyValue) - 1 )
/ ( COUNT(*) OVER () - 1 )

Possible scenario: for each household, calculate the percentage of the other households that earn less than the current one.
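
The corresponding sketch, on the same hypothetical Households table:

SELECT
    HouseholdID
    , Income
    -- 0 for the poorest household, 1 for the richest.
    , PERCENT_RANK() OVER (ORDER BY Income) AS ShareEarningLess
FROM Households;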

PERCENTILE_DISC

Returns the smallest value such that CUME_DIST is equal to or greater than the provided probability.

PERCENTILE_DISC(0.4)
    WITHIN GROUP (ORDER BY MyValue ASC)
    OVER ()

A few remarks:

  • NULL values are ignored, unlike with the CUME_DIST function;
  • The OVER clause is mandatory, although it may be empty;
  • No ORDER BY is allowed in the OVER clause; there is a specific WITHIN GROUP clause to specify the ordering of the partition.

Possible scenario: what is the income under which 10% of households fall?
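
A sketch for that scenario, once more on the hypothetical Households table; note the DISTINCT, since the analytic function repeats the same value on every row:

SELECT DISTINCT
    PERCENTILE_DISC(0.1) WITHIN GROUP (ORDER BY Income)
        OVER () AS FirstDecileIncome
FROM Households;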

PERCENTILE_CONT

This is an interpolated version of PERCENTILE_DISC. It shares the same syntax. The same remarks as above apply.
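
A self-contained sketch contrasting the two (the values are made up):

WITH DATA (v) AS (
    SELECT 10 UNION ALL SELECT 20 UNION ALL SELECT 30 UNION ALL SELECT 40
)
SELECT DISTINCT
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY v) OVER () AS MedianDisc -- 20, an actual value from the set
    , PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY v) OVER () AS MedianCont -- 25.0, interpolated
FROM DATA;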

Fun facts about Transact-SQL syntax

Magic COUNT

The following query is valid:

SELECT COUNT(*)

It returns 1. Can someone see why?

GROUP BY nothing

The following group by clause is valid:

GROUP BY ()

It groups by “nothing”. A query with such a GROUP BY clause will compile only if its select list contains nothing but aggregate expressions. As a result, there is barely any point in using it, except perhaps when you build your SQL dynamically.
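
As a quick illustration against a system view, the following is valid and returns a single row, exactly as it would without the GROUP BY clause:

SELECT COUNT(*) AS Total
FROM sys.objects
GROUP BY ();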

OVER the whole set

If you need aggregates over a complete set, you can use one of the following OVER clauses:

OVER (PARTITION BY 1)

OVER ()
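
A classic use is a share-of-total calculation; a quick sketch on throwaway data:

SELECT
    v
    -- Each row's share of the grand total.
    , 1.0 * v / SUM(v) OVER () AS ShareOfTotal
FROM (VALUES (1), (2), (3)) AS T (v);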

ALL and ANY

ALL and ANY cannot be used on the left-hand side of a scalar comparison.

This comparison is valid:

1 = ALL (SELECT 1)

This one is not:

ALL (SELECT 1) = 1
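
Since such a comparison is a predicate rather than a standalone query, here is a runnable sketch:

IF 1 = ALL (SELECT 1)
    PRINT 'the comparison holds';
-- IF ALL (SELECT 1) = 1 ... would not compile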

OVER clause in SQL Server 2012

Changes to the OVER clause

RANGE / ROWS clause

The new ROWS/RANGE clause in SQL Server 2012 allows us to specify the window frame over which our expressions are evaluated.

The following ROWS clause, for example, restricts a calculation to the current row and all preceding rows within the defined partition, as ordered by the ORDER BY clause:

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Note that with a RANGE frame, CURRENT ROW actually means all rows in the current partition that share the ordering value (as defined by the ORDER BY clause) of the current row; with a ROWS frame, as above, it means just the current row.

Self joins are no longer required to calculate running values.

As an example, the following query returns an event history from an event list. I considered a CRM system where cases/incidents/tickets can be created, closed, and reopened. The raw event list is transformed into an event history, displaying for each incident the periods during which it was open.

WITH EventList AS (
    SELECT *
    FROM (
        VALUES
          (1, 'NEW',    CAST('2012-01-04 15:00:00' AS datetime))
        , (1, 'CLOSE',  CAST('2012-01-04 15:37:00' AS datetime))
        , (2, 'NEW',    CAST('2012-01-05 15:00:00' AS datetime))
        , (2, 'CLOSE',  CAST('2012-01-06 15:00:00' AS datetime))
        , (2, 'REOPEN', CAST('2012-01-07 15:00:00' AS datetime))
        , (2, 'CLOSE',  CAST('2012-01-09 15:00:00' AS datetime))
        , (1, 'REOPEN', CAST('2012-01-18 15:00:00' AS datetime))
    ) AS DATA (ID, EventCode, EventDate)
)
SELECT
    ID
    , EventCode
    , EventDate AS StartDate
    , MIN(CASE WHEN EventCode = 'CLOSE' THEN EventDate END)
        OVER (PARTITION BY ID ORDER BY EventDate
              ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
        AS EndDate
FROM EventList;
Nice and simple.

A questionable design decision

Due to a questionable design decision, the behavior of the new extended OVER clause may be counter-intuitive.

Consider the expression:
COUNT(EventDate) OVER (PARTITION BY ID)

This gives you the number of rows for the current ID.

The following expression is now legal in SQL Server 2012; earlier versions would have thrown an error (Incorrect syntax near 'order').

COUNT(EventDate)
OVER (PARTITION BY ID ORDER BY EventDate)

Since COUNT is an additive function (meaning the order of evaluation does not matter), I expected this expression to return exactly the same results. Furthermore, I expected the result to be independent of the column specified in the ORDER BY clause.

However, if you use such an expression, you will find that SQL Server actually restricts the window depending on the columns listed in the ORDER BY clause. The MSDN documentation is quite clear about this: 'If ROWS/RANGE is not specified but ORDER BY is specified, RANGE UNBOUNDED PRECEDING AND CURRENT ROW is used as default for window frame.'

If I provide no ORDER BY or ROWS/RANGE clause, SQL Server will assume RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

Why should I expect a different behavior, when I only explicitly modify the ORDER BY clause?
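
If you want to keep the ORDER BY (say, for other expressions in the same query) and still count over the whole partition, the workaround is to spell the frame out explicitly; a sketch:

COUNT(EventDate)
OVER (PARTITION BY ID ORDER BY EventDate
      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

This returns the same count as OVER (PARTITION BY ID) alone.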