Using a self-signed SSL certificate with SQL Server

Learning to configure SQL Server to use a self-signed SSL certificate was not really part of my training plan. However, Data Explorer required SSL encryption to connect to SQL Server. On the very day I managed to make it work, the DE team released an update that waives this requirement… *sigh*…

The steps described below are just for development and tests, to connect to a local instance of SQL Server. Best practices for SSL encryption and server security may look different.

Here is a summary of how I did it on my machine (SQL Server 2008 and 2012, Express editions included, on Windows 7 Ultimate).

Short version

To configure SQL Server to use a self-signed SSL certificate, you will have to:

  1. Create a self-signed certificate
  2. Set permissions for this certificate
  3. Configure SQL Server to use this certificate

Point number 2 is the one I had the most problems with. The SQL Server log had entries like these:

  • “The server could not load the certificate it needs to initiate an SSL connection. It returned the following error: 0x8009030d. Check certificates to make sure they are valid.”
  • “Unable to load user-specified certificate …”
  • “… Unable to initialize SSL support. Cannot find object or property.”

Solution: Grant SQL Server rights to read the private key for the certificate.

All steps are described below.

Create a certificate

With IIS

If you have IIS on your machine:

  • Start IIS Manager
  • Go to Server Certificates
  • Right Click > Create Self-Signed Certificate

With makecert

The process as explained by SQL Server BOL uses the makecert command-line utility. Since I could not find makecert until I had solved my problem, I did not follow BOL. The tool ships with the Windows SDK.
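For reference, the kind of command BOL documents looks like this (a sketch: the CN value is a placeholder for your server's fully-qualified domain name, and the dates are just a wide validity window):

makecert -r -pe -n "CN=yourServerFQDN" -b 01/01/2000 -e 01/01/2036 -eku 1.3.6.1.5.5.7.3.1 -ss my -sr localMachine -sky exchange -sp "Microsoft RSA SChannel Cryptographic Provider" -sy 12

The -eku value marks the certificate for server authentication; -ss my -sr localMachine puts it in the Personal store of the local computer, where SQL Server expects to find it.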

Set permissions for the certificate

Find the service account name for your SQL Server instance

  • Start SQL Server Configuration Manager
  • Go to SQL Server Services
  • Select your instance
  • Right-click > Properties
  • Copy the account name for later re-use
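If you prefer a query over the GUI, this should return the same information (a sketch; it assumes a build where the sys.dm_server_services DMV is available, roughly SQL Server 2008 R2 SP1 or later):

SELECT servicename, service_account
FROM sys.dm_server_services;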

Launch the Certificates snap-in

One way to manage your certificates is to

  • Launch MMC (type mmc.exe in the Windows search bar)
  • Go to File > Add / Remove Snap-in …
  • Select Certificates (when prompted, choose Computer Account)

Grant SQL Server rights to read the private key for the certificate.

  • In MMC, go to Certificates (Local computer) > Personal > Certificates
  • The certificate should be listed there. (If you created the certificate using makecert, you may have to import it.)
  • Right click > All Tasks > Manage Private Keys
  • Add the service account for your instance of SQL Server
  • Give the service account Read permissions

Configure SQL Server to use this certificate

  • Start SQL Server Configuration Manager
  • Go to SQL Server Network configuration
  • Select your instance
  • Right-click > Properties > Certificate tab
  • Choose the certificate you just created
  • Restart the service
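Once the service is back up, you can check whether a given connection is actually encrypted; a minimal sketch, assuming the client requested encryption (or Force Encryption is enabled):

SELECT session_id, encrypt_option
FROM sys.dm_exec_connections
WHERE session_id = @@SPID;

encrypt_option should read TRUE for your own session.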

Conclusion

I tried to make this post as detailed as possible, but not too detailed. Any questions, suggestions, or corrections? Use the comments below.


Equivalent of VLOOKUP in DAX – Part II – using TOPN

In my previous entry, I presented the LOOKUPVALUE function. In this article, I will explain how you can use the TOPN function to mimic the behavior of VLOOKUP when doing “approximate” searches.

Ken Puls’s article presents a typical use case for this pattern. Let me start by briefly sketching his problem and describing how he solves it.

The effective tax rate problem

Ken uses data representing sales at a daily level. His problem is to determine the effective tax rate for a given row, when tax rates change over time.

In his article, he uses meaningful sales data. Since the core of the problem is to find the relevant tax rate for a specific date, and I want to focus on that, my ‘Data’ table looks like this:

Date
06/01/2010
07/01/2010
08/01/2010
03/01/2013
04/01/2013
05/01/2013

Here is the ‘Tax Rates’ table. [Start Date] represents the day when a tax rate became effective.

Start Date   Tax Rate
07/01/2010   0.12
04/01/2013   0.05

Note that Gerhard Brueckl’s problem is similar: he presents a way to assign the relevant fiscal period to a date in a dimension table. In his model, fiscal periods are defined by their start date as well.

A solution

Ken Puls’ idea is to add a calculated column to his data. This column will contain the value of [Start Date] for the relevant tax rate. He will then create a relationship between the ‘Data’ table and ‘Tax Rates’ table based on this new column.

His calculated column contains the MAX value of [Start Date] over all rows of ‘Tax Rates’ where ‘Tax Rates’[Start Date] is not greater than the current date in the data table. In other words, he replicates the logic of the following Excel formula:

= VLOOKUP( [Date] , 'Tax Rates', 1, TRUE)

The DAX expression for a calculated column looks like this:

[Tax Rate Start Date] := 
 CALCULATE(  MAX( 'Tax Rates'[Start Date]) 
   , FILTER('Tax Rates', 'Tax Rates'[Start Date] <= 'Data'[Date] )  
 )

It filters the ‘Tax Rates’ table, and calculates the max value of ‘Tax Rates'[Start Date] for the filtered table.

Extending the solution (TOPN)

We can use TOPN to mimic the behavior of VLOOKUP.

For an approximate search, we will follow this logic:

  • use a FILTER expression to eliminate irrelevant rows (as in the previous formula)
  • use the TOPN function to filter the result and only return the last row in ‘Tax Rates’
  • return the value in the required column for the row we just found.

Here, “the last row” means the row with the latest [Start Date].

Let us rewrite Ken’s expression using the TOPN function:

[Effective Tax Rate - Start Date] :=
 CALCULATE (  VALUES('Tax Rates'[Start Date])
  , TOPN(1
  , FILTER('Tax Rates', 'Tax Rates'[Start Date] <= 'Data'[Date]  )
   , 'Tax Rates'[Start Date]
    )
 )

The advantage of this pattern is that we can use it to query any column in our lookup table:

[Effective Tax Rate - Rate] :=
 CALCULATE ( VALUES('Tax Rates'[Tax Rate])
  , TOPN(1
   , FILTER('Tax Rates', 'Tax Rates'[Start Date] <= 'Data'[Date]  )
   , 'Tax Rates'[Start Date]
    )
 )

Dealing with ties

There might be a problem with the previous formula: TOPN may return more than one row. This will be the case if there are ties in your data.

In our example, if the ‘Tax Rates’ table has several entries for the same [Start Date], then the VALUES expression might return more than one value. In that case, the expression will fail.

Examples:

A duplicate row: the formula still works.

Start Date   Tax Rate
07/01/2010   0.12
04/01/2013   0.05
07/01/2010   0.12

Two different rates for the same start date: the formula returns an error.

Start Date   Tax Rate
07/01/2010   0.12
04/01/2013   0.05
07/01/2010   0.13

This may be totally fine. Most of the time, this is what we want: if a reference table contains contradictory information, we want to notice it.

However, in some cases, you may want your formula to ignore that, and take one of these values anyway.

In that case, you can add a sort expression to the TOPN expression, so as to guarantee that only one row (or value) is returned.
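For instance, a sketch under the assumption that, between two rates starting on the same day, taking the highest one is acceptable (the measure name and the tie-break column are mine):

[Effective Tax Rate - Rate, tie-break] :=
 CALCULATE ( VALUES('Tax Rates'[Tax Rate])
  , TOPN(1
   , FILTER('Tax Rates', 'Tax Rates'[Start Date] <= 'Data'[Date] )
   , 'Tax Rates'[Start Date], 0
   , 'Tax Rates'[Tax Rate], 0
    )
 )

The trailing 0s request descending order. If two rows still tie on both [Start Date] and [Tax Rate], TOPN returns both, but VALUES then sees a single distinct rate, so the expression still evaluates.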

Edit: Although this is what I first wrote below, SAMPLE does not behave like a TOPN that keeps ties. How I came to that conclusion is unclear to me now, as I can no longer find the test data I used. Even if it appears to hold in this specific case, you should not assume it will work every time.

However, you can wrap the TOPN expression within a SAMPLE expression to only get one row (or more if required).

You may also use the SAMPLE function instead of the TOPN function. What does it do? Basically the same as TOPN, except in case of a tie: SAMPLE will return exactly the required number of rows.

  • Which rows? The engine decides by itself.
  • Will it always return the same rows? This is not guaranteed.
  • Will the rows returned change on each execution? This is not guaranteed either.

As a side-note, if you want to write a query that returns exactly 3 rows from your ‘Data’ table, but want these rows to be randomly drawn between each execution, then you can do the following.

EVALUATE SAMPLE( 3,'Data', RAND() )

Bug in PowerPivot 2012 for Excel 2010

I spotted what I firmly believe to be a bug in PowerPivot. If you have already encountered this problem and/or can reproduce it, please support this ticket on MS Connect.

Edit: This bug was fixed with SQL Server 2012 Service Pack 2.

An MS employee might – someday, maybe – respond to my bug report and say this is “by design”. This employee may also advance some pretty good reasons why it is so. Until then, however, I cannot conceive of any good reason for this behavior.

Introduction

The data

For the rest of this post, I will use an embedded PowerPivot model based on a linked table. The table is called ‘SomeData’:

Country Product OtherAttribute
USA A Y
FRA B Y
USA C N
FRA C N
UK A N
GER B N
USA A N
FRA B N
FRA D N

What is this bug?

Instead of an elaborate discussion about this bug, let me just start introducing it with a mini-quiz.

Questions

Question 1:

Our model only contains the table ‘SomeData’ presented above. We run the following query against our model.

EVALUATE
CALCULATETABLE( VALUES('SomeData'),
    SUMMARIZE(
        FILTER( 'SomeData', [OtherAttribute] = "Y" )
        , [Country], [Product]
    )
)

What does the query return?

  1. All rows for which (Country = “USA” and Product = “A”) or (Country = “FRA” and Product = “B”).
  2. All rows for which Country = “USA” or Country = “FRA”
  3. All rows for which Product = “A” or Product = “B”.
  4. It depends

Question 2:

Our model only contains the table ‘SomeData’ presented above. We run the following query against our model.

EVALUATE
CALCULATETABLE( 'SomeData',
    SUMMARIZE(
        FILTER( 'SomeData', [OtherAttribute] = "Y" )
        , [Country], [Product]
    )
)

What does the query return?

  1. All rows for which (Country = “USA” and Product = “A”) or (Country = “FRA” and Product = “B”).
  2. All rows for which Country = “USA” or Country = “FRA”
  3. All rows for which Product = “A” or Product = “B”.
  4. It depends

Answers:

Question 1:

1. All rows for which (Country = “USA” and Product = “A”) or (Country = “FRA” and Product = “B”).

Question 2:

4. It depends: The result can be either 2 or 3.

Stating the problem

The only difference between the two queries is the presence (or absence) of the VALUES function in the first argument. This means the expression passed as the first argument to CALCULATETABLE changes the way the second argument is evaluated. It is only supposed to work the other way around: the filter arguments are supposed to affect the evaluation of the first argument.

In the second case, the evaluation of the context is incorrect. PowerPivot will only filter on one column. Which one depends on the order of the columns in the ‘SomeData’ Excel table.

As a result, a query might have different results depending on how the workbook was built.

A working example (direct filtering) …

Returning an arbitrary set of tuples

Consider the following query:

EVALUATE
SUMMARIZE(
    FILTER( 'SomeData'
        , [Country] = "USA" && [Product] = "A"
        || [Country] = "FRA" && [Product] = "B"
    )
    , [Country], [Product]
)

This query returns the set of (Country, Product) tuples according to my predicates:

SomeData[Country] SomeData[Product]
USA A
FRA B

Filtering data based on a calculated set

If I want to get all rows from ‘SomeData’ where the tuple (Country, Product) matches one of the tuples in the previous set, I can use the previous table expression as the second argument of a CALCULATETABLE expression. This gives the following query:

EVALUATE
CALCULATETABLE( 'SomeData',
    SUMMARIZE(
        FILTER( 'SomeData'
            , [Country] = "USA" && [Product] = "A"
            || [Country] = "FRA" && [Product] = "B"
        )
        , [Country], [Product]
    )
)

The results:

SomeData[Country] SomeData[Product] SomeData[OtherAttribute]
USA A Y
FRA B Y
USA A N
FRA B N

So far so good. Everything works as expected.

… that no longer works (cross-filtering)

Returning an arbitrary set of tuples

Instead of explicitly filtering the (Country, Product) tuples, as above, let us now filter our data on the [OtherAttribute] column.

EVALUATE
SUMMARIZE(
    FILTER( 'SomeData', [OtherAttribute] = "Y" )
    , [Country], [Product]
)

As before, this returns:

SomeData[Country] SomeData[Product]
USA A
FRA B

Filtering data based on a calculated set

Let us plug our new SUMMARIZE expression into our CALCULATETABLE expression.

EVALUATE
CALCULATETABLE( 'SomeData',
    SUMMARIZE(
        FILTER( 'SomeData', [OtherAttribute] = "Y" )
        , [Country], [Product]
    )
)

Unexpectedly, this time, the query returns the following results:

SomeData[Country] SomeData[Product] SomeData[OtherAttribute]
USA A Y
FRA B Y
UK A N
GER B N
USA A N
FRA B N

Note that the engine now returns all rows from ‘SomeData’ for which [Product] is in { “A”, “B” }. This is confirmed by SQL Server Profiler.

Replacing ‘SomeData’ with VALUES(‘SomeData’), however, will return the expected results.

Reordering the columns

Worse, let us just change the order of the columns in our Excel table.

Just drag and drop the [Country] column to the left of the [Product] column, and refresh the PowerPivot model. We have not made any changes to the PowerPivot model itself.

Let us run our previous query again.

The results now become:

Country Product OtherAttribute
USA A Y
FRA B Y
USA C N
FRA C N
USA A N
FRA B N
FRA D N

This means the query now only filters by Country. Once again, this is confirmed by SQL Server Profiler.

When a filter context is available

Finally, let us add another CALCULATETABLE expression, so that our query becomes:

EVALUATE
CALCULATETABLE(
    CALCULATETABLE( 'SomeData',
        SUMMARIZE(
            FILTER( 'SomeData', [OtherAttribute] = "Y" )
            , [Country], [Product]
        )
    )
    , 'SomeData'[Country] = "FRA"
)

‘SomeData’ is first filtered by [Country], which only returns the rows for France. These rows should then be filtered according to the value in [OtherAttribute], then summarized by [Country] and [Product]. The one remaining tuple (“FRA”, “B”) should then be used to filter ‘SomeData’ – overriding our initial filter. This is not the case: here, the rows only get filtered by [Country].

Once again, replacing ‘SomeData’ in the first argument of the innermost CALCULATETABLE expression with VALUES(‘SomeData’) will produce the expected results.

In that case, however, the result of the query does not seem to depend on the order of the columns in the Excel table.

Measures display the same behavior

Example with COUNTROWS

Do measures display the same behavior? Unfortunately, yes.

Let us consider this measure:

[Fine] :=
CALCULATE( COUNTROWS( VALUES('SomeData') ),
    SUMMARIZE(
        FILTER( 'SomeData', [OtherAttribute] = "Y" )
        , [Country], [Product]
    )
)

This measure works as expected:

Fine     A   B   Total
FRA          2   2
USA      2       2
Total    2   2   4

For the same reasons as in the previous sections, this measure does not:

[Not fine] :=
CALCULATE( COUNTROWS( 'SomeData' ),
    SUMMARIZE(
        FILTER( 'SomeData', [OtherAttribute] = "Y" )
        , [Country], [Product]
    )
)

The result:

Not fine   A   B   Total
FRA            2   4
USA        2       3
Total      3   3   6

Once again the grand total may change depending on the order of the columns in the linked table.

Example with SUM

You might think: “Not an issue. I always use VALUES when referencing tables or columns”.

Do you?

Standard aggregate functions like COUNT, COUNTA, SUM, … only accept a direct reference to a column.

To check this, let us add a new column to our table:

Country Product OtherAttribute CheckColumn
USA A Y 100000000
FRA B Y 20000000
USA C N 3000000
FRA C N 400000
UK A N 50000
GER B N 6000
USA A N 700
FRA B N 80
FRA D N 9

Now, let us observe the same unexpected behavior.

[Sum fails] :=
CALCULATE( SUM( 'SomeData'[CheckColumn] ),
    SUMMARIZE(
        FILTER( 'SomeData', [OtherAttribute] = "Y" )
        , [Country], [Product]
    )
)

The results:

Sum fails   A            B           Total
FRA                      20000080    20400089
USA         100000700                103000700
Total       100050700    20006080    120056780

Note that it should be easy to track which rows are in the filter context for each cell:

For example, the total for FRA is 20400089. This clearly indicates that the last two rows in our previous table (where [CheckColumn] = 80 and 9) were included.

Here is one way to make it work as expected.

[Sum works] :=
CALCULATE(
    CALCULATE( SUM( 'SomeData'[CheckColumn] ), VALUES('SomeData') ),
    SUMMARIZE(
        FILTER( 'SomeData', [OtherAttribute] = "Y" )
        , [Country], [Product]
    )
)

Sum works   A            B           Total
FRA                      20000080    20000080
USA         100000700                100000700
Total       100000700    20000080    120000780

Analytic functions in SQL Server 2012 – Part II

For the rest of this article, suppose you have a table describing events, with the following columns:

  • EmployeeID: the ID of the employee who triggered the event
  • ItemID: the ID of the item for which the event was triggered
  • EventDateTime: when the event occurred.

FIRST_VALUE / LAST_VALUE

These functions will get the first/last value within the specified PARTITION. “First” and “last” will be defined according to the ORDER specified in the OVER clause.

To get the ID of the employee who created an item (first event), you could use the following expression:

FIRST_VALUE(EmployeeID) OVER (PARTITION BY ItemID ORDER BY EventDateTime ASC)

The expression will return the value of EmployeeID for the row that has the lowest EventDateTime, among all rows having the same value for ItemID.

LAST_VALUE fits a similar role, but beware of the default window frame:

When no ROWS or RANGE clause has been specified in the OVER clause, RANGE UNBOUNDED PRECEDING AND CURRENT ROW is used.

Does anybody see any use for using LAST_VALUE with this window frame?
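If what you actually want is the value of the latest row over the whole partition, my understanding is that you have to spell the frame out. A sketch, reusing the events table above, that returns the employee who triggered the last event for an item:

LAST_VALUE(EmployeeID) OVER (
    PARTITION BY ItemID
    ORDER BY EventDateTime
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)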

LAG / LEAD

LAG and LEAD allow you to get information from the n-th row before or after the current “row”.

The following expression returns the date time of the next event for the current ItemID:

LEAD( EventDateTime, 1, GETDATE() )  OVER (PARTITION BY ItemID ORDER BY EventDateTime)

LAG and LEAD accept no window frame (no ROWS or RANGE clause, whether explicit or implicit).

Are these functions deterministic?

When you use these functions, remember the following: if the value you wish to return is not included in the ORDER BY clause, rows with different values may share the same ordering value. In that case, there is no guarantee which value will be returned.

Consider the following example:

WITH DATA (partitionID, orderingValue, Value) AS (
    SELECT 1, 100, 'VAL1' UNION ALL
    SELECT 1, 100, 'VAL2' UNION ALL
    SELECT 2, 100, 'VAL2' UNION ALL
    SELECT 2, 100, 'VAL1'
)
SELECT *
INTO #TMP
FROM DATA;

SELECT *
, FIRST_VALUE(Value) OVER (PARTITION BY partitionID ORDER BY orderingValue)
FROM #TMP;

CREATE CLUSTERED INDEX myTempIndex
ON #TMP (partitionID, orderingValue, Value);

SELECT *
, FIRST_VALUE(Value) OVER (PARTITION BY partitionID ORDER BY orderingValue)
FROM #TMP;

DROP INDEX myTempIndex ON #TMP;
DROP TABLE #TMP;

Creating a clustered index on the table changed the outcome of the query.

Even without changes to an index, I suspect a change in the execution plan may also lead to such changes.
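A hedged workaround: include the returned column in the ORDER BY, so that ties are broken deterministically. Run against the same #TMP table, before the DROP statements:

SELECT *
, FIRST_VALUE(Value) OVER (PARTITION BY partitionID ORDER BY orderingValue, Value)
FROM #TMP;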

Analytic functions in SQL Server 2012 – Part I – Distributions

In this article, I take a look at the analytic functions in SQL Server 2012 to handle frequency distributions.

CUME_DIST

The CUME_DIST function returns the percentage of records that are of same or lower order than the current record.

The expression:

CUME_DIST() OVER (ORDER BY MyValue)

is equivalent (neglecting precision) to:

1.0 * COUNT(*) OVER (ORDER BY MyValue
                     RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    / COUNT(*) OVER ()

(RANGE rather than ROWS, so that rows tied on MyValue are counted together, as CUME_DIST counts them.)

Possible scenario: calculate the percentage of households whose income is not greater than the current one.

PERCENT_RANK

The PERCENT_RANK function is similar to the CUME_DIST function.

The expression:

PERCENT_RANK() OVER (ORDER BY MyValue)

is equivalent to (neglecting integer division):

( RANK() OVER (ORDER BY MyValue) - 1 )
/ ( COUNT(*) OVER () - 1 )

Possible scenario: for each household, calculate the percentage of the other households that earn less than the current one.

PERCENTILE_DISC

Returns the smallest value such that CUME_DIST is equal to or greater than the provided probability.

PERCENTILE_DISC(0.4)
    WITHIN GROUP (ORDER BY MyValue ASC)
    OVER ()

A few remarks:

  • NULL values are ignored, although this is not the case for the CUME_DIST function;
  • The OVER clause is mandatory, although it may be empty;
  • No ORDER BY is allowed in the OVER clause; there is a specific WITHIN GROUP clause to specify the ordering of the partition.

Possible scenario: what is the income under which 10% of households fall?

PERCENTILE_CONT

This is an interpolated version of PERCENTILE_DISC. It shares the same syntax. The same remarks as above apply.
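To close, a small sketch comparing the four functions on toy data (the table and values are mine; expected outputs are noted in the comments):

WITH Households (Income) AS (
    SELECT v FROM (VALUES (10000), (20000), (20000), (50000)) AS t(v)
)
SELECT
    Income
    , CUME_DIST()    OVER (ORDER BY Income) AS CumeDist  -- 0.25, 0.75, 0.75, 1
    , PERCENT_RANK() OVER (ORDER BY Income) AS PctRank   -- 0, 1/3, 1/3, 1
    , PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Income) OVER () AS MedianDisc -- 20000
    , PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Income) OVER () AS MedianCont -- 20000
FROM Households;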

Fun facts about Transact-SQL syntax

Magic COUNT

The following query is valid:

SELECT COUNT(*)

It returns 1. Can someone see why?

GROUP BY nothing

The following group by clause is valid:

GROUP BY ()

It groups by “nothing”. A query with such a GROUP BY clause will only compile if its select list contains only aggregate expressions. As a result, there is hardly any point in using it, except maybe when you build your SQL dynamically.
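A quick sketch over a throwaway derived table; the result is the same single row you would get with no GROUP BY at all:

SELECT COUNT(*) AS NumberOfRows
FROM (VALUES (1), (2), (3)) AS t(x)
GROUP BY ();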

OVER the whole set

If you need aggregates over a complete set, you can use one of the following OVER clauses:

OVER (PARTITION BY 1)

OVER ()

ALL and ANY

ALL and ANY cannot be used on the left-hand side of a scalar comparison.

This expression is valid:

1 = ALL (SELECT 1)

This one is not:

ALL (SELECT 1) = 1
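A runnable sketch using the valid form as a predicate (SELECT without FROM is legal in T-SQL):

-- Returns one row, since 1 equals every value returned by the subquery.
SELECT 'ok' AS Result
WHERE 1 = ALL (SELECT 1);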

OVER clause in SQL Server 2012

Changes to the OVER clause

RANGE / ROWS clause

The new ROWS and RANGE keywords in SQL Server 2012 allow us to specify the window frame over which our expressions will be evaluated.

The following ROWS clause, for example, restricts a calculation to the current row and all previous rows, within the defined partition and for the provided ORDER BY clause:

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Note that with RANGE, CURRENT ROW actually means all rows in the current partition that have the same ordering value (as defined by the ORDER BY clause) as the current one; with ROWS, it means just the current row.

Self joins are no longer required to calculate running values.

As an example, the following query will return an event history from an event list. I considered a CRM system where cases/incidents/tickets can be created, closed and reopened. The raw event list should be transformed into an event history, displaying for each incident the periods over which it was opened.

WITH EventList AS (
    SELECT *
    FROM (
        VALUES
          (1, 'NEW',    CAST('2012-01-04 15:00:00' AS datetime))
        , (1, 'CLOSE',  CAST('2012-01-04 15:37:00' AS datetime))
        , (2, 'NEW',    CAST('2012-01-05 15:00:00' AS datetime))
        , (2, 'CLOSE',  CAST('2012-01-06 15:00:00' AS datetime))
        , (2, 'REOPEN', CAST('2012-01-07 15:00:00' AS datetime))
        , (2, 'CLOSE',  CAST('2012-01-09 15:00:00' AS datetime))
        , (1, 'REOPEN', CAST('2012-01-18 15:00:00' AS datetime))
    ) AS DATA(ID, EventCode, EventDate)
)
SELECT
    ID
    , EventCode
    , EventDate AS StartDate
    , MIN(CASE WHEN EventCode = 'CLOSE' THEN EventDate END)
        OVER (PARTITION BY ID ORDER BY EventDate
              ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
      AS EndDate
FROM EventList
Nice and simple.

A questionable design decision

Due to a questionable design decision, the behavior of the new extended OVER clause may be counter-intuitive.

Consider the expression:
COUNT(EventDate) OVER (PARTITION BY ID)

This gives you the number of rows for the current ID.

The following expression is now legal in SQL Server 2012 – earlier versions would have thrown an error (Incorrect syntax near 'order').

COUNT(EventDate)
OVER (PARTITION BY ID ORDER BY EventDate)

Since COUNT is an additive function – which implies the order of evaluation does not matter – I expected this expression to return exactly the same results as the previous one. Furthermore, I expected it to return exactly the same result regardless of the field specified in the ORDER BY clause.

However, if you use such an expression, you will find that SQL Server actually restricts the partition depending on the fields listed in the ORDER BY clause. The MSDN documentation is quite clear about this: “If ROWS/RANGE is not specified but ORDER BY is specified, RANGE UNBOUNDED PRECEDING AND CURRENT ROW is used as default for window frame.”
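To see the difference, a sketch reusing the EventList CTE from the previous section (the column aliases are mine):

SELECT
    ID
    , EventDate
    , COUNT(EventDate) OVER (PARTITION BY ID) AS CntWholePartition
    , COUNT(EventDate) OVER (PARTITION BY ID ORDER BY EventDate) AS CntRunning
FROM EventList;

CntRunning turns out to be a running count up to the current ordering value (ties included), not the partition total.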

If I provide no ORDER BY or ROWS/RANGE clause, SQL Server will assume RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

Why should I expect a different behavior, when I only explicitly modify the ORDER BY clause?

%d bloggers like this: