Microsoft Data Explorer–a review

As you are probably aware, Microsoft released a Preview version of the Data Explorer add-in. It is currently available for Excel 2010 and Excel 2013. Hopefully, this will still be true for the release version (and not only for volume licensing customers).

Mashing up data from different sources has never been the great strength of Excel. Sure, it was possible, but it implied a mix of queries, formulas, and macros that often led to crappy, convoluted workbooks. Data Explorer allows users to create data mash-ups from within Excel, using a single interface and a consistent expression language.

The ribbon

The interface is good, superior to the one provided in PowerPivot, or DQS (other MS tools that allow business users to actively contribute to the information system.)

The seamless integration with Excel is a huge advantage over similar tools (say, Google Refine).

Since the alpha version, things have changed substantially. The Excel add-in now comes with a ribbon, which is clean and efficient, as well as custom task panes in the Modern UI (aka Windows 8) style.

image

You first connect to a source, then shape your query right away. Many sources are available, including tables in the current workbook. You can also reference Data Explorer queries in other queries. This means queries can be re-used. You do not have to rewrite the same query every time. The Merge and Append options allow you to do joins and unions.

Data Explorer accepts hierarchical data sources, not just flat ones. In simple terms, a cell in a column can contain a table. In the query editor, you can navigate to a nested table by double-clicking it. The interface provides a consistent experience across all types of sources.
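
To make the idea of a table sitting inside a cell more concrete, here is a small Python sketch. It is not M and says nothing about how Data Explorer works internally. The `customers` data and the `expand` helper are made up for illustration.

```python
# Hypothetical data: each customer row carries a nested "orders" table.
customers = [
    {"name": "Ann", "orders": [{"item": "pen", "qty": 2}, {"item": "ink", "qty": 1}]},
    {"name": "Bob", "orders": [{"item": "pad", "qty": 5}]},
]


def expand(rows, column):
    """Flatten a nested-table column: one output row per nested row."""
    flat = []
    for row in rows:
        outer = {k: v for k, v in row.items() if k != column}
        for inner in row[column]:
            flat.append({**outer, **inner})
    return flat


# "Navigating" to a nested table amounts to picking the value of one cell.
bobs_orders = customers[1]["orders"]

# Expanding the column yields a flat table again.
flat_orders = expand(customers, "orders")
```

Here `bobs_orders` plays the role of the table you reach by double-clicking a cell, and `flat_orders` is the flat equivalent of the whole hierarchy.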

The editor

The editor was revamped and streamlined. It is intuitive and works smoothly.

image

The ribbon is gone, which is not a bad thing. The formula bar allows you to type expressions, using the “M” language that Data Explorer uses (more on that later). At this time, there is no IntelliSense-like feature, and no function wizard like the one you may know from Excel.

However, you can do many things from the interface, without ever writing an expression: you can filter, group and transform data in columns … by right-clicking the right object and choosing the relevant transformation. The generated expression can be seen (and edited) in the formula bar.

Each transformation you apply creates a new step in your query. You can rearrange your query by moving steps up or down (except the Source step which you cannot rename or move): did you want to summarize your data, and then filter instead of filtering then summarizing? No problem. Just move the relevant step up or down.

This has a few advantages:

  • The workflow is natural. You can edit one transformation at a time.
  • For every step, you get visual feedback on the transformation you just applied.
  • This allows you to keep your queries tidy and avoid unreadable nested calculations.

When you write your own expressions, you can choose between chaining multiple steps or nesting/combining expressions. The only thing you should actually be worried about is whether your decision will make the workflow easier to read.
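
The trade-off between chaining and nesting, and the effect of moving a step, can be sketched in Python. This is only an analogy: the `sales` rows and the `summarize` and `keep_large` helpers are hypothetical, and a real query would use M table functions instead.

```python
from collections import defaultdict

# Hypothetical source rows.
sales = [
    {"region": "North", "amount": 40},
    {"region": "North", "amount": 70},
    {"region": "South", "amount": 30},
    {"region": "South", "amount": 20},
]


def summarize(rows):
    """Total amount per region."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return [{"region": r, "amount": a} for r, a in totals.items()]


def keep_large(rows, threshold=50):
    """Keep rows whose amount is at least the threshold."""
    return [row for row in rows if row["amount"] >= threshold]


# Chained: one named step at a time, each intermediate result can be inspected.
source = sales
filtered = keep_large(source)
filter_then_summarize = summarize(filtered)

# Nested: the same computation written as a single expression.
nested = summarize(keep_large(sales))

# Moving the filter after the summary asks a different question.
summarize_then_filter = keep_large(summarize(sales))
```

The chained and nested forms return the same table, so the choice really is about readability. Swapping the two steps, on the other hand, changes the result: filtering first keeps only the single large sale, while summarizing first keeps every region whose total reaches the threshold.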

You can rename queries from the designer, but also rename each step. As far as I know, this is the only way to document steps in a query.

Adding calculated columns to a query might require you to write expressions. In many use cases, this will be easy. However, some common calculations (dates, string manipulations, …) require a call to a specific function. For the moment, due to the state of the documentation, finding the function you need requires extra work.

I hope Microsoft will not forget to improve the documentation once the product gets released. The basic documentation could use some formatting. The language specification documents will probably be too dense for most users.

The expression language

Data Explorer uses the “M” language, which is expression-based.

This means the language is about writing expressions (formulas), just like you would in Excel or DAX. From what I could take from the documentation and some quick experiments, the language is actually extremely powerful.

A bunch of people at Microsoft Research are very much into functional programming: if you liked LINQ and F#, then you will like the M language. (Microsoft once had a project called Oslo which featured a language called “M”. Is it the same? I could not recognize it.)

The language supports different types of values: primitive values (string, numbers, …) but also lists, records, tables, and functions.

This means an expression can return a list, or records, or tables, or functions. A Data Explorer query can return a function, or a table of functions. You can apply fold operations on lists, define functions within a query, create functions that return functions, or accept functions as parameters … The language supports closures, recursion …

  • Return a list of numbers:
    Source := { 1, 2, 3 }
  • Return numbers from 1 to 10:
    Source := { 1 .. 10 }
  • Return a record with A, B, C where C is a calculated field:
    Source := [ A = 1, B = 2, C = A + B ]
  • Return a function, and apply it to a value:
    Source := (x) => x + 1
    InvokedSource := Source(12)
  • Return a string that displays numbers from 1 to 10, separated by commas:
    Source := Text.RemoveRange(List.Accumulate({ 1 .. 10 }, "", (state, current) => state & ", " & Text.From(current)), 0, 2)
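
For readers who do not have the add-in at hand, the examples above map roughly onto the following Python. This is a loose analogy written for this post, not a translation tool: the variable names are mine, and Python dictionaries cannot refer to their own fields the way an M record can.

```python
from functools import reduce

numbers = [1, 2, 3]                  # a list of numbers
one_to_ten = list(range(1, 11))      # numbers from 1 to 10

# A record whose field C is computed from A and B.
# A and B are bound first because a dict literal cannot reference itself.
a, b = 1, 2
record = {"A": a, "B": b, "C": a + b}


def increment(x):
    """Counterpart of the function (x) => x + 1."""
    return x + 1


invoked = increment(12)

# reduce plays the role of the fold: start from an empty string, append
# ", " and the current number at each step, then drop the leading ", ".
joined = reduce(
    lambda state, current: state + ", " + str(current),
    one_to_ten,
    "",
)[2:]
```

The last expression shows why the M version removes two characters at position 0: the fold produces a leading separator that has to be stripped.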

You could create a query that takes a document as a source, parses the document and returns a function!

What you will be able to do with the language is absolutely huge. However, as I said before, most use cases will not require a full command of the language.

Conclusion

Data Explorer is not Excel: even if you can do a lot without knowing much of the language, you may need to learn new functions.

However, thanks to an intuitive interface and a powerful expression language, the tool might appeal to a large audience, from the slightly advanced Excel user to the R addict.

I am not certain it will appeal to all types of developers, though:

  • I do not expect every C# developer to be a fan of functional programming.
  • For the moment, Data Explorer is only available in Excel.

On the other hand, this will not be a problem if Data Explorer queries (transformations) can be published as data services, or re-used within SSIS packages.

This leads to the following concern: The alpha version was very much about publishing your queries to the cloud. However, I could not find any reference to this scenario, whether in the ribbon or in the documentation. I hope this is an oversight on my part and this feature will be supported.

Data Explorer comes as a bunch of .NET DLLs and an interface written in HTML/JavaScript. I really hope the guys at Microsoft will document the libraries, and provide an official entry point so that developers can add custom functionality to the query designer, or build custom apps on top of the engine.

Finally, Data Explorer might even be too good to just be an Excel add-in. In the future, I could very well imagine never using an Excel formula again (at least if queries could be automatically updated upon data entry.) That summarizes how impressed I was with the product.

What’s new for Excel 2013 – Personal observations

The customer preview of Office 2013 is finally available, and there are a lot of new things to look at, be it for users or for developers.

Here are a few personal observations based on my first look at the product.

  • PowerPivot for Office 2013 now comes bundled with Excel, making Office On Demand, the evolution of the Click-To-Run technology, much more viable.
  • The PowerPivot add-in is no longer required to build simple data models and pivot tables based on several data tables. Measures, KPIs and some other features will require the add-in to be activated.
  • The message “To use multiple table in your analysis, a new PivotTable needs to be created based on the Data Model.” hints at the fact that old-school pivot tables are not dead yet and still are the default PivotTable type.
  • New table objects in your workbook will automatically be added to your data model.
  • You can create a pivot table based on a simple range and then transform the pivot table into a “PowerPivot model”. The initial range will not be transformed into a table object. Further ranges can be added through the Connections manager. (What for?)
  • The data model can now be accessed from VBA. The corresponding object is called Model (not DataModel), despite what the MSDN documentation mentions about it.
  • The PowerPivot add-in will not recognize tables that were added directly to the data model from Excel as linked tables. However, a new connection will be created in Excel, so that refactoring can be done from the Connections manager. A corresponding object called WorksheetDataConnection is available in VBA.
  • You can create two worksheet data connections for the same range. Duplicating ranges for scenarios where the same data must be used twice (for example, when a dimension must play different roles) is no longer required.
  • The function FILTERXML allows you to query an XML document with XPath. The WEBSERVICE function returns an XML text from a web service. ENCODEURL is a helper function to encode a string into a valid URL. All three functions are available in VBA through the WorksheetFunction object.
  • The function FILTERXML returns an array. You can use it in array formulas, or in conjunction with the INDEX function, for example.
  • New functions like SHEET and SHEETS are available, although it is still unclear to me in which scenarios they will be useful, since no specific function takes a sheet index as a parameter.
  • ISFORMULA and FORMULATEXT are also new in Excel 2013.
  • Examples in the help system are provided as embedded Excel Web App workbooks. This may solve some translation issues that have occurred in the past for non-English Office versions.
  • Internet Explorer is used to navigate the VBA help. You can now alt-tab between the VBA environment and the documentation.
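
To illustrate what the FILTERXML and ENCODEURL items above are about, here is a rough Python analogy. It does not call Excel. The XML text is a made-up stand-in for a web service response, `filter_xml` is a hypothetical helper, and Python's ElementTree only supports a subset of XPath, so expressions may differ from what Excel accepts.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Hypothetical XML, standing in for text returned by a web service.
xml_text = (
    "<feed>"
    "<item><title>First</title></item>"
    "<item><title>Second</title></item>"
    "</feed>"
)


def filter_xml(xml, xpath):
    """Return the text of every node matching the XPath, as a list."""
    root = ET.fromstring(xml)
    return [node.text for node in root.findall(xpath)]


# Like FILTERXML, the helper returns several values at once.
titles = filter_xml(xml_text, "./item/title")

# Picking one element, similar in spirit to wrapping the call in INDEX
# (note that Python indexes from 0, INDEX from 1).
second_title = titles[1]

# Percent-encode a string so it can be embedded in a URL.
encoded = quote("squash court booking", safe="")
```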

Common queries: Grouping / aggregating over several columns.

I was looking for common SQL queries which could be used as a test battery for business intelligence tools.

I found quite a few examples on the following website http://www.artfulsoftware.com/infotree/queries.php and decided to take a look at them.

Two query examples had the same concern.

1). Suppose you have the following table that summarizes squash court bookings.

image

You want each member of the club to pay half the fee, if he only appears in one column (member1 or member2), or to pay the full price, if he appears in both columns.
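
The rule can be restated as "each of the two columns is worth half of the fee", which the following Python sketch makes explicit. The bookings are invented (the real table is only available as an image), and the `fee` column name is an assumption.

```python
from collections import defaultdict

# Hypothetical bookings; "fee" is an assumed column name.
bookings = [
    {"member1": "Ann", "member2": "Bob", "fee": 10.0},
    {"member1": "Cid", "member2": "Cid", "fee": 8.0},  # same member in both columns
    {"member1": "Bob", "member2": "Ann", "fee": 4.0},
]


def fees_per_member(rows):
    """Sum what each member owes across all bookings."""
    due = defaultdict(float)
    for row in rows:
        # Each column contributes half of the fee, so a member who
        # appears in both columns ends up paying the full price.
        due[row["member1"]] += row["fee"] / 2
        due[row["member2"]] += row["fee"] / 2
    return dict(due)


fees = fees_per_member(bookings)
```

With this sample, Ann and Bob each owe half of two bookings, while Cid owes the full fee of the booking where the name fills both columns.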

2). Suppose you have the following teams and games tables.

image

image

You want to get this output:

image

For each team, you want to count the number of games played (if the team appears in the column team1 or team2 from the games table) and number of games won (if the team appears in the team1 column and score1 > score2 or it appears in the team2 column and score1 < score2 ), lost, drawn and the number of goals scored, …

Writing a SQL query to do this is not hard. One may go through a UNION operation and then summarize the data accordingly, as proposed on the site mentioned above, but I find the following query to be closer to the natural formulation of the problem:

SELECT
    T.name
    , SUM(CASE
        WHEN T.id = team1 AND score1 > score2 THEN 1
        WHEN T.id = team2 AND score1 < score2 THEN 1
        ELSE 0
      END) AS WON
    , SUM(CASE
        WHEN T.id = team1 AND score1 = score2 THEN 1
        WHEN T.id = team2 AND score1 = score2 THEN 1
        ELSE 0
      END) AS DRAW
    … and so on
FROM teams AS T
LEFT JOIN games AS G
    ON T.id = G.team1
    OR T.id = G.team2
GROUP BY T.name
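
To try the query without setting up a database server, it can be run against an in-memory SQLite database from Python. This is a minimal sketch under a few assumptions: the team names and scores are invented, only the WON and DRAW columns written out above are included, and an ORDER BY clause is added so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical sample data; the real tables are only shown as images.
conn.executescript("""
    CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE games (team1 INTEGER, team2 INTEGER, score1 INTEGER, score2 INTEGER);
    INSERT INTO teams VALUES (1, 'Lions'), (2, 'Tigers'), (3, 'Bears');
    INSERT INTO games VALUES (1, 2, 3, 1), (2, 1, 2, 2), (2, 3, 0, 1);
""")

query = """
    SELECT
        T.name
        , SUM(CASE
            WHEN T.id = team1 AND score1 > score2 THEN 1
            WHEN T.id = team2 AND score1 < score2 THEN 1
            ELSE 0
          END) AS WON
        , SUM(CASE
            WHEN T.id = team1 AND score1 = score2 THEN 1
            WHEN T.id = team2 AND score1 = score2 THEN 1
            ELSE 0
          END) AS DRAW
    FROM teams AS T
    LEFT JOIN games AS G
        ON T.id = G.team1
        OR T.id = G.team2
    GROUP BY T.name
    ORDER BY T.name  -- added here only to make the row order deterministic
"""

# One (name, won, draw) tuple per team.
results = conn.execute(query).fetchall()
conn.close()
```

Because of the OR in the join condition, each game is matched once for each of its two teams, which is what lets a single pass count results for both columns.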

If you had no separate table for the teams, you would have to replace the teams table in the previous query with a sub-query, something along the lines of: (SELECT teamname1 FROM games UNION SELECT teamname2 FROM games).

Using Excel, building such a summary table is easy, provided you have the list of team names you are interested in. Unfortunately, there is no easy way to retrieve a list of unique values from both columns out of the box. Of course, this can still be done with some VBA to automate the process, but this is not native. Please note that the Google Docs spreadsheet offers the nice UNIQUE function, which returns a list of distinct values from a range, but the function will not work for values spread across 2 columns. Pivot tables will not work in that case either, and they do not support many-to-many relationships anyway. This is true for the standard pivot table, and it is also true for PowerPivot V1. (V2 will offer some support for overriding a relationship with an expression.)
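
For comparison, the missing piece (distinct values taken from two columns at once) is a one-liner in a general-purpose language. The Python sketch below uses invented rows, and it only illustrates the operation, not something Excel can do natively.

```python
# Hypothetical games rows holding team names directly.
games = [
    {"team1": "Lions", "team2": "Tigers"},
    {"team1": "Tigers", "team2": "Lions"},
    {"team1": "Tigers", "team2": "Bears"},
]

# Collect values from both columns into a set, which removes duplicates
# much like UNION does in SQL, then sort for a stable order.
team_names = sorted({row[col] for row in games for col in ("team1", "team2")})
```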

Now, what I would like to see in a BI system is the ability to support such queries, as well as support for building them with simple drag-and-drop operations. If you know of a tool with these features, please let me know.