Pre-computing aggregates from granular data to improve live performance for end users by using computing resources at scheduled times is a well-established software pattern. In this entry we’ll look at how to set up aggregations across Microsoft business intelligence products. This functionality breaks the pattern of how most features have flown through the genealogy of Microsoft’s BI offering, so I thought it would be worth clarifying which one does what.
Most people reading this entry probably use Power BI, but performing a bit of software archeology to know how we got to the present state is always enlightening. The point of this entry is not to restate what’s already in the documentation, which I’ll point to repeatedly. Rather, this is an overview to help you navigate your options.
1. A Very Quick Intro to BI Aggregations
When OLAP cubes emerged at the end of the 90s as a performant pattern to manage multidimensional analysis of facts and measures, it became apparent that these systems were performing poorly when they had to do live queries at the highest grain from “tall” fact tables with many rows. One of the main solutions that emerged was to precompute and cache aggregates for cube cells at the (lower) grain of the cube’s dimensional leaves, vastly reducing the number of rows in the process. A typical example is to summarize transactions at the monthly or yearly level. This is well explained in further detail in Aggregations and Aggregation Designs. This approach is also described as a roll-up and is not unique to Microsoft, you’ll see it in other products such as Oracle BI Server.
Aggregations can be built on top of each other, meaning that a yearly aggregation can be easily calculated off a monthly one, instead of crunching the high grain fact table all over again. They’re hidden from end users since they’re a backend performing tuning tool that’s not meant for human consumption.
Just to rule out any misunderstanding, “aggregate functions” in tabular models have nothing to do with aggregations and refer instead to the “aggregate function to be used by reporting tools to summarize column values”, i.e. these are the summarizations you can choose in Power BI Desktop for implicit measures.
2. Aggregations in SQL Server Analysis Services Multidimensional
Many Power BI developers coming from the business side don’t know about the product’s roots and predecessors, but it’s to their loss as enterprise BI patterns haven’t changed that much over the past 20+ years. When Microsoft comes up with an excited announcement that Power BI has a “new” feature, especially on the Premium side, take it with a grain of salt as they may mean “new in Power BI” as opposed to “completely novel”. Case in point, aggregations had been in SSAS for a long time already by the time they were added to Power BI.
SSAS MD’s aggregations are tightly coupled to that platform’s underlying OLAP cubes. They can be set up using two tools:
- The Aggregation Design Wizard for the initial manual design
- The Usage Based Optimization Wizard for automated resource allocation based on query logs
Aggregation design requires the developer to find the sweet spot where the resources consumed by aggregation computation and storage are not out of line with the performance gains:
“The goal is to design the optimal number of aggregations. This number should not only provide satisfactory response time, but also prevent excessive partition size. A greater number of aggregations produces faster response times but it also requires more storage space and may take longer to compute. Moreover, as the wizard designs more and more aggregations, earlier aggregations produce considerably larger performance gains than later aggregations.”Designing Aggregations
There’s quite a lot going on here, including re-usable “aggregation designs”, the interaction of aggregations with partitions, and other advanced considerations beyond the scope of this overview.
3. Aggregations in SSAS/AAS Tabular
This section is going to be very short: aggregations do not exist in SSAS/AAS tabular models! This is a surprising break from the usual enterprise BI feature inheritance at Microsoft, which usually starts with SSAS MD, then has an equivalent in SSAS tabular, then makes its way into AAS and eventually into Power BI Premium. There are also some features that only exist in SSAS MD, but then they don’t exist anywhere else (e.g. there’s simply no writeback in tabular products). And there are some features that exist only in Power BI Premium. So this AS Tabular gap is a bit unusual.
Later in this entry we’ll talk about a programmatic option that can also be used for AS tabular models.
4. Aggregations in Power BI
Manual aggregations were introduced in September 2018 and went GA in July 2019. Automated aggregations followed in Public Preview in August 2021 and went GA in May 2022. They were one of the first steps to take Power BI beyond its initial self-service positioning to also scale to enterprise payloads.
4.1. User-Defined Aggregations for Pro and Premium
Of course aggregation tables are going to be stored in Import mode since the entire point is to leverage the in-memory cache. But a significant drawback of user-defined aggregations is that Microsoft forces you to have their underlying fact tables in Direct Query mode. Josh Caplan said back in 2019 that being able to keep detail tables in Import mode was “coming”, but it’s nowhere to be seen in 2023. Fear not, we’re about to introduce a workaround.
The user interface to manage aggregations does offer an easy way to define aggregation tables with their summarization (Sum, GroupBy, Avg etc.), source detail table, and detail column (dimension key or fact to be aggregated). However I find that the required hodgepodge of Direct Query and Dual Storage constraints is not the most pleasant data modeling exercise, even if composite models themselves are a great idea.
Bear in mind that you also have to generate an aggregated data source, either upstream of the dataset – typically in a data warehouse’s SQL view or stored procedure – or within the dataset via Power Query, a simpler shortcut that I wouldn’t recommend for production. So in practice these user-defined aggregations are not always an end-to-end no-code solution, but it’s easy enough to get started. If you have a data warehouse, then you’ll have to spend some time architecting and optimizing how your aggregate queries are processed and possibly materialized. Here’s an example in Azure Synapse dedicated pools where the right table distribution method has a dramatic impact on performance.
In production you’ll have to take into account the potential compute costs and refresh times of your aggregate sources and where you want that compute/storage to occur (DWH vs. Power BI dataflows vs. Power BI dataset).
The initial setup aside, the end result is great, as this “aggregate awareness” means that DAX measures built on top of detail fact tables will automatically hit the aggregation tables instead whenever possible. In other words, you don’t have to rewrite any DAX to benefit from aggregations.
For bigger models you might want to “stack” aggregations of various grains on top of each other, which is done via their precedence setting. You’ll want to test to make sure you’ve properly set up your model so that you’re actually hitting the aggregations using tools such as DAX Studio or SQL Performance Analyser.
4.2. Automatic Aggregations for Premium
Automatic aggregations are a newer addition to Power BI Premium (i.e. not available in Pro) that reminds me of SSAS MD’s usage-based automation wizard, since they derive from query usage which aggregations are worth computing. Of course in the 2020s this has to be marketed as fancy “machine learning training”! And it is a bit of a black box compared to SSAS. I haven’t used them and haven’t heard much about their real-world value, feel free to comment below if you’ve had exposure to them.
5. Programmatic Alternatives
Michael Kovalsky, a Microsoft employee who’s worked on some their large internal data models, came up with an interesting way to roll your own aggregations in a semi-automated way. He wrote AutoAggs C# scripts in Tabular Editor to create aggregation tables and their partitions, tables, and relationships, then making DAX measures aware of the aggregations. Michael also created a user interface dubbed Agg Wizard with a CLI option that can be triggered from cloud orchestration tools such as ADF of Azure Functions. You’ll still have to write your source SQL queries.
This makes aggregations available to SSAS/AAS Tabular and allows the use of Import storage mode for underlying fact tables. And it’s not going to consume compute resources unlike the official automatic aggregations. The main drawback of his approach is that it’s limited to Power BI Premium because of its XMLA endpoint dependency. There are a few other foibles such as the fact partitions of the base fact table must be of ‘provider-type’ flavor, i.e. defined with a SQL query which excludes the M expression created by default by Power BI (since any table must have at least one partition) or calculated tables.
Overall, I like this approach a lot even though it’s not the officially sanctioned path, but it’s clearly more involved. I have a growing interest in programmatic dataset management and am currently testing these tools. Speaking of which, the TMSL/TOM scripting options for the official user-defined and automatic aggs are very limited at the moment, and while they’re supposed to be on the roadmap since mid-2022, there’s no public timeline for doing so. Another Coming Soon™ situation…
6. Aggregation Patterns & Further Reading
If you’d like to dive deeper into aggregations, including some non-conventional patterns, read Phil Seamark’s blog post series:
- Creative Aggs (2019) – look in particular at the “shadow model” approach to keep fact tables in Import mode
- Power BI Aggregations (2021 – work in progress unfortunately stalled)
A final note is that Power BI aggregations can be defined using relationships or GroupBys, as explained in Architecting Aggregations in PowerBI with Databricks SQL. By default you’ll probably leverage relationships as that conforms with the general ethos of using star schemas in Power BI, but GroupBy has its uses depending on source data structure, typically when you have degenerate dimensions in your fact table rather than separate dimensions).
This frames my conclusion: aggregations are a core performance optimization technique that has to be seen as a moving part of your data architecture and modeling, as they deeply integrate with compute/storage trade-off considerations – i.e. where and when to calculate aggregate measures – and are typically either part of OLAP cube or tabular star schema design.