SQL Performance Tuning for Power BI Devs Who Don’t Know Database Architecture

Many Power BI developers come from an Excel background and have little to no SQL experience. Yet SQL databases are pervasive data sources, and there’s not always a database administrator available to help set up a proper data warehouse. This can put analysts and developers coming from the self-service side in a difficult situation, as they may not even be aware of their blind spots. Unknown unknowns are bad!

Meanwhile, the vast majority of content published about Power BI performance focuses on PBI internals with an emphasis on data modeling best practices and optimal DAX syntax. All of this is necessary but not sufficient as BI performance relies on end-to-end data pipelines, not just the Power BI caching and consumption endpoint.

Even if you have a database administrator available to help you, they might know SQL Server very well yet not have a lot of analytics experience, which comes with different optimization paths than transactional usage.

This entry gives an overview of SQL performance considerations and optimizations that are most relevant to Power BI, and to a large extent, other BI stacks. We’ll even talk about Microsoft Fabric briefly. The goal is to help people without a database architecture background identify bottlenecks at the source and either address them themselves, or at least be in a position to ask pointed questions that might speed up support from IT staff. “Can you please optimize the SQL database for Power BI” is nowhere near as actionable as “Let’s test our Power BI refresh times with and without non-clustered columnstore indexes on our main fact tables”. Guess which JIRA ticket is most likely to get done in a timely fashion?

Since there’s already a huge amount of excellent material out there, we’re going to use links and embedded videos abundantly rather than restate what has already been well covered elsewhere. Use this entry as a structured reading guide to get you started.

1. SQL Performance from Power BI’s Vantage Point

1.1. How Is SQL Relevant to Power BI Performance and BI Costs?

Let’s start from the desired end result then we’ll work backwards from it. Power BI has two storage modes:

  1. Import mode. The default, and most often the desirable choice, which caches data in the VertiPaq engine. In this case our concern is to refresh data from sources as fast and reliably as possible, either through a complete refresh, or better, using partitions (most often via incremental refresh, I’ll cover this in depth in a future separate entry).
  2. DirectQuery mode, where our concern is for live queries to be executed as fast as possible in the source. DQ effectively bypasses Power BI’s core value proposition and should be seen as a last resort solution when data volumes or refresh frequencies rule out Import, after exhausting Import optimizations and surgical programmatic options. Even then, DQ is best combined with Import using dual storage tables and aggregations.

In both cases, speed is the visible tip of the iceberg, but we may also be concerned with underlying costs, especially with modern on-demand cloud database-as-a-service platforms that separate storage from compute and charge based on usage.

1.2. You Can’t Improve What You Can’t Measure: Performance Tuning Tooling

While there are a lot of third-party tools for Power BI Desktop, for this entry’s purposes SQL Server Management Studio (SSMS) reigns supreme. It’s free but runs only on Windows, and cross-platform alternatives such as Azure Data Studio are good for querying but don’t include nearly the same amount of management and measurement features.

Microsoft, Either Trolling Deliberately…
Or Caring, But Not That Much

SSMS includes SQL Server Profiler, which can be used across SSAS, AAS, Power BI Premium, or against a local dataset in Power BI Desktop. Note that on one hand the official documentation states that “SQL Trace and SQL Server Profiler are deprecated”, yet on the other hand “SQL Server Profiler for Analysis Services workloads are supported”; there’s even an entire page about it.

Meanwhile Extended Events (xEvents) are supposed to replace SQL Trace/Server Profiler, but they only work with SSAS/AAS and not Power BI Premium. And yet, you can somehow use xEvents against Power BI Desktop. Make it make sense Microsoft!

You’ll also want to use Power BI Desktop’s built-in Performance Analyzer.

1.3. Measuring Import Refresh Performance

Before you dive into SQL optimization, start by identifying where your bottlenecks are. Premature or marginal optimization efforts are counterproductive distractions; you don’t want to chase milliseconds just for the sake of elegance. It’s possible, for instance, that your refreshes against a SQL source are slow because of a lack of query folding on the Power BI side.

These two entries break down what you have to do, read them then come back here to proceed with the SQL part:

Chris Webb has a good series of blog posts on how to keep track of your dataset refresh history.

1.4. Measuring DirectQuery Performance

I want to restate before you continue that you really need to assess whether you have to use DirectQuery at all, as you can expect that it will underperform Import in most cases, often significantly. Microsoft has an entire whitepaper on this topic, from which I’ve taken the chart below. If you’re going to use DirectQuery, you need to understand how it works and what you can and cannot do with it, otherwise you’re likely to deliver a subpar user experience to report consumers.

Import vs. DQ

2. SQL Performance Concepts & Concerns in a Nutshell

Let’s break down the key concepts the SQL novice needs to understand to optimize BI performance, ordered in a logical sequence so that you always have the required knowledge to move forward. This section is about SQL entities, not the SQL language, and if you’re already familiar with relational databases you probably won’t learn much here.

2.1. The Base Building Blocks: Tables, Queries, Views

Relational databases store tabular data, i.e. data is stored in tables made of columns describing the data, and rows populating it. A SQL query is an expression that instructs the server to manipulate database objects in various ways including creating, updating, deleting, and returning tables or subsets of tables. A query that returns a valid result set can be then saved as a virtual (or logical) table called a view to ease reuse.

A view is a stored query that acts as a filter on the underlying table(s); it performs just the same as running its underlying query against its source(s). Views are a development productivity tool, and in their virtual default state they don’t impact query performance. It’s a commonly accepted best practice to use views in BI applications rather than load data directly from the physical tables.
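To make this concrete, here’s a minimal sketch of a BI-facing view. The table and column names (`dbo.FactSales` and its columns) are hypothetical; the point is that the view exposes only what the dataset needs, giving you a stable contract you can refactor behind.

```sql
-- Hypothetical example: expose only the columns the Power BI dataset needs.
-- Renames and light filtering happen here rather than in Power Query.
CREATE VIEW dbo.vw_FactSales
AS
SELECT OrderDate,
       CustomerKey,
       ProductKey,
       SalesAmount
FROM dbo.FactSales
WHERE SalesAmount IS NOT NULL;
```

Loading `dbo.vw_FactSales` instead of the physical table costs nothing at query time, but lets you change the underlying schema later without breaking every downstream report.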

While you’re reading on SQL Server, you might run into mentions of stored procedures, batches of SQL statements that ease reuse and modularity. Among other things, you can create and modify views and tables with procedures, but let’s leave them out of this entry’s scope to ease your learning curve.

2.2. Indexes: The Foundation of Query Speed

Indexes can tremendously improve the read performance of a SQL table by saving the database engine from having to scan entire tables. They’re not necessarily appropriate for tables that need to be very frequently written to, but usually for analytics purposes we work from read-only copies. Here’s a good overview covering some of the main do’s and don’ts.

A table without a clustered index is dubbed a “heap”. Indexes come in various flavors:

  • Clustered and Non-Clustered. A clustered index defines the logical (but not always physical) stored sorting of a table, thus there can be only one per table. A non-clustered index doesn’t affect the physical table and is stored separately from it.
  • Rowstore and Columnstore. Rowstore is the traditional way to store data in relational databases, while Columnstore is optimized for analytics workloads.
  • Filtered, i.e. “an optimized disk-based rowstore nonclustered index especially suited to cover queries that select from a well-defined subset of data.”
  • Unique, i.e. an index that “ensures the index key columns do not contain any duplicate values”, i.e. it’s a constraint on insert/update operations and is not really relevant to our reading optimization considerations.
Introducing Columnstore indexes: made for BI

Here are a few items of note on indexes:

  • An Index Seek is usually faster than an Index Scan. Not always though, it’s pretty nuanced, and if you need a query that returns an entire table, well, the table needs to be scanned.
  • Clustered indexes tend to be faster than non-clustered ones, but again, that’s not always the case.
  • SQL Server automatically creates a unique clustered index on a table’s primary key, so if you want to create your own clustered index, you’ll have to delete the default one first.
  • Rowstore, columnstore, and unique indexes can all be of the clustered and non-clustered kinds.
  • You cannot ignore columns while defining a Clustered Columnstore index.
  • Columnstore indexes were actually introduced in SQL Server based on SSAS’s VertiPaq/xVelocity columnar compression approach (see VertiPaq vs. Columnstore).
  • Dedicated SQL pools in Azure Synapse automatically create a clustered columnstore index on tables that have no index options. The enhanced compute engine in Power BI Dataflows works the same way, and I believe that’s how the SQL engine embedded in Power BI datamarts also behaves.
  • Indexes are not free, they can take a lot of time to build.
And now the in-depth tests
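Here’s a hedged sketch of what creating the index types discussed above looks like in T-SQL. `dbo.FactSales` and its columns are hypothetical stand-ins for your own fact table; adapt the column lists to the queries your refreshes actually issue.

```sql
-- Nonclustered columnstore index: the usual starting point for BI reads.
-- List the columns your Power BI queries actually touch.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
ON dbo.FactSales (OrderDate, CustomerKey, ProductKey, SalesAmount);

-- Filtered rowstore index: covers queries against a well-defined subset,
-- here recent rows only (hypothetical cutoff date).
CREATE NONCLUSTERED INDEX IX_FactSales_Recent
ON dbo.FactSales (OrderDate)
INCLUDE (SalesAmount)
WHERE OrderDate >= '2023-01-01';
```

Build these on a dev copy first, then compare execution plans and refresh timings with and without them; an index that’s never hit is pure maintenance overhead.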

2.3. Materialized Views Provide Precomputing Efficiency

Views can be materialized, that is, the data returned by the view is persisted on disk. In other words, a materialized view is a self-updating table derived from other tables via a stored query. Pre-computing data ahead of its use remains a tried-and-true optimization technique, though it obviously comes with its own set of tradeoffs and costs. In fact, materialized views are typically used to precompute Power BI aggregations or their equivalent in competing platforms, such as Looker’s derived tables.

Microsoft calls them “indexed views” in T-SQL, though they eventually adopted the more common “materialized view” phrase in Azure Synapse (see also: Benefits of using materialized views).

Materialized views should be a significant component of your overall performance and cost optimization strategy, whether you’re using MS SQL or a cloud data warehouse/OLAP engine such as Snowflake, BigQuery, or ClickHouse. Implementation details and SQL syntax may differ, but the underlying patterns are fairly similar.

Views + Indexes = Doublegood
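As a sketch of the T-SQL mechanics (table and column names hypothetical): an indexed view requires `WITH SCHEMABINDING`, two-part table names, and a `COUNT_BIG(*)` column when aggregating; the unique clustered index is what actually materializes the result set on disk.

```sql
-- Pre-aggregate sales by product and month; SCHEMABINDING and COUNT_BIG(*)
-- are T-SQL requirements for indexed views.
CREATE VIEW dbo.vw_SalesByMonth
WITH SCHEMABINDING
AS
SELECT ProductKey,
       YEAR(OrderDate)  AS OrderYear,
       MONTH(OrderDate) AS OrderMonth,
       SUM(SalesAmount) AS TotalSales,
       COUNT_BIG(*)     AS RowCnt
FROM dbo.FactSales
GROUP BY ProductKey, YEAR(OrderDate), MONTH(OrderDate);
GO

-- This unique clustered index is what persists ("materializes") the view.
CREATE UNIQUE CLUSTERED INDEX CIX_vw_SalesByMonth
ON dbo.vw_SalesByMonth (ProductKey, OrderYear, OrderMonth);
```

A pre-aggregated view like this is a natural backing table for a Power BI aggregation: the heavy grouping runs once at the source instead of at every refresh or DirectQuery hit.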

2.4. Know What’s Going On with Execution Plans, Query Store, DMVs

The SQL Server database engine analyses any query it receives to determine the most efficient way to execute it, by a component called the Query Optimizer (QO). The QO not only reviews the statement, but also the database schema (tables and indexes) and database statistics. Its output is known as an execution plan. In SSMS you can review in graphical form:

  • estimated execution plans
  • actual execution plans
  • live query statistics

Reviewing, interpreting, and comparing execution plans is the primary way you’re going to be able to assess how your various queries and indexes perform in comparison to each other.

You won’t be able to do any meaningful work without this
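Alongside the graphical plans, a quick text-based way to compare two approaches is `SET STATISTICS IO/TIME`. A minimal sketch (the `dbo.FactSales` table is hypothetical):

```sql
-- Surface logical reads and CPU/elapsed time in the SSMS Messages tab,
-- so you can compare runs with and without a candidate index.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT ProductKey, SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales
WHERE OrderDate >= '2023-01-01'
GROUP BY ProductKey;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

A large drop in logical reads after adding an index is usually a better signal than wall-clock time alone, which fluctuates with caching and server load.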

Query Store was introduced with SQL Server 2016 and “automatically captures a history of queries, plans, and runtime statistics”. It is enabled by default in SQL Server 2022, while you have to turn it on in earlier versions. Where execution plans let you analyze queries one at a time, Query Store gives you server-wide query statistics, which you could even visualize… using Power BI!

2019 version (SQL 2022 has seen continued investment in this area)
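For pre-2022 versions, turning Query Store on and pulling the heaviest queries looks roughly like this (the `SalesDW` database name is a hypothetical placeholder):

```sql
-- Enable Query Store (on by default starting with SQL Server 2022).
ALTER DATABASE SalesDW SET QUERY_STORE = ON;

-- Top 10 queries by total duration, straight from the Query Store catalog views.
SELECT TOP (10)
       qt.query_sql_text,
       SUM(rs.count_executions) AS executions,
       SUM(rs.avg_duration * rs.count_executions) / 1000.0 AS total_duration_ms
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q
  ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan AS p
  ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs
  ON rs.plan_id = p.plan_id
GROUP BY qt.query_sql_text
ORDER BY total_duration_ms DESC;
```

Running this during or right after a Power BI refresh is a quick way to see which generated queries dominate refresh time.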

SQL Server has a bunch of other monitoring tools, a comparison table can be found here.

Dynamic Management Views (DMVs) are also a great source of insights about a SQL database. A lot of them pertain to administration concerns that are not strictly related to performance optimization, but some of them do cover queries, indexes, resource usage, and other matters of interest.

Way more useful than the other DMV (Department of Motor Vehicles)
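As an example of a performance-relevant DMV, `sys.dm_db_index_usage_stats` tells you whether your indexes are actually being read or merely maintained:

```sql
-- Per-index read vs. write activity since the last server restart.
-- Indexes with high user_updates but near-zero seeks/scans are candidates for removal.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name                   AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id
 AND i.index_id  = s.index_id
WHERE s.database_id = DB_ID()
ORDER BY s.user_updates DESC;
```

Note that these counters reset when the SQL Server instance restarts, so check the uptime before drawing conclusions from low numbers.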

2.5. Refresh Only New/Updated Data with Partitions

Partitions are relevant both on the SQL storage side and on the Power BI caching side. I’ll cover Power BI partitions in depth in a future entry. On the SQL side, read:

Incremental refresh’s friend
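A minimal sketch of SQL Server table partitioning, with hypothetical names and yearly boundaries, to show the three moving parts (function, scheme, table):

```sql
-- Partition function: yearly boundaries; RANGE RIGHT puts each boundary
-- date in the later partition.
CREATE PARTITION FUNCTION pf_OrderDate (date)
AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01', '2024-01-01');

-- Partition scheme: map all partitions to the same filegroup for simplicity.
CREATE PARTITION SCHEME ps_OrderDate
AS PARTITION pf_OrderDate ALL TO ([PRIMARY]);

-- A table created on the scheme is physically split by OrderDate.
CREATE TABLE dbo.FactSales_Partitioned
(
    OrderDate   date  NOT NULL,
    CustomerKey int   NOT NULL,
    SalesAmount money NOT NULL
) ON ps_OrderDate (OrderDate);
```

When the SQL partitions and the Power BI incremental-refresh partitions share the same date grain, each refresh only touches the partitions that actually changed.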

3. Putting Everything Together: SQL Optimization Recommendations for Power BI

Now that you’ve gone through the prerequisite course material, you’re equipped to understand recommendations and make sure they’re a fit for your own use cases. Start by reading: Improve Power BI Performance with SQL Server Indexing to see a set of practical tests and comparisons. Remember that you cannot take someone else’s benchmarks as the gospel for your own implementation. Everyone’s data is different and you’ll have to try out techniques to ensure they’re a good fit for you.

Here’s my own checklist to wrap things up:

  • Don’t neglect source performance, there’s more to Power BI than what’s happening within it.
  • Focus on the source tables that are real performance drains, don’t waste time optimizing tiny tables (typically under 100,000 rows or even 1M rows, depending on table structure).
  • Load views in Power BI datasets rather than tables. Consider materializing them unless you have very high refresh frequencies.
  • Use Power BI aggregations based on SQL materialized views.
  • Make sure you have indexes on larger tables. Nonclustered columnstore indexes are the best place to start. Consider using several indexes to cover various querying scenarios, but always keep cost/benefit in mind. Test that your indexes are actually hit by your refreshes and/or queries.
  • Align Power BI partitions with SQL partitions.
  • Measure before/after performance.
  • Don’t perform tests on your production source database and Power BI dataset! Look into deployment pipelines to handle dev/prod environments.
  • Do understand the key concepts in this entry, you can’t throw random experiments at the wall just to see what sticks.
  • Also keep in mind that the information presented here is simplified, and I’m far from a SQL expert myself. All this stuff can get fairly complicated and some people have dedicated an entire career to relational databases. I actually had to correct a couple of details where I had relayed common misconceptions after people on LinkedIn pointed them out.

4. What About Microsoft Fabric?

The core storage layer in Fabric is OneLake, a data lake shared across all workloads. OneLake doesn’t support column indexes since it’s object storage, not a relational database, and the Fabric data warehouse likewise doesn’t support columnstore indexes. The reasoning is that files stored in Parquet offer functionally equivalent compressed column storage. Since OneLake uses Delta Lake, one may expect support for Z-order indexes eventually, as found in Databricks.

5. Further Reading

(Links to Amazon include my affiliate link, this helps pay my blog’s hosting.)
