Why You Should Avoid Calculated Columns in Power BI
First off, I want to make it clear that I am not saying “never use calculated columns.” I am not. What I am saying is that they should generally be avoided, and there are several reasons why. Here I am going to focus on model size, and thus performance. Note that the data I am going to show is a contrived set built purely for demonstration purposes.
Power BI, like Power Pivot and SSAS Tabular models, compresses the data in its tables as it is loaded. It never uncompresses that data; all calculations are done on the compressed data. If you want to know how this works, I suggest you read Chapter 17 of The Definitive Guide to DAX, Second Edition. It does a thorough job of explaining how the VertiPaq engine works.
Here is the scenario. I have a 200M record table with three columns:
An [Index] column: a list of numbers from 0.5 to 100,000,000 in 0.5 increments. This has 200M unique values.
An [Integer] column that rounds those numbers up to whole numbers. It has 100M unique values: 0.5 and 1.0 from the first column both become 1, 1.5 and 2.0 become 2, and so on.
An [Even or Odd] column: a text value of “Even” or “Odd” based on the [Integer] column. This field has only 2 unique values.
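To make the shape of the data concrete, here is a rough DAX sketch of the first two columns at a much smaller scale - 2,000 rows instead of 200M - built as a calculated table. The table name “Demo” is just something I am using for illustration; in the real test these columns were imported, not generated in DAX. The [Even or Odd] column is covered next.

Demo =
SELECTCOLUMNS (
    GENERATESERIES ( 0.5, 1000, 0.5 ),  // [Index]: 0.5 increments, every value unique
    "Index", [Value],
    "Integer", ROUNDUP ( [Value], 0 )   // 0.5 and 1.0 both become 1, 1.5 and 2.0 become 2
)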
The column to pay attention to is the [Even or Odd] column. I created this column two different ways: first in Power Query, and then as a Calculated Column in DAX. This distinction is important.
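The Calculated Column version looks roughly like this (using my illustrative table name from the sketch above); the Power Query version is simply the equivalent custom column added before the data is loaded:

Even or Odd =
IF ( ISEVEN ( 'Demo'[Integer] ), "Even", "Odd" )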
All data in the model is compressed, but only imported data is taken into account when VertiPaq determines the optimal sort order for the best possible compression. So when I loaded the table with [Even or Odd] as an imported column, the VertiPaq engine considered that column when compressing the data. When I created it as a Calculated Column, it was still compressed, but only the [Index] and [Integer] columns were considered when choosing the sort order. Compression optimization happens at import and load. The table does not get re-sorted and re-compressed when you add new columns.
Using the VertiPaq Analyzer from DAX Studio, we can see how much data each of these columns consumes.
Focus on the Total Size column. There are only 2 values in the [Even or Odd] column - “Even” or “Odd.” Note: the scenario gets a bit more complex here because these text values create a dictionary, a further optimization I am not going to get into since the effect is the same for both tables - again, see The Definitive Guide to DAX for more on this.
On the bottom table, called “Imported”, the [Even or Odd] column consumes roughly 18KB. Remember, there are 200M records here. The VertiPaq engine looked at this column, saw that it had a cardinality of only 2, and sorted by that column first. In theory it could then represent the column as just two runs - “Even” for the first 100M records and “Odd” for the second 100M - and that is, in effect, what it did. At 18KB it is certainly more than 2 text values, but this gets deeper into how VertiPaq splits the data: this 200M record table is broken up into 191 segments in Power BI. Again, read the book.
Now look at the top table - Calculated Column. In this table, the VertiPaq engine only looked at the [Index] and [Integer] columns when optimizing for compression, and there isn’t a lot of compression going on here. The [Index] column isn’t compressed at all - there are 200M unique values, and at roughly 1.49GB it is about what you would expect for 200M uncompressed 8-byte values. The [Integer] column can be compressed, but not by much, as there are still 100M unique values, which is why it only shrinks to a bit more than half the size of the [Index] column - roughly 0.86GB vs 1.49GB.
But the [Even or Odd] column had to settle for whatever sort order and compression VertiPaq had chosen for the imported columns. As a result, even with only 2 unique values, it is almost 25MB in size, or 1,338 times larger than the imported version!
What does that mean in practical terms? As one example, if you do any calculations based on the [Even or Odd] column, the computer hosting the model has to hold and scan 25MB for the Calculated Column version vs 18KB for the imported version. Calculations will be faster on the smaller column. Again, this specific example is contrived, so the difference between scanning 18KB and 25MB will not be very large, or perhaps even noticeable.
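As a rough illustration, a measure like the one below (again using the illustrative names from the sketch above) has to scan whichever version of [Even or Odd] lives in the model - 18KB in one case, 25MB in the other:

Even Rows =
CALCULATE (
    COUNTROWS ( 'Demo' ),
    'Demo'[Even or Odd] = "Even"
)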
Note too that when a model is processed, the entire model is loaded into RAM. This is very different from how relational databases like SQL Server or Oracle work: they only load what they need from disk to do the necessary calculations. VertiPaq doesn’t work that way. It loads it all.
Now consider what happens if you add 10 more columns to each of these tables. Adding them as Calculated Columns in DAX will increase the size of the table far more than adding them in Power Query or in your data source.
And that is the point. Calculated Columns will increase the model size, which in turn increases the amount of RAM needed to hold the model, and the time it takes to run calculations on it.
There are other reasons to transform your data in Power Query or further upstream. In this example, I could have removed the [Integer] column from the table before loading, keeping only two columns. You cannot do that with DAX. You can add to the model, but you cannot take away without destroying data and breaking any Calculated Columns that depend on it. This is why I avoid at all costs any LEFT/RIGHT/MID logic in a Calculated Column. I get what I need in Power Query, then remove the source column. That saves space, refreshes are faster, and calculations are faster.
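For example, a Calculated Column like the hypothetical one below (the table and column names are mine, purely for illustration) chains the model to the [Full Product Code] column forever - remove that column and the Calculated Column breaks. The same LEFT logic done as a custom column in Power Query lets you drop the source column before it is ever loaded.

// Anti-pattern: this Calculated Column depends on [Full Product Code],
// so that source column can never be removed from the model
Product Prefix =
LEFT ( 'Product'[Full Product Code], 3 )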
So when should you use a Calculated Column? Here are the questions I ask myself:
Am I transforming data or adding data based on other columns? Do this in Power Query or earlier. This is often possible. The exceptions are when the DAX needs to look at multiple groupings of data via SUMMARIZE(), or at data in other tables. Power Query is inefficient here, but it may still be possible in an earlier source like SQL Server. This is especially important if you don’t need the original data you are transforming. You can delete unnecessary columns in Power Query when loading data; you cannot delete columns from the model without breaking Calculated Columns that rely on them.
Does this need to be in a slicer or filter? That has to be a column - it cannot be a measure. But if the column can be created in Power Query or earlier, do it there.
Is the formula really complex and won’t work in a measure? This usually isn’t a good reason. I’ve yet to see a DAX formula in a Calculated Column that could not be reworked into a measure. Calculated Columns have row context and not filter context. Measures are the opposite: they have filter context and not row context. This is where it becomes necessary to really understand evaluation context, how to create row context in a measure using iterators, and when to invoke context transition via CALCULATE. It can be done, it just requires some work - there is a sketch of this kind of rework after this list. Reach out for help in the Power BI community or talk to co-workers.
If the formula in a measure is inefficient, then a helper Calculated Column may be necessary. I had a particularly gnarly measure that used PERCENTILE.INC() three times over three groupings of data with SUMMARIZE(). The measure worked, but was horribly slow. PERCENTILE.INC() required the formula engine, and that was simply inefficient over 20M records as users filtered data. It was also impractical to do this in Power Query or the source systems, some of which were just CSV files. So I let a Calculated Column do the heavy lifting by moving all of the PERCENTILE.INC logic there, reducing the visual render time from over 8,000ms to under 500ms. But it was a last resort, not the first thing I reached for. A sketch of the shape of that change also follows this list.
Am I just developing right now? I use Calculated Columns every day during development. They make it super easy to analyze data and start working out the logic of the DAX code. But I make it a point to convert them to imported columns or measures and remove the Calculated Columns before I ever push a model to the Test environment.
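To show the kind of rework I mentioned above, here is a hedged sketch using a hypothetical Sales and Customer model - none of these table or column names come from the demo data. The per-row logic a Calculated Column would store can instead be created on the fly by an iterator, and CALCULATE handles context transition when the logic needs to summarize related rows:

-- What might be written as a Calculated Column on Sales (stored for every row at refresh)
Line Amount =
'Sales'[Quantity] * 'Sales'[Unit Price]

-- The same logic as a measure: SUMX creates the row context at query time, nothing is stored
Total Amount =
SUMX ( 'Sales', 'Sales'[Quantity] * 'Sales'[Unit Price] )

-- Where a Calculated Column on Customer would summarize Sales via context transition,
-- an iterator over the visible customers plus CALCULATE does the same thing in a measure
Avg Sales per Customer =
AVERAGEX (
    VALUES ( 'Customer'[Customer ID] ),
    CALCULATE ( SUMX ( 'Sales', 'Sales'[Quantity] * 'Sales'[Unit Price] ) )
)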
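And for the percentile example, a hedged sketch of the shape of that change looks something like this. The table, column, and grouping names are hypothetical, and the real measure used three percentiles over three groupings:

-- Slow pattern: the percentile over a grouped virtual table is recalculated by the
-- formula engine every time a user changes a filter
P75 of Group Totals =
PERCENTILEX.INC (
    SUMMARIZE ( 'Fact', 'Fact'[GroupKey] ),
    CALCULATE ( SUM ( 'Fact'[Amount] ) ),
    0.75
)

-- Helper Calculated Column: each row stores the 75th percentile for its own group,
-- computed once at refresh, so the visuals only read a stored value
P75 for Group =
CALCULATE (
    PERCENTILE.INC ( 'Fact'[Amount], 0.75 ),
    ALLEXCEPT ( 'Fact', 'Fact'[GroupKey] )
)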
Remember, Calculated Columns aren’t inherently bad. They are there for a reason. But they aren’t meant to be used like you’d use a column in Excel. Most of the time, I find they are not needed and have adverse impacts on the model, but there are definitely use cases for them and some of my models have them for very particular reasons. The key as a good Power BI developer is knowing when to use them and when to avoid them.