A recent example in this direction in KQL are the dcount and dcountif functions, which according to the documentation calculate an estimate of the number of distinct values taken by a scalar expression in the summary group. An estimate is not the actual number of distinct values; accuracy is traded for performance. The following piece of code illustrates this best:
// counting records
NewSales
| summarize record_count = count() // total number of records available
    , aprox_distinct_count = dcount(CustomerKey) // estimated number of distinct values
    , distinct_count = count_distinct(CustomerKey) // actual number of distinct values
    , aprox_distict_count_by_value = dcountif(CustomerKey, SalesAmount <> 0) // estimated distinct values with nonzero amounts
    , distict_count_by_value = count_distinctif(CustomerKey, SalesAmount <> 0) // distinct values with nonzero amounts
    , aprox_distict_count_by_value2 = dcountif(CustomerKey, SalesAmount == 0) // estimated distinct values with zero amounts
    , distict_count_by_value2 = count_distinctif(CustomerKey, SalesAmount == 0) // distinct values with zero amounts
| extend error_aprox_distinct_count = distinct_count - aprox_distinct_count
    , error_aprox_distict_count_by_value = distict_count_by_value - aprox_distict_count_by_value
Output:
record_count | aprox_distinct_count | distinct_count | aprox_distict_count_by_value | distict_count_by_value | aprox_distict_count_by_value2 | distict_count_by_value2
2832193 | 18497 | 18484 | 18497 | 18484 | 10251 | 10219

error_aprox_distinct_count | error_aprox_distict_count_by_value
-13 | -13
It's interesting that the same difference is observable also when a narrower time interval is chosen (e.g. one month). When using estimates it's important to understand how big the error between the actual value and the estimate is, and that's the purpose of the last two lines added to the query. In many scenarios the difference might be negligible, until it isn't.
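The absolute difference can also be turned into a relative error, which is usually easier to judge across datasets of different sizes. Below is a minimal sketch reusing the NewSales table from above (the column names are illustrative):

// expressing the estimation error as a percentage of the actual distinct count
NewSales
| summarize distinct_count = count_distinct(CustomerKey)
    , aprox_distinct_count = dcount(CustomerKey)
| extend error_absolute = distinct_count - aprox_distinct_count
    , error_percent = iif(distinct_count <> 0
        , round(100.00 * abs(distinct_count - aprox_distinct_count) / distinct_count, 4)
        , 0.00)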
One can wonder whether the two functions are deterministic, in other words whether they return the same results when given the same input values; a quick way to probe this is sketched right after this paragraph. It would also be useful to understand the performance of the two approximate functions, especially when further constraints are applied.
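A minimal sketch of such a check, evaluating dcount twice over the same input and placing the two results side by side (a matching pair of values is only an indication of determinism, not a proof):

// evaluating the estimate twice over the same input to see whether repeated runs agree
let run1 = NewSales | summarize aprox_distinct_count = dcount(CustomerKey) | extend run = 1;
let run2 = NewSales | summarize aprox_distinct_count = dcount(CustomerKey) | extend run = 2;
union run1, run2
| order by run asc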
Moreover, the functions accept an additional accuracy parameter (the second parameter of dcount, respectively the third of dcountif), which allows control over the trade-off between speed and accuracy (see the query and output below).
// counting records
NewSales
| summarize record_count = count() // total number of records available
    , distinct_count = count_distinct(CustomerKey) // actual number of distinct values
    , aprox_distinct_count = dcount(CustomerKey) // estimated distinct values (default accuracy)
    , aprox_distinct_count0 = dcount(CustomerKey, 0) // accuracy level 0
    , aprox_distinct_count1 = dcount(CustomerKey, 1) // accuracy level 1 (default)
    , aprox_distinct_count2 = dcount(CustomerKey, 2) // accuracy level 2
    , aprox_distinct_count3 = dcount(CustomerKey, 3) // accuracy level 3
    , aprox_distinct_count4 = dcount(CustomerKey, 4) // accuracy level 4
Output:
record_count | distinct_count | aprox_distinct_count | aprox_distinct_count0 | aprox_distinct_count1 | aprox_distinct_count2 | aprox_distinct_count3 | aprox_distinct_count4
2832193 | 18484 | 18497 | 18793 | 18497 | 18500 | 18470 | 18487
It will be interesting to see which of these parameters are used in practice. The problems usually start when different approximation parameters are used interchangeably with no previous agreement. How could one argue in favor of one parameter over the others?
A natural question: how big is the error introduced by each parameter? Usually, when approximating values, one also needs to specify the expected error somehow. The documentation provides some guiding values, though are these values enough? Would similar estimate functions make sense also for the other aggregate functions?
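One way to get a feeling for the error of each accuracy level is to compare every level against count_distinct in a single query, along the lines of the example above. A minimal sketch (the resulting error values will of course vary from dataset to dataset):

// quantifying the error introduced by each accuracy level
NewSales
| summarize distinct_count = count_distinct(CustomerKey) // actual number of distinct values
    , aprox_distinct_count0 = dcount(CustomerKey, 0)
    , aprox_distinct_count1 = dcount(CustomerKey, 1)
    , aprox_distinct_count2 = dcount(CustomerKey, 2)
    , aprox_distinct_count3 = dcount(CustomerKey, 3)
    , aprox_distinct_count4 = dcount(CustomerKey, 4)
| extend error0 = distinct_count - aprox_distinct_count0
    , error1 = distinct_count - aprox_distinct_count1
    , error2 = distinct_count - aprox_distinct_count2
    , error3 = distinct_count - aprox_distinct_count3
    , error4 = distinct_count - aprox_distinct_count4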
By contrast, count_distinct and count_distinctif seem to be still in preview, with all the consequences derived from this. They are supposed to be more resource-intensive than their approximate counterparts. On the other hand, the values returned can still be rounded in dashboards up to the meaningful unit (e.g. thousands), which usually depends on the context. The same question about rounding can be asked for the approximate counterparts, and it would be interesting to check how far apart the rounded values produced by the two sets of functions are.
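A minimal sketch of such a comparison, rounding both counts to thousands (the unit is an assumption here; pick whatever is meaningful for the dashboard at hand):

// rounding the exact and the estimated distinct counts to thousands
// to see whether the difference between them still matters at dashboard granularity
NewSales
| summarize distinct_count = count_distinct(CustomerKey)
    , aprox_distinct_count = dcount(CustomerKey)
| extend distinct_count_k = round(distinct_count / 1000.0, 1) // exact count, in thousands
    , aprox_distinct_count_k = round(aprox_distinct_count / 1000.0, 1) // estimate, in thousands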
In practice, counting is useful for calculating percentages (e.g. how many customers come from a certain zone compared to the total), which are often easier to grasp than big absolute numbers:
// calculating percentages from totals
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >= datetime(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey)
    , count_customers_US = count_distinctif(CustomerKey, RegionCountryName == 'United States')
    , count_customers_CA = count_distinctif(CustomerKey, RegionCountryName == 'Canada')
    , count_customers_other = count_distinctif(CustomerKey, not(RegionCountryName in ('United States', 'Canada')))
| extend percent_customers_US = iif(count_customers <> 0, round(100.00 * count_customers_US/count_customers, 2), 0.00)
    , percent_customers_CA = iif(count_customers <> 0, round(100.00 * count_customers_CA/count_customers, 2), 0.00)
    , percent_customers_other = iif(count_customers <> 0, round(100.00 * count_customers_other/count_customers, 2), 0.00)
Output:
count_customers | count_customers_US | count_customers_CA | count_customers_other | percent_customers_US | percent_customers_CA | percent_customers_other
10317 | 3912 | 789 | 5616 | 37.92 | 7.65 | 54.43
Note:
When showing percentages it's important to also provide the "context", i.e. the actual counts or amounts, so that the reader can understand the scale behind the percentages.
Happy coding!