SQL Troubles: aggregation

Showing posts with label aggregation. Show all posts

01 March 2025

💎🤖💫SQL Reloaded: Copilot Stories (Part VI: Queries based on Dynamics 365 for Finance & Operations)

As part of ad-hoc queries, SSRS/Paginated Reports and Data Warehouse solutions, during the past 15 years I wrote hundreds of queries based on Dynamics AX, respectively on Dynamics 365 (D365) for Finance, Supply Chain, Production, Human Resources and a few other modules. So, given that the documentation for the D365 schema is available online, I've been wondering how easy would be to write a basic query using M365 Copilot. A relatively simple example would be the join between the Products (InventTable) and Prices (InventTableModule) tables, a quite simple thing to do (the joins were discussed in here).

Disclaimer: The below code was generated by Copilot via prompts used for demonstration purposes!

The first attempt proved to be far from the expectations:

Prompt:
Write a query that joins InventTable and InventTableModule to retrieve the Item Number, Description and the Item Prices from Dynamics 365 for Finance. Include tables' schema and rewrite the query by placing the comma first.

Output:

SELECT 
    , IT.ItemId AS [Item Number]
    , IT.ItemName AS [Description]
    , ITM.Price AS [Item Price]
FROM 
    dbo.InventTable IT
JOIN 
    dbo.InventTableModule ITM 
    ON IT.ItemId = ITM.ItemId 
    AND IT.DataAreaId = ITM.DataAreaId
WHERE 
    ITM.ModuleType = 0 -- Assuming 0 represents the sales module, adjust as necessary

A first version of the query didn't included the join on DataAreaId. Another problem is that compared with the previous versions, in D365 the ItemName is not available in the InventTable anymore while the ItemName was incorrectly considered as Description. The differences between the various versions of D365 are minimal, though the differences between D365 and AX 2009 are considerable with important changes due to redesign.

Prompt:
Rewrite the query to provide the other existing prices via a grouping.

Output:

SELECT 
    , IT.ItemId AS [Item Number]
    , IT.ItemName AS [Description]
    , STRING_AGG(CAST(ITM.Price AS NVARCHAR), ', ') AS [Item Prices]
FROM 
    dbo.InventTable IT
JOIN 
    dbo.InventTableModule ITM 
    ON IT.ItemId = ITM.ItemId 
    AND IT.DataAreaId = ITM.DataAreaId
WHERE 
    ITM.ModuleType = 0 -- Assuming 0 represents the sales module, adjust as necessary
GROUP BY 
    IT.ItemId, IT.ItemName

The generated query shows all the prices as a string, solution which might work if we change the delimiter, though is not an optimal solution at all.

Prompt:
Rewrite the query to provide the other existing individual prices via a grouping in separate fields.

Output:

SELECT 
    , IT.ItemId AS [Item Number]
    , IT.ItemName AS [Description]
    , MAX(CASE WHEN ITM.ModuleType = 0 THEN ITM.Price END) AS [Sales Price]
    , MAX(CASE WHEN ITM.ModuleType = 1 THEN ITM.Price END) AS [Purchase Price]
    , MAX(CASE WHEN ITM.ModuleType = 2 THEN ITM.Price END) AS [Cost Price]
FROM 
    dbo.InventTable IT
JOIN 
    dbo.InventTableModule ITM 
    ON IT.ItemId = ITM.ItemId 
    AND IT.DataAreaId = ITM.DataAreaId
GROUP BY 
    IT.ItemId, IT.ItemName

This version is closer to the original query I wrote some time ago:

-- Item prices via group by
SELECT ITD.DataAreaId 
, ITD.ItemId  
, ITD.InventPrice
, ITD.InventUnitId
, ITD.PurchPrice
, ITD.PurchUnitId
, ITD.SalesPrice
, ITD.SalesUnitId
FROM dbo.InventTable ITM
     LEFT JOIN (-- price details
     SELECT ITD.ItemId
     , ITD.DataAreaId 
     , Max(CASE ITD.ModuleType WHEN 0 THEN ITD.Price END) InventPrice
     , Max(CASE ITD.ModuleType WHEN 0 THEN ITD.UnitId END) InventUnitId
     , Max(CASE ITD.ModuleType WHEN 1 THEN ITD.Price END) PurchPrice
     , Max(CASE ITD.ModuleType WHEN 1 THEN ITD.UnitId END) PurchUnitId
     , Max(CASE ITD.ModuleType WHEN 2 THEN ITD.Price END) SalesPrice
     , Max(CASE ITD.ModuleType WHEN 2 THEN ITD.UnitId END) SalesUnitId
     FROM dbo.InventTableModule ITD
     GROUP BY ITD.ItemId
     , ITD.DataAreaId 
    ) ITD
       ON ITD.ItemId = ITM.ItemId
      AND ITD.DataAreaId = ITM.DataAreaId
WHERE ITD.DataAreaID = 'XYZ'
ORDER BY ITD.ItemId

Prompt:
Rewrite the query to provide the other existing individual prices via a grouping in separate fields.

Output:

SELECT 
    , IT.ItemId AS [Item Number]
    , IT.ItemName AS [Description]
    , Sales.Price AS [Sales Price]
    , Purchase.Price AS [Purchase Price]
    , Cost.Price AS [Cost Price]
FROM 
    dbo.InventTable IT
LEFT JOIN 
    dbo.InventTableModule Sales 
    ON IT.ItemId = Sales.ItemId 
    AND IT.DataAreaId = Sales.DataAreaId 
    AND Sales.ModuleType = 0 -- Assuming 0 represents the sales module
LEFT JOIN 
    dbo.InventTableModule Purchase 
    ON IT.ItemId = Purchase.ItemId 
    AND IT.DataAreaId = Purchase.DataAreaId 
    AND Purchase.ModuleType = 1 -- Assuming 1 represents the purchase module
LEFT JOIN 
    dbo.InventTableModule Cost 
    ON IT.ItemId = Cost.ItemId 
    AND IT.DataAreaId = Cost.DataAreaId 
    AND Cost.ModuleType = 2 -- Assuming 2 represents the cost module

The structure of the query looks promising, though same in the previous query Copilot got the uses for ModuleType wrong (see below)! And here's the query I wrote previously:

-- Item Prices without grouping
SELECT ITM.DataAreaId 
, ITM.ItemId 
, ILP.UnitId InventUnitId 
, ILP.Price InventPrice 
, IPP.UnitId PruchaseUnitId
, IPP.Price PurchasePrice
, ISP.UnitId SalesUnitId
, ISP.Price SalesPrice
FROM dbo.InventTable ITM
      LEFT JOIN dbo.InventTableModule ILP
        ON ITM.ItemId = ILP.ItemId
       AND ITM.DataAreaId = ILP.DataAreaId
       AND ILP.ModuleType = 0 -- Warehouse
      LEFT JOIN dbo.InventTableModule IPP
        ON ITM.ItemId = IPP.ItemId
       AND ITM.DataAreaId = IPP.DatareaId 
       AND IPP.ModuleType = 1 -- Purchases
      LEFT JOIN dbo.InventTableModule ISP
        ON ITM.ItemId = ISP.ItemId
       AND ITM.DataAreaId = ISP.DataAreaId 
       AND ISP.ModuleType = 2 -- Sales	
WHERE ITM.DataAreaId = 'XYZ'

Probably the Copilot for Dynamics 365 Finance and Operations [1] works much better than the one from M365. Unfortunately, I don't have access to it yet! Also, if I would invest more time in the prompts the results would be closer to the queries I wrote. It depends also on the goal(s) considered - build a skeleton on which to build the logic, respectively generate the final query via the prompts. Probably the 80-20 rule applies here as well.

Frankly, for a person not knowing the D365 data model, the final queries generated are a good point to start (at least for searching for more information on the web) as long Copilot got the prompts right. It will be also interesting to see how business rules related to specific business processes (including customizations) will be handled. The future looks bright (at least for the ones still having a job in the field)!

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Microsoft Learn: Overview of Copilot capabilities in finance and operations apps [link]

15 February 2025

🧭Business Intelligence: Perspectives (Part 27: A Tale of Two Cities II)

Business Intelligence Series

There’s a saying that applies to many contexts ranging from software engineering to data analysis and visualization related solutions: "fools rush in where angels fear to tread" [1]. Much earlier, an adage attributed to Confucius provides a similar perspective: "do not try to rush things; ignore matters of minor advantage". Ignoring these advices, there's the drive in rapid prototyping to jump in with both feet forward without checking first how solid the ground is, often even without having adequate experience in the field. That’s understandable to some degree – people want to see progress and value fast, without building a foundation or getting an understanding of what’s happening, respectively possible, often ignoring the full extent of the problems.

A prototype helps to bring the requirements closer to what’s intended to achieve, though, as the practice often shows, the gap between the initial steps and the final solutions require many iterations, sometimes even too many for making a solution cost-effective. There’s almost always a tradeoff between costs and quality, respectively time and scope. Sooner or later, one must compromise somewhere in between even if the solution is not optimal. The fuzzier the requirements and what’s achievable with a set of data, the harder it gets to find the sweet spot.

Even if people understand the steps, constraints and further aspects of a process relatively easily, making sense of the data generated by it, respectively using the respective data to optimize the process can take a considerable effort. There’s a chain of tradeoffs and constraints that apply to a certain situation in each context, that makes it challenging to always find optimal solutions. Moreover, optimal local solutions don’t necessarily provide the optimum effect when one looks at the broader context of the problems. Further on, even if one brought a process under control, it doesn’t necessarily mean that the process works efficiently.

This is the broader context in which data analysis and visualization topics need to be placed to build useful solutions, to make a sensible difference in one’s job. Especially when the data and processes look numb, one needs to find the perspectives that lead to useful information, respectively knowledge. It’s not realistic to expect to find new insight in any set of data. As experience often proves, insight is rarer than finding gold nuggets. Probably, the most important aspect in gold mining is to know where to look, though it also requires luck, research, the proper use of tools, effort, and probably much more.

One of the problems in working with data is that usually data is analyzed and visualized in aggregates at different levels, often without identifying and depicting the factors that determine why data take certain shapes. Even if a well-suited set of dimensions is defined for data analysis, data are usually still considered in aggregate. Having the possibility to change between aggregates and details is quintessential for data’s understanding, or at least for getting an understanding of what's happening in the various processes.

There is one aspect of data modeling, respectively analysis and visualization that’s typically ignored in BI initiatives – process-wise there is usually data which is not available and approximating the respective values to some degree is often far from the optimal solution. Of course, there’s often a tradeoff between effort and value, though the actual value can be quantified only when gathering enough data for a thorough first analysis. It may also happen that the only benefit is getting a deeper understanding of certain aspects of the processes, respectively business. Occasionally, this price may look high, though searching for cost-effective solutions is part of the job!

Previous Post <<||>> Next Post

References:
[1] Alexander Pope (cca. 1711) An Essay on Criticism

09 February 2025

🌌🏭KQL Reloaded: First Steps (Part XI: Window Functions)

Window functions are one of the powerful features available in RDBMS as they allow to operate across several lines or a result set and return a result for each line. The good news is that KQL supports several window functions, which allow to address several scenarios. However, the support is limited and needs improvement.

One of the basic scenarios that takes advantage of windows functions is the creation of a running sum or creating a (dense) rank across a whole dataset. For this purposes, in Kusto one can use the row_cumsum for running sum, respectively the row_number, row_rank_dense and row_rank_min for ranking. In addition, one can refer to the values of a field from previous (prev) and next record.

// rank window functions within dataset (not partitioned)
NewSales
| where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
| where CustomerKey in (12617, 12618)
| project CustomerKey, DateKey, ProductName, TotalCost
| sort by CustomerKey asc, DateKey asc 
| extend Rank1 = row_rank_dense(DateKey)
    , Rank2 = row_number(0)
    , Rank3 = row_rank_min(DateKey)
    , Sum = row_cumsum(TotalCost)
    , NextDate = next(DateKey)
    , PrevDate = prev(DateKey)

Often, it's needed to operate only inside of a partition and not across the whole dataset. Some of the functions provide additional parameters for this:

// rank window functions within partitions
NewSales
| where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
| where CustomerKey in (12617, 12618)
| project CustomerKey, DateKey, ProductName, TotalCost
| sort by CustomerKey asc, DateKey asc 
| extend RowRank1 = row_rank_dense(DateKey, prev(CustomerKey) != CustomerKey)
    , RowRank2 = row_number(0, prev(CustomerKey) != CustomerKey)
    , Sum = row_cumsum(TotalCost, prev(CustomerKey) != CustomerKey)
    , NextDate = iif(CustomerKey == next(CustomerKey), next(DateKey), datetime(null))
    , PrevDate = iif(CustomerKey == prev(CustomerKey), prev(DateKey),  datetime(null))

In addition, the partitions can be defined explicitly via the partition operator:

// creating explicit partitions
NewSales
| where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
| where CustomerKey in (12617, 12618)
| project CustomerKey, DateKey, ProductName, TotalCost
| partition by CustomerKey
(
    order by DateKey asc
    | extend prev_cost = prev(TotalCost, 1)
)
| order by CustomerKey asc, DateKey asc
| extend DifferenceCost = TotalCost - prev_cost

It will be interesting to see whether Microsoft plans for the introduction of further window functions in KQL to bridge the gap. The experience proved that such functions are quite useful in data analytics, sometimes developers needing to go a long extent for achieving the respective behavior (see visual calcs in Power BI). For example, SQL Server leverages several types of such functions, though it took Microsoft more than several versions to this make progress. More over, developers can introduce their own functions, even if this involves .Net programming.

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Kusto: Window functions overview [link]

05 February 2025

🌌🏭KQL Reloaded: First Steps (Part IV: Left, Right, Anti-Joins and Unions)

In a standard scenario there is a fact table and multiple dimension table (see previous post), though one can look at the same data from multiple perspectives. In KQL it's recommended to start with the fact table, however in some reports one needs records from the dimension table independently whether there are any records in the fact tale.

For example, it would be useful to show all the products, independently whether they were sold or not. It's what the below query does via a RIGHT JOIN between the fact table and the Customer dimension:

// totals by customer via right join
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by CustomerKey
| join kind=rightouter (
    Customers
    | where RegionCountryName in ('Canada', 'Australia')
    | project CustomerKey, RegionCountryName, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| project RegionCountryName, CustomerName, TotalCost
//| summarize record_count = count() 
| order by CustomerName asc

The Product details can be added then via a LEFT JOIN between the fact table and the Product dimension:

// total by customer via right join with product information
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost)
    , FirstPurchaseDate = min(DateKey)
    , LastPurchaseDate = max(DateKey) by CustomerKey, ProductKey
| join kind=rightouter (
    Customers
    | where RegionCountryName in ('Canada', 'Australia')
    | project CustomerKey, RegionCountryName, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| join kind=leftouter (
    Products 
    | where  ProductCategoryName == 'TV and Video'
    | project ProductKey, ProductName 
    )
    on ProductKey
| project RegionCountryName, CustomerName, ProductName, TotalCost, FirstPurchaseDate, LastPurchaseDate, record_count
//| where record_count>1 // multiple purchases
| order by CustomerName asc, ProductName asc

These kind of queries need adequate validation and for this it might be needed to restructure the queries.

// validating the multiple records (defailed)
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| lookup Products on ProductKey
| lookup Customers on CustomerKey 
| where  FirstName == 'Alexandra' and LastName == 'Sanders'
| project CustomerName = strcat(FirstName, ' ', LastName), ProductName, TotalCost, DateKey, ProductKey, CustomerKey 

// validating the multiple records against the fact table (detailed)
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| where CustomerKey == 14912

Mixing RIGHT and LEFT joins in this way increases the complexity of the queries and sometimes comes with a burden for validating the logic. In SQL one could prefer to start with the Customer table, add the summary data and the other dimensions. In this way one can do a total count individually starting with the Customer table and adding each join which reviewing the record count for each change.

In special scenario instead of a RIGHT JOIN one could use a FULL JOIN and add the Customers without any orders via a UNION. In some scenarios this approach can offer even a better performance. For this approach, one needs to get the Customers without Orders via an anti-join, more exactly a rightanti:

// customers without orders
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost)
    , FirstPurchaseDate = min(DateKey)
, LastPurchaseDate = max(DateKey) by CustomerKey, ProductKey
| join kind=rightanti (
    Customers
    | project CustomerKey, RegionCountryName, CustomerName = strcat(FirstName, ' ', LastName)
    )
    on CustomerKey
| project RegionCountryName, CustomerName, ProductName = '', TotalCost = 0
//| summarize count()
| order by CustomerName asc, ProductName asc

And now, joining the Customers with orders with the ones without orders gives an overview of all the customers:

// total by product via lookup with table alias
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize record_count = count()
    , TotalCost = sum(TotalCost) by CustomerKey //, ProductKey 
//| lookup Products on ProductKey 
| lookup Customers on CustomerKey
| project RegionCountryName
    , CustomerName = strcat(FirstName, ' ', LastName)
    //, ProductName
    , TotalCost
//| summarize count()
| union withsource=SourceTable kind=outer (
    // customers without orders
    NewSales
    | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
    | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
    | summarize record_count = count()
        , TotalCost = sum(TotalCost)
        , FirstPurchaseDate = min(DateKey)
        , LastPurchaseDate = max(DateKey) by CustomerKey
        //, ProductKey
    | join kind=rightanti (
        Customers
        | project CustomerKey
            , RegionCountryName
            , CustomerName = strcat(FirstName, ' ', LastName)
        )
        on CustomerKey
    | project RegionCountryName
        , CustomerName
        //, ProductName = ''
        , TotalCost = 0
    //| summarize count()
)
//| summarize count()
| order by CustomerName asc
//, ProductName asc

And, of course, the number of records returned by the three queries must match. The information related to the Product were left out for this version, though it can be added as needed. Unfortunately, there are queries more complex than this, which makes the queries more difficult to read, understand and troubleshoot. Inline views could be useful to structure the logic as needed.

let T_customers_with_orders = view () { 
    NewSales
    | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
    | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
    | summarize record_count = count()
        , TotalCost = sum(TotalCost) 
        by CustomerKey
         //, ProductKey 
         //, DateKey
    //| lookup Products on ProductKey 
    | lookup Customers on CustomerKey
    | project RegionCountryName
        , CustomerName = strcat(FirstName, ' ', LastName)
        //, ProductName
        //, ProductCategoryName
        , TotalCost
        //, DateKey
};
let T_customers_without_orders = view () { 
   // customers without orders
    NewSales
    | where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
    | where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
    | summarize record_count = count()
        , TotalCost = sum(TotalCost)
        , FirstPurchaseDate = min(DateKey)
        , LastPurchaseDate = max(DateKey) by CustomerKey
        //, ProductKey
    | join kind=rightanti (
        Customers
        | project CustomerKey
            , RegionCountryName
            , CustomerName = strcat(FirstName, ' ', LastName)
        )
        on CustomerKey
    | project RegionCountryName
        , CustomerName
        //, ProductName = ''
        , TotalCost = 0
    //| summarize count()
};
T_customers_with_orders
| union withsource=SourceTable kind=outer T_customers_without_orders
| summarize count()

In this way the queries should be easier to troubleshoot and restructure the logic in more manageable pieces.

It would be useful to save the definition of the view (aka stored view), however it's not possible to create views on the machine on which the tests are run:

"Principal 'aaduser=...' is not authorized to write database 'ContosoSales'."

Happy coding!

Previous Post <<||>> Next Post

08 November 2018

🔭Data Science: Aggregation (Just the Quotes)

"Statistics may be defined as numerical statements of facts by means of which large aggregates are analyzed, the relations of individual units to their groups are ascertained, comparisons are made between groups, and continuous records are maintained for comparative purposes." (Melvin T Copeland. "Statistical Methods" [in: Harvard Business Studies, Vol. III, Ed. by Melvin T Copeland, 1917])

"The null hypothesis of no difference has been judged to be no longer a sound or fruitful basis for statistical investigation. […] Significance tests do not provide the information that scientists need, and, furthermore, they are not the most effective method for analyzing and summarizing data." (Cherry A Clark, "Hypothesis Testing in Relation to Statistical Methodology", Review of Educational Research Vol. 33, 1963)

"[…] fitting lines to relationships between variables is often a useful and powerful method of summarizing a set of data. Regression analysis fits naturally with the development of causal explanations, simply because the research worker must, at a minimum, know what he or she is seeking to explain." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Fitting lines to relationships between variables is the major tool of data analysis. Fitted lines often effectively summarize the data and, by doing so, help communicate the analytic results to others. Estimating a fitted line is also the first step in squeezing further information from the data." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Modern data graphics can do much more than simply substitute for small statistical tables. At their best, graphics are instruments for reasoning about quantitative information. Often the most effective way to describe, explore, and summarize a set of numbers even a very large set - is to look at pictures of those numbers. Furthermore, of all methods for analyzing and communicating statistical information, well-designed data graphics are usually the simplest and at the same time the most powerful." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"Without meaningful data there can be no meaningful analysis. The interpretation of any data set must be based upon the context of those data. Unfortunately, much of the data reported to executives today are aggregated and summed over so many different operating units and processes that they cannot be said to have any context except a historical one - they were all collected during the same time period. While this may be rational with monetary figures, it can be devastating to other types of data." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Ockham's Razor in statistical analysis is used implicitly when models are embedded in richer models -for example, when testing the adequacy of a linear model by incorporating a quadratic term. If the coefficient of the quadratic term is not significant, it is dropped and the linear model is assumed to summarize the data adequately." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Every number has its limitations; every number is a product of choices that inevitably involve compromise. Statistics are intended to help us summarize, to get an overview of part of the world’s complexity. But some information is always sacrificed in the process of choosing what will be counted and how. Something is, in short, always missing. In evaluating statistics, we should not forget what has been lost, if only because this helps us understand what we still have." (Joel Best, "More Damned Lies and Statistics: How numbers confuse public issues", 2004)

"Data often arrive in raw form, as long lists of numbers. In this case your job is to summarize the data in a way that captures its essence and conveys its meaning. This can be done numerically, with measures such as the average and standard deviation, or graphically. At other times you find data already in summarized form; in this case you must understand what the summary is telling, and what it is not telling, and then interpret the information for your readers or viewers." (Charles Livingston & Paul Voakes, "Working with Numbers and Statistics: A handbook for journalists", 2005)

"Whereas regression is about attempting to specify the underlying relationship that summarises a set of paired data, correlation is about assessing the strength of that relationship. Where there is a very close match between the scatter of points and the regression line, correlation is said to be 'strong' or 'high' . Where the points are widely scattered, the correlation is said to be 'weak' or 'low'." (Alan Graham, "Developing Thinking in Statistics", 2006)

"Numerical precision should be consistent throughout and summary statistics such as means and standard deviations should not have more than one extra decimal place (or significant digit) compared to the raw data. Spurious precision should be avoided although when certain measures are to be used for further calculations or when presenting the results of analyses, greater precision may sometimes be appropriate." (Jenny Freeman et al, "How to Display Data", 2008)

"Graphical displays are often constructed to place principal focus on the individual observations in a dataset, and this is particularly helpful in identifying both the typical positions of data points and unusual or influential cases. However, in many investigations, principal interest lies in identifying the nature of underlying trends and relationships between variables, and so it is often helpful to enhance graphical displays in ways which give deeper insight into these features. This can be very beneficial both for small datasets, where variation can obscure underlying patterns, and large datasets, where the volume of data is so large that effective representation inevitably involves suitable summaries." (Adrian W Bowman, "Smoothing Techniques for Visualisation" [in "Handbook of Data Visualization"], 2008)

"There are three possible reasons for [the] absence of predictive power. First, it is possible that the models are misspecified. Second, it is possible that the model’s explanatory factors are measured at too high a level of aggregation [...] Third, [...] the search for statistically significant relationships may not be the strategy best suited for evaluating our model’s ability to explain real world events [...] the lack of predictive power is the result of too much emphasis having been placed on finding statistically significant variables, which may be overdetermined. Statistical significance is generally a flawed way to prune variables in regression models [...] Statistically significant variables may actually degrade the predictive accuracy of a model [...] [By using] models that are constructed on the basis of pruning undertaken with the shears of statistical significance, it is quite possible that we are winnowing our models away from predictive accuracy." (Michael D Ward et al, "The perils of policy by p-value: predicting civil conflicts" Journal of Peace Research 47, 2010)

"In order to be effective a descriptive statistic has to make sense - it has to distill some essential characteristic of the data into a value that is both appropriate and understandable. […] the justification for computing any given statistic must come from the nature of the data themselves - it cannot come from the arithmetic, nor can it come from the statistic. If the data are a meaningless collection of values, then the summary statistics will also be meaningless - no arithmetic operation can magically create meaning out of nonsense. Therefore, the meaning of any statistic has to come from the context for the data, while the appropriateness of any statistic will depend upon the use we intend to make of that statistic." (Donald J Wheeler, "Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"[...] things that seem hopelessly random and unpredictable when viewed in isolation often turn out to be lawful and predictable when viewed in aggregate." (Steven Strogatz, "The Joy of X: A Guided Tour of Mathematics, from One to Infinity", 2012)

"In general, when building statistical models, we must not forget that the aim is to understand something about the real world. Or predict, choose an action, make a decision, summarize evidence, and so on, but always about the real world, not an abstract mathematical world: our models are not the reality - a point well made by George Box in his oft-cited remark that "all models are wrong, but some are useful". (David Hand, "Wonderful examples, but let's not close our eyes", Statistical Science 29, 2014)

"Decision trees are also discriminative models. Decision trees are induced by recursively partitioning the feature space into regions belonging to the different classes, and consequently they define a decision boundary by aggregating the neighboring regions belonging to the same class. Decision tree model ensembles based on bagging and boosting are also discriminative models." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies", 2015)

"Just as with aggregated data, an average is a summary statistic that can tell you something about the data - but it is only one metric, and oftentimes a deceiving one at that. By taking all of the data and boiling it down to one value, an average (and other summary statistics) may imply that all of the underlying data is the same, even when it’s not." (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"At very small time scales, the motion of a particle is more like a random walk, as it gets jostled about by discrete collisions with water molecules. But virtually any random movement on small time scales will give rise to Brownian motion on large time scales, just so long as the motion is unbiased. This is because of the Central Limit Theorem, which tells us that the aggregate of many small, independent motions will be normally distributed." (Field Cady, "The Data Science Handbook", 2017)

"The most accurate but least interpretable form of data presentation is to make a table, showing every single value. But it is difficult or impossible for most people to detect patterns and trends in such data, and so we rely on graphs and charts. Graphs come in two broad types: Either they represent every data point visually (as in a scatter plot) or they implement a form of data reduction in which we summarize the data, looking, for example, only at means or medians." (Daniel J Levitin, "Weaponized Lies", 2017)

"Again, classical statistics only summarizes data, so it does not provide even a language for asking [a counterfactual] question. Causal inference provides a notation and, more importantly, offers a solution. As with predicting the effect of interventions [...], in many cases we can emulate human retrospective thinking with an algorithm that takes what we know about the observed world and produces an answer about the counterfactual world." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"While the individual man is an insoluble puzzle, in the aggregate he becomes a mathematical certainty. You can, for example, never foretell what anyone man will be up to, but you can say with precision what an average number will be up to. Individuals vary, but percentages remain constant. So says the statistician." (Sir Arthur C Doyle)

SQL Troubles

Pages

01 March 2025

💎🤖💫SQL Reloaded: Copilot Stories (Part VI: Queries based on Dynamics 365 for Finance & Operations)

15 February 2025

🧭Business Intelligence: Perspectives (Part 27: A Tale of Two Cities II)

09 February 2025

🌌🏭KQL Reloaded: First Steps (Part XI: Window Functions)

05 February 2025

🌌🏭KQL Reloaded: First Steps (Part IV: Left, Right, Anti-Joins and Unions)

08 November 2018

🔭Data Science: Aggregation (Just the Quotes)

About Me