20 November 2022

Data Science: Qualitative vs Quantitative (Just the Quotes)

"To us […] the only acceptable point of view appears to be the one that recognizes both sides of reality - the quantitative and the qualitative, the physical and the psychical - as compatible with each other, and can embrace them simultaneously […] It would be most satisfactory of all if physis and psyche (i.e., matter and mind) could be seen as complementary aspects of the same reality." (Wolfgang Pauli', "The Influence of Archetypal Ideas on the Scientific Theories of Kepler", [Lecture at the Psychological Club of Zurich], 1948)

"A model is a qualitative or quantitative representation of a process or endeavor that shows the effects of those factors which are significant for the purposes being considered. A model may be pictorial, descriptive, qualitative, or generally approximate in nature; or it may be mathematical and quantitative in nature and reasonably precise. It is important that effective means for modeling be understood such as analog, stochastic, procedural, scheduling, flow chart, schematic, and block diagrams." (Harold Chestnut, "Systems Engineering Tools", 1965)

"As is used in connection with systems engineering, a model is a qualitative or quantitative representation of a process or endeavor that shows the effects of those factors which are significant for the purposes being considered. Modeling is the process of making a model. Although the model may not represent the actual phenomenon in all respects, it does describe the essential inputs, outputs, and internal characteristics, as well as provide an indication of environmental conditions similar to those of actual equipment." (Harold Chestnut, "Systems Engineering Tools", 1965)

"In the long run, qualitative changes always outweigh quantitative ones. Quantitative predictions of economic and social trends are made obsolete by qualitative changes in the rules of the game. Quantitative predictions of technological progress are made obsolete by unpredictable new inventions. I am interested in the long run, the remote future, where quantitative predictions are meaningless. The only certainty in that remote future is that radically new things will be happening." (Freeman J Dyson, "Disturbing the Universe", 1979)

"[…] the meaning of the word 'solve' has undergone a series of major changes. First that word meant 'find a formula'. Then its meaning changed to 'find approximate numbers'. Finally, it has in effect become 'tell me what the solutions look like'. In place of quantitative answers, we seek qualitative ones." (Ian Stewart, "Nature's Numbers: The unreal reality of mathematics", 1995)

"Quantify. If whatever it is you’re explaining has some measure, some numerical quantity attached to it, you’ll be much better able to discriminate among competing hypotheses. What is vague and qualitative is open to many explanations." (Carl Sagan, "The Demon-Haunted World: Science as a Candle in the Dark", 1995)

"Quantitative knowing is dependent on qualitative knowledge [...] In quantitative data analysis, numbers map onto aspects of reality. Numbers themselves are meaningless unless the data analyst understands the mapping process and the nexus of theory and categorization in which objects under study are conceptualized." John T Behrens, "Principles and Procedures of Exploratory Data Analysis", 1997)

"Modeling, in a general sense, refers to the establishment of a description of a system (a plant, a process, etc.) in mathematical terms, which characterizes the input-output behavior of the underlying system. To describe a physical system […] we have to use a mathematical formula or equation that can represent the system both qualitatively and quantitatively. Such a formulation is a mathematical representation, called a mathematical model, of the physical system." (Guanrong Chen & Trung Tat Pham, "Introduction to Fuzzy Sets, Fuzzy Logic, and Fuzzy Control Systems", 2001)

"Reductionism argues that from scientific theories which explain phenomena on one level, explanations for a higher level can be deduced. Reality and our experience can be reduced to a number of indivisible basic elements. Also qualitative properties are possible to reduce to quantitative ones." (Lars Skyttner, "General Systems Theory: Ideas and Applications", 2001) 

"As every bookie knows instinctively, a number such as reliability - a qualitative rather than a quantitative measure - is needed to make the valuation of information practically useful." (Hans Christian von Baeyer, "Information, The New Language of Science", 2003)

"In order to understand how mathematics is applied to understanding of the real world it is convenient to subdivide it into the following three modes of functioning: model, theory, metaphor. A mathematical model describes a certain range of phenomena qualitatively or quantitatively. […] A (mathematical) metaphor, when it aspires to be a cognitive tool, postulates that some complex range of phenomena might be compared to a mathematical construction." (Yuri I Manin," Mathematics as Metaphor: Selected Essays of Yuri I. Manin", 2007)

"Our culture, obsessed with numbers, has given us the idea that what we can measure is more important than what we can't measure. Think about that for a minute. It means that we make quantity more important than quality." (Donella Meadows, "Thinking in Systems: A Primer", 2008)

"A commonly accepted principle of systems dynamics is that a quantitative change, beyond a critical point, results in a qualitative change. Accordingly, a difference in degree may become a difference in kind. This doesn't mean that an increased quantity of a given variable will bring a qualitative change in the variable itself. However, when the state of a system depends on a set of variables, a quantitative change in one variable beyond the inflection point will result in a change of phase in the state of the system. This change is a qualitative one, representing a whole new set of relationships among the variables involved." (Jamshid Gharajedaghi, "Systems Thinking: Managing Chaos and Complexity A Platform for Designing Business Architecture" 3rd Ed., 2011)

"Whether information comes in a quantitative or qualitative flavor is not as important as how you use it. [...] The key to making a good forecast […] is not in limiting yourself to quantitative information. Rather, it’s having a good process for weighing the information appropriately. […] collect as much information as possible, but then be as rigorous and disciplined as possible when analyzing it. [...] Many times, in fact, it is possible to translate qualitative information into quantitative information." (Nate Silver, "The Signal and the Noise: Why So Many Predictions Fail-but Some Don't", 2012)

"For although it is certainly true that quantitative measurements are of great importance, it is a grave error to suppose that the whole of experimental physics can be brought under this heading. We can start measuring only when we know what to measure: qualitative observation has to precede quantitative measurement, and by making experimental arrangements for quantitative measurements we may even eliminate the possibility of new phenomena appearing." (Heinrich B G Casimir)

Data Science: Encoding/Decoding (Just the Quotes)

"In comparison with Predicate Calculus encoding is of factual knowledge, semantic nets seem more natural and understandable. This is due to the one-to-one correspondence between nodes and the concepts they denote, to the clustering about a particular node of propositions about a particular thing, and to the visual immediacy of 'interrelationships' between concepts, i.e., their connections via sequences of propositional links." (Lenhart K Schubert, "Extending the Expressive Power of Semantic Networks", Artificial Intelligence 7, 1976)

"The digital-computer field defined computers as machines that manipulated numbers. The great thing was, adherents said, that everything could be encoded into numbers, even instructions. In contrast, scientists in AI [artificial intelligence] saw computers as machines that manipulated symbols. The great thing was, they said, that everything could be encoded into symbols, even numbers." (Allen Newell, "Intellectual Issues in the History of Artificial Intelligence", 1983)

"No matter how clever the choice of the information, and no matter how technologically impressive the encoding, a visualization fails if the decoding fails. Some display methods lead to efficient, accurate decoding, and others lead to inefficient, inaccurate decoding. It is only through scientific study of visual perception that informed judgments can be made about display methods." (William S Cleveland, "The Elements of Graphing Data", 1985)

"Good engineering is not a matter of creativity or centering or grounding or inspiration or lateral thinking, as useful as those might be, but of decoding the clever, even witty, messages the solution space carves on the corpses of the ideas in which you believed with all your heart, and then building the road to the next message." (Fred Hapgood, "Up the infinite Corridor: MIT and the Technical Imagination", 1993) 

"The unit of coding is the most basic segment, or element, of the raw data or information that can be assessed in a meaningful way regarding the phenomenon." (Richard Boyatzis, "Transforming qualitative information", 1998)

"The acquisition of information is a flow from noise to order - a process converting entropy to redundancy. During this process, the amount of information decreases but is compensated by constant re-coding. In the recoding the amount of information per unit increases by means of a new symbol which represents the total amount of the old. The maturing thus implies information condensation. Simultaneously, the redundance decreases, which render the information more difficult to interpret." (Lars Skyttner, "General Systems Theory: Ideas and Applications", 2001)

"Knowledge is encoded in models. Models are synthetic sets of rules, and pictures, and algorithms providing us with useful representations of the world of our perceptions and of their patterns." (Didier Sornette, "Why Stock Markets Crash - Critical Events in Complex Systems", 2003) 

"By giving numbers a proper shape, by visually encoding them, the graphic has saved you time and energy that you would otherwise waste if you had to use a table that was not designed to aid your mind." (Alberto Cairo, "The Functional Art", 2011) 

"Swarm intelligence (SI) is a branch of computational intelligence that discusses the collective behavior emerging within self-organizing societies of agents. SI was inspired by the observation of the collective behavior in societies in nature such as the movement of birds and fish. The collective behavior of such ecosystems, and their artificial counterpart of SI, is not encoded within the set of rules that determines the movement of each isolated agent, but it emerges through the interaction of multiple agents." (Maximos A Kaliakatsos-Papakostas et al, "Intelligent Music Composition", 2013)

"Bayesian networks provide a more flexible representation for encoding the conditional independence assumptions between the features in a domain. Ideally, the topology of a network should reflect the causal relationships between the entities in a domain. Properly constructed Bayesian networks are relatively powerful models that can capture the interactions between descriptive features in determining a prediction." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"Encoding is called redundant when different visual channels are used to represent the same information. Redundant encoding is an efficient trick that helps to understand information from diagrams faster, easier, and more accurately. […] To decode information easier, align it with the reality in perspective of both the physical world and cultural conventions. Some things have particular colors, are larger or heavier than other, or are associated with the specific place. If your encoding is not compatible with these properties, readers may wonder why things do not look like they are expected to. Consequently, their auditory is forced to spend extra efforts decoding." (Vasily Pantyukhin, "Principles of Design Diagramming", 2015)

"To keep accuracy and efficiency of your diagrams appealing to a potential audience, explicitly describe the encoding principles we used. Titles, labels, and legends are the most common ways to define the meaning of the diagram and its elements." (Vasily Pantyukhin, "Principles of Design Diagramming", 2015)

"Creating effective visualizations is hard. Not because a dataset requires an exotic and bespoke visual representation - for many problems, standard statistical charts will suffice. And not because creating a visualization requires coding expertise in an unfamiliar programming language [...]. Rather, creating effective visualizations is difficult because the problems that are best addressed by visualization are often complex and ill-formed. The task of figuring out what attributes of a dataset are important is often conflated with figuring out what type of visualization to use. Picking a chart type to represent specific attributes in a dataset is comparatively easy. Deciding on which data attributes will help answer a question, however, is a complex, poorly defined, and user-driven process that can require several rounds of visualization and exploration to resolve." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Maps also have the disadvantage that they consume the most powerful encoding channels in the visualization toolbox - position and size - on an aspect that is held constant. This leaves less effective encoding channels like color for showing the dimension of interest." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"[…] no single visualization is ever quite able to show all of the important aspects of our data at once - there just are not enough visual encoding channels. […] designing effective visualizations to make sense of data is not an art - it is a systematic and repeatable process." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"A map by itself requires little explanation, but once data are superimposed, readers will probably need labels on the maps, and legends explaining encodings like the color of markers." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"One very common problem in data visualization is that encoding numerical variables to area is incredibly popular, but readers can’t translate it back very well." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

19 November 2022

SQL Reloaded: Tricks with Strings via STRING_SPLIT, PATINDEX and TRANSLATE

Searching for a list of words within a column can be easily achieved by using the LIKE operator:

-- searching for several words via LIKE (SQL Server 2000+)
SELECT * 
FROM Production.Product 
WHERE Name LIKE '%chain%'
   OR Name LIKE '%lock%'
   OR Name LIKE '%rim%'
   OR Name LIKE '%spindle%'

The search is quite efficient if an index is defined on the column, though, given the leading wildcards, a clustered index scan is the more likely choice.
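
For reference, one can check which indexes cover the Name column via the system catalog views; a minimal sketch (in AdventureWorks the AK_Product_Name unique index typically exists):

-- checking the indexes covering the Name column (sketch)
SELECT IND.name IndexName
, IND.type_desc IndexType
, COL.name ColumnName
FROM sys.indexes IND
     JOIN sys.index_columns IXC
       ON IND.object_id = IXC.object_id
      AND IND.index_id = IXC.index_id
     JOIN sys.columns COL
       ON IXC.object_id = COL.object_id
      AND IXC.column_id = COL.column_id
WHERE IND.object_id = OBJECT_ID('Production.Product')
  AND COL.name = 'Name'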

If the list of strings to search for becomes bigger, the query becomes at least more difficult to maintain. Using regular expressions could be a solution. Unfortunately, SQL Server has its limitations in working with patterns. For example, it doesn't have a REGEXP_LIKE function, which in Oracle would be used something like this (not tested):

-- Oracle 
SELECT * 
FROM Production.Product 
WHERE REGEXP_LIKE(lower(Name), 'chain|lock|rim|spindle')

However, there's a PATINDEX function which returns the position of a pattern within a string, and which uses the same wildcards that can be used with the LIKE operator:

-- searching for a value via PATINDEX (SQL Server 2000+)
SELECT * 
FROM [Production].[Product] 
WHERE PATINDEX('%rim%', Name)>0

Even if only one of the values can be compared against the Name at a time, retrieving the values from a table or a table-valued function (TVF) does the trick. If the values need to be reused in several places, they can be stored in a table or view. If they are needed only once, a common table expression (CTE) is more appropriate:

-- filtering for several words via PATINDEX (SQL Server 2008+)
WITH CTE 
AS (
  -- table from list of values (SQL Server 2008+)
  SELECT *
  FROM (VALUES ('chain')
   , ('lock')
   , ('rim')
   , ('spindle')) DAT(words)
)
SELECT * 
FROM Production.Product PRD
WHERE EXISTS (
	SELECT *
	FROM CTE
	WHERE PATINDEX('%'+ CTE.words +'%', PRD.Name)>0
	)

The query should return the same records as the first query above!
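
If the list of words is reused across several queries, it can be persisted instead; a minimal sketch using a hypothetical dbo.SearchWords table:

-- persisting the search words in a table (hypothetical example)
CREATE TABLE dbo.SearchWords (
  Word nvarchar(50) NOT NULL
)

INSERT INTO dbo.SearchWords (Word)
VALUES ('chain'), ('lock'), ('rim'), ('spindle')

-- filtering for several words via PATINDEX and the persisted list
SELECT * 
FROM Production.Product PRD
WHERE EXISTS (
	SELECT *
	FROM dbo.SearchWords SRW
	WHERE PATINDEX('%'+ SRW.Word +'%', PRD.Name)>0
	)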

Besides one's own UDFs (see SplitListWithIndex or SplitList), starting with SQL Server 2017 the STRING_SPLIT function can be used to return the same values as a TVF:

-- filtering for several words via PATINDEX & STRING_SPLIT (SQL Server 2017+)
SELECT * 
FROM Production.Product PRD
WHERE EXISTS (
	SELECT *
	FROM STRING_SPLIT('chain|lock|rim|spindle', '|') SPL
	WHERE PATINDEX('%'+ SPL.value +'%', PRD.Name)>0
	)

A dynamic list of values can be built as well. For example, the list of words can be obtained from a table and the STRING_SPLIT function:

-- listing the words appearing in a column (SQL Server 2017+)
SELECT DISTINCT SPL.value
FROM Production.Product PRD
     CROSS APPLY STRING_SPLIT(Name, ' ') SPL
ORDER BY SPL.value

One can remove the special characters, the numeric values, as well as the 1- and 2-letter words:

-- listing the words appearing in a column (SQL Server 2017+)
SELECT DISTINCT SPL.value
FROM Production.Product PRD
     CROSS APPLY STRING_SPLIT(Replace(Replace(Replace(Replace(Name, '-', ' '), ',', ' '), '/', ' '), '''', ' '), ' ') SPL
WHERE IsNumeric(SPL.value) = 0 -- removing numbers
  AND Len(SPL.value)>2 -- removing single/double letters
ORDER BY SPL.value

The output looks better, though the more complex the text, the more replacements need to be made. An alternative to a UDF (see ReplaceSpecialChars) is the TRANSLATE function, which replaces each character from a list with the corresponding character from a second list. One needs to be careful to provide a 1:1 mapping between the two lists, the REPLICATE function doing the trick:

-- replacing special characters via TRANSLATE (SQL Server 2017+)
SELECT TRANSLATE(Name, '-,/''', Replicate(' ', 4))
FROM Production.Product PRD

Now the query becomes:

-- listing the words appearing in a column using TRANSLATE (SQL Server 2017+)
SELECT DISTINCT SPL.value
FROM Production.Product PRD
     CROSS APPLY STRING_SPLIT(TRANSLATE(Name, '-,/''', Replicate(' ', 4)), ' ') SPL
WHERE IsNumeric(SPL.value) = 0 -- removing numbers
  AND Len(SPL.value)>2 -- removing single/double letters
ORDER BY SPL.value

Happy coding!

05 November 2022

SQL Reloaded: STRING_AGG and STRING_SPLIT at Work, and a Bit of Pivoting

Working with strings across records was for a long time a nightmare for SQL developers, until Microsoft introduced STRING_SPLIT in SQL Server 2016, respectively STRING_AGG in SQL Server 2017. Before that, one was forced to write procedural code or use workarounds, at least until SQL Server 2005 introduced recursive CTEs (common table expressions), ranking functions and PIVOT, which allowed handling many scenarios. 

Microsoft provides several examples for the usage of the STRING_SPLIT and STRING_AGG functions based on the AdventureWorks database, though let's look at another example based on the same database. 

Let's say we want to show the concatenated Contacts for a store, a result which can now easily be obtained by using STRING_AGG:

-- concatenating names per store via STRING_AGG (SQL Server 2017+)
SELECT BusinessEntityID
, STRING_AGG(Concat(FirstName, ' ', LastName), ';') Contacts
FROM Sales.vStoreWithContacts
GROUP BY BusinessEntityID
HAVING count(*)>1

Observe that a GROUP BY is needed to show one record per Store. Unfortunately, there isn't yet a window function equivalent available for the same. 

The inverse operation can be performed with the help of the STRING_SPLIT table-valued function (TVF). (If you wonder why a TVF is needed, it is because the initial record needs to be multiplied by the generated output.)

-- reversing the concatenation (SQL Server 2022+, as the ordinal argument of STRING_SPLIT requires it)
WITH CTE
AS (
	-- concatenating names per store
	SELECT BusinessEntityID
	, STRING_AGG(Concat(FirstName, ' ', LastName), ';') Contacts
	FROM Sales.vStoreWithContacts
	GROUP BY BusinessEntityID
	HAVING count(*)>1
) 
SELECT CTE.BusinessEntityID
, DAT.Value
, DAT.Ordinal 
FROM CTE
    CROSS APPLY STRING_SPLIT(Contacts, ';', 1) DAT

STRING_SPLIT also provides an ordinal field, which in theory could be used to pivot the values, though we'd then return to where we started. Instead of using the query just generated, let's look at an alternative solution for concatenating strings across records, available since SQL Server 2005:
 
-- concatenating names per store via PIVOT (SQL Server 2012+)
SELECT BusinessEntityID
, [1] Contact1
, [2] Contact2
, [3] Contact3
, [4] Contact4
, Concat([1], IsNull(';' + [2], ''), IsNull(';' + [3], ''), IsNull(';' + [4], '')) Contacts
FROM (
	-- concatenating names and adding a rank
	SELECT BusinessEntityID
	, Concat(FirstName, ' ', LastName) Contact
	, ROW_NUMBER() OVER(PARTITION BY BusinessEntityID ORDER BY FirstName) Ranking
	FROM Sales.vStoreWithContacts
) PER
PIVOT (
    Max(Contact)
	FOR Ranking IN ([1], [2], [3], [4])
) AS DAT

The Concat function would need to be rewritten though, to port the code to SQL Server 2005. 
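
A minimal sketch of how the query might be ported, replacing Concat with the + operator and IsNull (and assuming FirstName and LastName are not NULL):

-- concatenating names per store via PIVOT without Concat (SQL Server 2005 sketch)
SELECT BusinessEntityID
, [1] Contact1
, [2] Contact2
, [3] Contact3
, [4] Contact4
, IsNull([1], '') + IsNull(';' + [2], '') + IsNull(';' + [3], '') + IsNull(';' + [4], '') Contacts
FROM (
	-- concatenating names and adding a rank
	SELECT BusinessEntityID
	, FirstName + ' ' + LastName Contact
	, ROW_NUMBER() OVER(PARTITION BY BusinessEntityID ORDER BY FirstName) Ranking
	FROM Sales.vStoreWithContacts
) PER
PIVOT (
    Max(Contact)
	FOR Ranking IN ([1], [2], [3], [4])
) AS DAT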

Talking about workarounds for splitting strings, in certain scenarios I used a combination of CutLeft & CutRight functions, which proved to be useful in data migrations, or used my own version of STRING_SPLIT (see SplitListWithIndex or SplitList). For concatenations I used mainly CTEs (see example) or cursors for exceptional cases (see example).
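
For reference, another widely used pre-SQL Server 2017 workaround for concatenating strings across records (not one of the approaches referenced above) relies on FOR XML PATH, with STUFF removing the leading separator; a minimal sketch:

-- concatenating names per store via FOR XML PATH (SQL Server 2005+ sketch)
SELECT VSC.BusinessEntityID
, Stuff((
	SELECT ';' + FirstName + ' ' + LastName
	FROM Sales.vStoreWithContacts SRC
	WHERE SRC.BusinessEntityID = VSC.BusinessEntityID
	ORDER BY FirstName
	FOR XML PATH('') -- note: special XML characters (&, <, >) would be escaped with this simple form
   ), 1, 1, '') Contacts
FROM Sales.vStoreWithContacts VSC
GROUP BY VSC.BusinessEntityID
HAVING count(*)>1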

30 October 2022

SQL Reloaded: The WINDOW Clause in SQL Server 2022 (Part III: Ranking)

In two previous posts I showed how to use the newly introduced WINDOW clause in SQL Server 2022 for simple aggregations, respectively running totals, by providing some historical context concerning what it took to do the same simple aggregations as SUM or AVG within previous versions of SQL Server. Let's look at another scenario based on the previously created Sales.vSalesOrders view - ranking records within a partition. 

There are 4 ranking functions that work across partitions: Row_Number, Rank, Dense_Rank and NTile. However, in SQL Server 2000 only Row_Number could be easily implemented, and only if there was a unique identifier (or one could be created on the fly):

-- ranking based on correlated subquery (SQL Server 2000+)
SELECT SOL.SalesOrderId 
, SOL.ProductId
, SOL.OrderDate
, SOL.[Year]
, SOL.[Month]
, SOL.OrderQty
, (-- correlated subquery
  SELECT count(SRT.SalesOrderId)
  FROM Sales.vSalesOrders SRT
  WHERE SRT.ProductId = SOL.ProductId 
    AND SRT.[Year] = SOL.[Year]
	AND SRT.[Month] = SOL.[Month]
    AND SRT.SalesOrderId <= SOL.SalesOrderId
   ) RowNumberByDate
FROM Sales.vSalesOrders SOL
WHERE SOL.ProductId IN (745)
  AND SOL.[Year] = 2012
  AND SOL.[Month] BETWEEN 1 AND 3
ORDER BY SOL.[Year]
, SOL.[Month]
, SOL.OrderDate ASC

As an alternative for implementing the other ranking functions, one could use procedural code for looping, though this approach was not recommendable given the performance concerns.

SQL Server 2005 introduced all 4 ranking functions, as they are in use also today:

-- ranking functions (SQL Server 2005+)
SELECT SOL.SalesOrderId 
, SOL.ProductId
, SOL.OrderDate
, SOL.[Year]
, SOL.[Month]
, SOL.OrderQty
-- rankings
, Row_Number() OVER (PARTITION BY SOL.ProductId, SOL.[Year], SOL.[Month] ORDER BY SOL.OrderQty DESC) RowNumberQty
, Rank() OVER (PARTITION BY SOL.ProductId, SOL.[Year], SOL.[Month] ORDER BY SOL.OrderQty DESC) AS RankQty
, Dense_Rank() OVER (PARTITION BY SOL.ProductId, SOL.[Year], SOL.[Month] ORDER BY SOL.OrderQty DESC) AS DenseRankQty
, NTile(4) OVER (PARTITION BY SOL.ProductId, SOL.[Year], SOL.[Month] ORDER BY SOL.OrderQty DESC) AS NTileQty
FROM Sales.vSalesOrders SOL
WHERE SOL.ProductId IN (745)
  AND SOL.[Year] = 2012
  AND SOL.[Month] BETWEEN 1 AND 3
ORDER BY SOL.[Year]
, SOL.[Month]
, SOL.OrderQty DESC

Now, in SQL Server 2022 the WINDOW clause allows simplifying the query as follows by defining the partition only once:

-- ranking functions (SQL Server 2022+)
SELECT SOL.SalesOrderId 
, SOL.ProductId
, SOL.OrderDate
, SOL.[Year]
, SOL.[Month]
, SOL.OrderQty
-- rankings
, Row_Number() OVER SalesByMonth AS RowNumberQty
, Rank() OVER SalesByMonth AS RankQty
, Dense_Rank() OVER SalesByMonth AS DenseRankQty
, NTile(4) OVER SalesByMonth AS NTileQty
FROM Sales.vSalesOrders SOL
WHERE SOL.ProductId IN (745)
  AND SOL.[Year] = 2012
  AND SOL.[Month] BETWEEN 1 AND 3
WINDOW SalesByMonth AS (PARTITION BY SOL.ProductId, SOL.[Year], SOL.[Month] ORDER BY SOL.OrderQty DESC)
ORDER BY SOL.[Year]
, SOL.[Month]
, SOL.OrderQty DESC

Forward (and backward) referencing of one window into the other can be used with ranking functions as well:
 
-- ranking functions with ascending/descending sorting (SQL Server 2022+)
SELECT SOL.SalesOrderId 
, SOL.ProductId
, SOL.OrderDate
, SOL.[Year]
, SOL.[Month]
, SOL.OrderQty
-- rankings (descending)
, Row_Number() OVER SalesByMonthSortedDESC AS DescRowNumberQty
, Rank() OVER SalesByMonthSortedDESC AS DescRankQty
, Dense_Rank() OVER SalesByMonthSortedDESC AS DescDenseRankQty
, NTile(4) OVER SalesByMonthSortedDESC AS DescNTileQty
-- rankings (ascending)
, Row_Number() OVER SalesByMonthSortedASC AS AscRowNumberQty
, Rank() OVER SalesByMonthSortedASC AS AscRankQty
, Dense_Rank() OVER SalesByMonthSortedASC AS AscDenseRankQty
, NTile(4) OVER SalesByMonthSortedASC AS AscNTileQty
FROM Sales.vSalesOrders SOL
WHERE SOL.ProductId IN (745)
  AND SOL.[Year] = 2012
  AND SOL.[Month] BETWEEN 1 AND 3
WINDOW SalesByMonth AS (PARTITION BY SOL.ProductId, SOL.[Year], SOL.[Month])
, SalesByMonthSortedDESC AS (SalesByMonth ORDER BY SOL.OrderQty DESC)
, SalesByMonthSortedASC AS (SalesByMonth ORDER BY SOL.OrderQty ASC)
ORDER BY SOL.[Year]
, SOL.[Month]
, SOL.OrderQty DESC

Happy coding!


SQL Reloaded: The WINDOW Clause in SQL Server 2022 (Part II: Running Totals)

In the previous post I showed how to use the newly introduced WINDOW clause in SQL Server 2022 by providing some historical context concerning what it took to do the same simple aggregations as SUM or AVG within previous versions of SQL Server. Let's look at another scenario based on the previously created Sales.vSalesOrders view - using the same two functions to provide running totals. 

A running total is a summation of a sequence of values (e.g. OrderQty) ordered by a key, typically one or more columns that uniquely define the sequence, something like a unique record identifier (e.g. SalesOrderId). Moreover, the running total can be defined within a partition (e.g. Year/Month).

In SQL Server 2000, calculating the running total and average for a value within a partition came down to implementing the logic for each attribute in correlated subqueries. One thus needed one subquery for each attribute, resulting in multiple passes over the base tables. Even if the dataset was already in memory, this still involved a major overhead.

-- running total/average based on correlated subquery (SQL Server 2000+)
SELECT SOL.SalesOrderId 
, SOL.ProductId
, SOL.OrderDate
, SOL.[Year]
, SOL.[Month]
, SOL.OrderQty
, (-- correlated subquery
  SELECT SUM(SRT.OrderQty)
  FROM Sales.vSalesOrders SRT
  WHERE SRT.ProductId = SOL.ProductId 
    AND SRT.[Year] = SOL.[Year]
	AND SRT.[Month] = SOL.[Month]
	AND SRT.SalesOrderId <= SOL.SalesOrderId
   ) RunningTotalQty
, (-- correlated subquery
  SELECT AVG(SRT.OrderQty)
  FROM Sales.vSalesOrders SRT
  WHERE SRT.ProductId = SOL.ProductId 
    AND SRT.[Year] = SOL.[Year]
	AND SRT.[Month] = SOL.[Month]
	AND SRT.SalesOrderId <= SOL.SalesOrderId
   ) RunningAvgQty
FROM Sales.vSalesOrders SOL
WHERE SOL.ProductId IN (745)
  AND SOL.[Year] = 2012
  AND SOL.[Month] BETWEEN 1 AND 3
ORDER BY SOL.SalesOrderId 

Then SQL Server 2005 allowed consolidating the different correlated subqueries into one by using the CROSS APPLY join:

-- running total/average based on correlated subquery with CROSS APPLY (SQL Server 2005+)
SELECT SOL.SalesOrderId 
, SOL.ProductId
, SOL.OrderDate
, SOL.[Year]
, SOL.[Month]
, SOL.OrderQty
, SRT.RunningTotalQty
, SRT.RunningAvgQty
FROM Sales.vSalesOrders SOL
     CROSS APPLY (-- correlated subquery
	  SELECT SUM(SRT.OrderQty) RunningTotalQty
	  , AVG(SRT.OrderQty) RunningAvgQty
	  FROM Sales.vSalesOrders SRT
	  WHERE SRT.ProductId = SOL.ProductId 
		AND SRT.[Year] = SOL.[Year]
		AND SRT.[Month] = SOL.[Month]
		AND SRT.SalesOrderId <= SOL.SalesOrderId
   ) SRT
WHERE SOL.ProductId IN (745)
  AND SOL.[Year] = 2012
  AND SOL.[Month] BETWEEN 1 AND 3
ORDER BY SOL.SalesOrderId 

Even if SQL Server 2005 introduced window functions that work on a partition, an ORDER BY clause was supported only with SQL Server 2012:

-- running total/average based on window functions (SQL Server 2012+)
SELECT SOL.SalesOrderId 
, SOL.ProductId
, SOL.OrderDate
, SOL.[Year]
, SOL.[Month]
, SOL.OrderQty
, SUM(SOL.OrderQty) OVER (PARTITION BY SOL.ProductId, SOL.[Year], SOL.[Month] ORDER BY SOL.SalesOrderId) RunningTotalQty
, AVG(SOL.OrderQty) OVER (PARTITION BY SOL.ProductId, SOL.[Year], SOL.[Month] ORDER BY SOL.SalesOrderId) RunningAvgQty
FROM Sales.vSalesOrders SOL
WHERE SOL.ProductId IN (745)
  AND SOL.[Year] = 2012
  AND SOL.[Month] BETWEEN 1 AND 3
ORDER BY SOL.SalesOrderId 

Now, in SQL Server 2022 the WINDOW clause allows simplifying the query as follows by defining the partition only once:

-- running total/average based on window functions and clause (SQL Server 2022+)
SELECT SOL.SalesOrderId 
, SOL.ProductId
, SOL.OrderDate
, SOL.[Year]
, SOL.[Month]
, SOL.OrderQty
, SUM(SOL.OrderQty) OVER SalesByMonth AS RunningTotalQty
, AVG(SOL.OrderQty) OVER SalesByMonth AS RunningAvgQty
FROM Sales.vSalesOrders SOL
WHERE SOL.ProductId IN (745)
  AND SOL.[Year] = 2012
  AND SOL.[Month] BETWEEN 1 AND 3
WINDOW SalesByMonth AS (PARTITION BY SOL.ProductId, SOL.[Year], SOL.[Month] ORDER BY SOL.SalesOrderId)
ORDER BY SOL.SalesOrderId 

Moreover, the WINDOW clause allows forward (and backward) referencing of one window into the other:

-- running total/average based on window functions and clause (SQL Server 2022+)
SELECT SOL.SalesOrderId 
, SOL.ProductId
, SOL.OrderDate
, SOL.[Year]
, SOL.[Month]
, SOL.OrderQty
-- simple aggregations
, SUM(SOL.OrderQty) OVER SalesByMonth AS TotalQty
, AVG(SOL.OrderQty) OVER SalesByMonth AS AvgQty
-- running totals
, SUM(SOL.OrderQty) OVER SalesByMonthSorted AS RunningTotalQty
, AVG(SOL.OrderQty) OVER SalesByMonthSorted AS RunningAvgQty
FROM Sales.vSalesOrders SOL
WHERE SOL.ProductId IN (745)
  AND SOL.[Year] = 2012
  AND SOL.[Month] BETWEEN 1 AND 3
WINDOW SalesByMonth AS (PARTITION BY SOL.ProductId, SOL.[Year], SOL.[Month])
, SalesByMonthSorted AS (SalesByMonth ORDER BY SOL.SalesOrderId)
ORDER BY SOL.SalesOrderId 

Happy coding!

SQL Reloaded: Querying Azure Synapse Metadata for the D365 CRM & FO Tables Exported via Azure Synapse Link for Dataverse

Enabling Azure Synapse Link for Dataverse for Dynamics 365 CRM (D365 CRM) or for Dynamics 365 Finance & Operations (D365 FO) makes all the tables needed for reporting available in a CDM structure; however, the CSV files with the data don't have any headers, which makes it challenging to use the files directly without the corresponding metadata. For example, attempting to define external tables would be useless without proper column names and data types. Fortunately, the attributes' names and data types are available in JSON files and can be queried. 
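
For illustration, a minimal sketch of what such an external table over the headerless CSV files might look like once the metadata is known (the column list, the CsvFormat file format and the location below are hypothetical, and the external data source is assumed to exist already):

-- hypothetical external table over the exported CSV files (Synapse serverless SQL pool)
CREATE EXTERNAL TABLE dbo.lead (
    Id nvarchar(100)
  , SinkCreatedOn datetime2
  , SinkModifiedOn datetime2
  -- further columns follow, in the same order and with the data types taken from the metadata
)
WITH (
    LOCATION = '/(container name)/lead/',
    DATA_SOURCE = [(data source)],
    FILE_FORMAT = CsvFormat -- an external file format for DELIMITEDTEXT, assumed to exist
)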

D365 CRM 

For CRM, the file names start with the tables' names and have the EntityMetadata.json postfix; the files are stored in the Microsoft.Athena.TrickleFeedService folder, which lies next to the other folders containing the data. Given that all files are in the same folder, the following query can be used to export the metadata for all or some of the tables. Just replace "(container name)" and "(data source)" with the corresponding values for your environment.

 
-- export definition for D365 CRM' CDM (all attributes, all tables)
SELECT DAT.EntityName
, DAT.AttributeName
, DAT.Timestamp
, DAT.AttributeType
, DAT.MetadataId
, DAT.Precision
, DAT.MaxLength
FROM openrowset(
        bulk '/(container name)/Microsoft.Athena.TrickleFeedService/*-EntityMetadata.json',
        data_source = '(data source)',
        format = 'csv',
        fieldterminator ='0x0b',
        fieldquote = '0x0b',
        rowterminator = '0x0b' --> You need to override rowterminator to read classic JSON
    ) with (doc nvarchar(max)) as rows
       CROSS APPLY OPENJSON(doc, '$.AttributeMetadata')--definitions[0].hasAttributes
       with (
         EntityName nvarchar(255) '$.EntityName'
        , AttributeName nvarchar(255) '$.AttributeName'
        , Timestamp nvarchar(50) '$.Timestamp'
        , AttributeType nvarchar(50) '$.AttributeType'
        , MetadataId nvarchar(100) '$.MetadataId'
        , Precision int '$.Precision'
        , MaxLength int '$.MaxLength'
     ) as DAT
WHERE DAT.EntityName IN ('lead', 'opportunity', 'product')

D365 FO
 

For Finance & Operations, the folder structure is more complex, a whole hierarchy of folders being built based on a set of predefined categories. One would need to traverse the hierarchy to find the files; thus, it might be easier to generate the metadata for each table individually. The file names start with the tables' names and have the .cdm.json postfix. The query below generates the metadata for the InventTable table. Just replace "(container name)", "(instance name)" and "(data source)" with the values for your environment.
 
-- export definition for D365 FO's CDM (all attributes, specific table)
SELECT DAT.name
, DAT.dataFormat
FROM openrowset(
        bulk '/(container name)/(instance name)/Tables/SupplyChain/ProductInformationManagement/Main/InventTable.cdm.json',
        data_source = '(data source)',
        format = 'csv',
        fieldterminator ='0x0b',
        fieldquote = '0x0b',
        rowterminator = '0x0b' --> You need to override rowterminator to read classic JSON
    ) with (doc nvarchar(max)) as rows
       CROSS APPLY OPENJSON(doc, '$.definitions[0].hasAttributes')
       with (
          name nvarchar(255) '$.name'
        , dataFormat nvarchar(50) '$.dataFormat'
     ) as DAT

Further Steps

One can thus obtain the needed metadata; however, after a first inspection, the data types are too general compared with the ones defined for the attributes in the data source. One can probably work with these data types as well, though Microsoft recommends minimizing the row length by using the smallest possible column size, which leads to better query performance. Exporting the metadata from the source system and matching the two datasets based on table and attribute names would allow addressing this recommendation, even if this implies checking from time to time whether the definitions changed. The trick is to keep the same column order as in the Synapse files. Unfortunately, not all SQL Server data types are supported (e.g. text, ntext, sql_variant, etc.) or are ideal to work with (e.g. money); however, there are alternative data types that can be used. 
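
A minimal sketch of such a matching, assuming the Synapse metadata from the query above was saved into a hypothetical dbo.SynapseColumns table and the source system metadata into a hypothetical dbo.SourceColumns table (TableName, ColumnName, DataType, MaxLength):

-- matching Synapse and source metadata by table and attribute names (hypothetical tables)
SELECT SYN.EntityName
, SYN.AttributeName
, SYN.AttributeType SynapseType
, SRC.DataType SourceType
, SRC.MaxLength
FROM dbo.SynapseColumns SYN
     LEFT JOIN dbo.SourceColumns SRC
       ON SRC.TableName = SYN.EntityName
      AND SRC.ColumnName = SYN.AttributeName
ORDER BY SYN.EntityName
, SYN.AttributeName -- in practice, preserve the column order from the Synapse files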