21 October 2023

Graphical Representation: Overreaching in Data Visualizations

Graphical Representation
Graphical Representation Series 

One of the most important aspects to stress in the context of graphical design is the purpose of graphical representations and the medium in which they are communicated. For example, one needs to differentiate between the graphics propagated via the various media channels that target public consumption and potential customers (books, newspapers, articles in paper or paperless form, respectively blog posts and similar content) and graphics made for organizational use (reports, dashboards or presentations).

If the former graphics are supposed to back up a story, the reader being led in one direction or another, the author having the freedom of choosing the direction and the message, in the latter, unless the content is supposed to support, persuade or force a decision, the facts and data need to be presented in an equidistant manner, in a form that supports insights, decision making or further inference. This applies to data professionals as well as to the business users preparing the data.

Data visualization authors tend to use the title and subtitle to highlight in reports and dashboards the most important findings as per their perception, sometimes even stating the obvious. One of the issues with this approach is that the audience might just pick up the respective information without looking further at the chart, maybe missing more important facts. Just highlighting an element in the graphic or providing explanatory headlines is not storytelling, even if it helps in the process. Ideally, the data itself as depicted by the visuals should tell the story! Further information with storytelling character should be provided in the presentation of the data and tailored accordingly for the audience!

With a few exceptions, the information and decisions shouldn't be forced on the audience. There are so many examples on the various social networks in which data analysts or other types of data professionals seem to do exactly this in the content they share, and this is wrong on many levels!

No matter how deep a data professional is involved in the business and no matter how extensive his/her knowledge about the systems, data and processes is, the business user and the manager are the closest to the business context and needs, while data professionals might not be aware of the full extent. This lack of context makes it challenging to interpret the trends depicted by the data, respectively to associate the changes observed in trends with decisions made or issues the business dealt with. When such knowledge is not available, the data professional tends to extrapolate instead of identifying the chain of causality together with the business (and here annotation capabilities would help considerably).

Moreover, it falls on management's shoulders to decide which facts, data, metrics, KPIs and information are important for the organization. A data professional can make recommendations, can play with the data and communicate certain insights, gaps or courses of action, though the management decides what's important and how the respective information should be communicated! Overstepping the boundaries can easily lead to unnecessary conflict in which the data professional can easily lose, even if the facts are in their favor. It's enough to deal with missing or incorrect information for the whole story to fall apart.

It's true that some of the books on graphical design use various highlighting techniques in the explanation process, but they are intended for the readers to understand what the authors want to say. Unfortunately, there are also examples used improperly, or cases in which the authors' opinions diverge from common sense. Independently of this, data professionals should develop their own visual critical thinking and validate the techniques used against their own judgement!

Previous Post <<||>> Next Post

Data Warehousing: Building a Data Lakehouse for Dynamics 365 Environments with Serverless SQL Pool

Data Warehousing
Data Warehousing Series

One of the major limitations of Microsoft Dynamics 365 is the lack of direct access to the production databases for reporting purposes via standard reporting or ETL/ELT tools. Of course, one can attempt to use OData-based solutions though they don't scale with the data volume and imply further challenges. 

At the beginning, Microsoft attempted to address this limitation by allowing the export of data entities into the customer's Azure SQL database, a feature known as bring your own database (BYOD). Highly dependent on batch jobs, the feature doesn't support real-time synchronization or composite entities, and is constrained by the BYOD database's capacity, scalability becoming a bottleneck after a certain point.

Then Microsoft started to work on two solutions for synchronizing the Dynamics 365 data in near-real time (ca. 10-30 minutes) to the Data Lake: the Export to Data Lake add-in (*), respectively the Azure Synapse Link for Dataverse with Azure Data Lake. The former allows the synchronization of the tables from Finance & Operations (it doesn't work for CRM) to files that reflect the database model from the source. In contrast, the latter allows the synchronization of data entities to similar structures, and will probably support tables as well. Because the service works via Dataverse, it also supports the synchronization of CRM data.

The below diagram depicts the flow of data from the D365 environments to the Data Lake, the arrows indicating the direction of the flow. One more arrow could be drawn between Dynamics 365 Finance & Operations and the Azure Synapse Link for Dataverse service, though one may choose to use only the Export to Data Lake add-in, given that a data model based on the tables offers more flexibility (at the expense of more effort though). Data from other systems can be exported via pipelines to the Data Lake to provide an integrated and complete view of the business.

Serverless Data Lakehouse

Once the data are available in the Delta Lake, they can be consumed directly by standard and Power BI paginated reports; however, building the data model will involve considerable effort and logic duplication between reports. Therefore, it makes sense to prepare the data upfront in the Data Lake, when possible, and here the serverless SQL pool can help in building an enterprise data model. The former approach can still be used for rapid prototyping or data discovery.

The serverless SQL pool is a stateless, SQL-based distributed query service over the Azure Data Lake for large-scale data and computational functions. Even if it doesn't support standard tables, it allows making the data from the Data Lake files available for processing via external tables, a mechanism that maps the files' structure to an entity that can be queried like a normal view (though it supports only read operations).
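
As a minimal sketch, an external table over the synchronized files could be defined as follows. The storage account, container, folder structure and columns are placeholders, and parquet is assumed as file format (the actual format depends on the synchronization mechanism used):

-- defining the external data source and file format (names and location are placeholders)
CREATE EXTERNAL DATA SOURCE DataLakeStorage
WITH (LOCATION = 'https://mystorageaccount.dfs.core.windows.net/mycontainer');
GO

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);
GO

-- exposing a synchronized table as an external table (columns kept to a minimum)
CREATE EXTERNAL TABLE dbo.SalesTable (
  SalesId nvarchar(20)
, DataAreaId nvarchar(4)
, RecId bigint
, Partition bigint
)
WITH (
  LOCATION = '/Tables/SalesTable/'
, DATA_SOURCE = DataLakeStorage
, FILE_FORMAT = ParquetFormat
);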

Further on, the enterprise data model can be built like in a normal Data Warehouse via the supported objects (views, stored procedures and table-valued functions). These objects can be called from standard and Power BI paginated reports, the queries being processed anew at runtime, which might occasionally result in poor performance. However, the architecture is supposed to scale automatically as needed.

If further performance is needed, parts of the logic or the end result can be exported to the Data Lake, and here the Medallion Architecture should be considered when appropriate. Upon case, further processing might be needed to handle the limitations of the serverless SQL pool (e.g. flattening hierarchies, handling data quality issues).

One can work around the lack of standard table support, needed especially for value mappings, by storing the respective data as files, occasionally by misusing views, respectively by generating Spark tables via the Spark pool.
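
For small, relatively static mappings, a view built over a VALUES constructor can act as a surrogate for a mapping table. A minimal sketch, the schema, view name and mapped values being hypothetical:

-- a value-mapping view as surrogate for a standard table (hypothetical values)
CREATE OR ALTER VIEW EDM.vRegionMapping
AS
SELECT DAT.CountryCode
, DAT.Region
FROM (VALUES
   ('DE', 'EMEA')
 , ('US', 'AMER')
 , ('JP', 'APAC')
 ) DAT(CountryCode, Region);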

Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data Lake to Synapse Link. Microsoft advises new customers to use Synapse Link.

Previous Post <<||>> Next Post

Data Warehousing: Dynamics 365, the Data Lakehouse and the Medallion Architecture

Data Warehousing
Data Warehousing Series

An IT architecture is built and functions under a set of constraints that derive from the architecture's components. Usually, if we want flexibility or to change something in one area, this might have an impact on another area. This rule applies to the usage of the medallion architecture as well!

In Data Warehousing, the medallion architecture considers a multilayered approach in building a single source of truth, each layer denoting the quality of the data stored in the lakehouse [1]. For the moment, three layers are defined - bronze for raw data, silver for validated data, and gold for enriched data. The concept seems sound considering that a Data Lake contains all types of raw data of different quality that need to be validated and prepared for reporting or other purposes.

On the other side, there are systems like Dynamics 365 that synchronize the data in near-real-time to the Data Lake through various mechanisms at table and/or data entity level (think of data entities as views on top of other tables or views). The databases behind are relational and, in theory, the data should be of the quality needed by the business.

The greatest benefit of the serverless SQL pool is that it can be used to build near-real-time data analytics solutions on top of the files existing in the Data Lake, and the mechanism is quite simple. On top of such files, external tables are built in the serverless SQL pool, tables that reflect the data model from the source systems. The external tables can be called like any other table from the various database objects (views, stored procedures and table-valued functions). Thus, an enterprise data model with dimensions, fact-like and mart-like entities can be built on top of the files synchronized to the Data Lake. The Data Lakehouse (= Data Warehouse + Data Lake) thus created can be used for (enterprise) reporting and other purposes.
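
For example, a dimension-like entity could look as follows - a minimal sketch assuming that the CustTable and DirPartyTable tables were already exposed as external tables (the EDM schema and the selected columns are placeholders):

-- a dimension-like view built on top of external tables (minimal sketch)
CREATE OR ALTER VIEW EDM.vCustomers
AS
SELECT CST.AccountNum CustomerAccount
, DPT.Name CustomerName
, Upper(CST.DataAreaId) Company
, CST.RecId
, CST.Partition
FROM dbo.CustTable CST
     LEFT JOIN dbo.DirPartyTable DPT
       ON CST.Party = DPT.RecId
      AND CST.Partition = DPT.Partition;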

As long as there are no special requirements for data processing (e.g. flattening hierarchies, complex data processing, high performance, data cleaning), this approach allows reporting the data from the data sources in near-real time (10-30 minutes), which can prove to be useful for operational and tactical reporting. Tapping into this model via standard Power BI and paginated reports is quite easy.

Now, if one uses the medallion approach and relies on pipelines to process the data, unless the data can be processed in near-real-time or something comparable to it, a considerable delay will be introduced, a delay that can span from a couple of hours to one day. It's also true that having the data prepared as needed by the reports can increase the performance considerably as compared to processing the logic at runtime. There are advantages and disadvantages to both approaches.

Probably, the most important scenario that needs to be handled is that of integrating the data from different sources. If unique mappings between values exist, unique references to the records from the other system are available, respectively a unique logic can be identified, the data integration can be handled in the serverless SQL pool.

Unfortunately, when compared to on-premises or Azure SQL functionality, the serverless SQL pool has important constraints - it's not possible to use scalar UDFs, tables, recursive CTEs, etc. So, one needs to work around these limitations and in some cases use the Spark pool or pipelines. So, at least for exceptions and maybe for strategic reporting, a medallion architecture can make sense and be used in parallel. However, imposing it on all the data can reduce flexibility!
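
As an example, logic that would otherwise live in a scalar UDF can often be rewritten as an inline table-valued function and applied via CROSS APPLY. A minimal sketch, the function, the fiscal-year logic and the referenced view being hypothetical:

-- an inline table-valued function as workaround for the missing scalar UDFs
CREATE OR ALTER FUNCTION EDM.GetFiscalYear(
  @Date date)
RETURNS TABLE
AS RETURN (
  SELECT CASE WHEN Month(@Date) >= 10 THEN Year(@Date) + 1 ELSE Year(@Date) END FiscalYear
);
GO

-- calling the function via CROSS APPLY instead of a scalar function call
SELECT INV.InvoiceDate
, FIS.FiscalYear
FROM EDM.vSalesInvoices INV
     CROSS APPLY EDM.GetFiscalYear(INV.InvoiceDate) FIS;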

Bottom line: consider the architecture against your requirements!

Previous Post <<||>> Next Post

[1] What is the medallion lakehouse architecture?
https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion

20 October 2023

SQL Reloaded: Extended LTrim/RTrim in SQL Server 2022 (Before and After)

In SQL Server 2022, the behavior of the LTrim (left trimming) and RTrim (right trimming) functions was extended with one more string parameter. When provided, the engine removes the respective characters from the start (for LTrim), respectively from the end (for RTrim) of the first parameter, the same way the space character char(32) was removed previously:

-- prior behavior of LTrim/RTrim
DECLARE @text as nvarchar(50) = '  123  '
SELECT '(' + LTrim(@text) + ')' LeftTrimming
, '(' + RTrim(@text) + ')' RightTrimming
, '(' + Ltrim(RTrim(@text)) + ')' Trimming1 -- prior to SQL Server 2017
, '(' + Trim(@text) + ')' Trimming2 -- starting with SQL 2017
LeftTrimming RightTrimming Trimming1 Trimming2
(123 ) ( 123) (123) (123)

Here's the new behavior:

-- extended behavior of LTrim/RTrim (SQL Server 2022+)
DECLARE @text as nvarchar(50) = '123abc123abc'
SELECT LTrim(@text , '123') LeftTrimming
, RTrim(@text , 'abc') RightTrimming;
LeftTrimming RightTrimming
abc123abc 123abc123

Previously, to obtain the same result one could write something like:

-- prior solution via Left/Right for the same (SQL Server 2000+)
DECLARE @text as nvarchar(50) = '123abc123abc'
SELECT CASE WHEN Left(@text, 3) = '123' THEN Right(@text,Len(@text)-3) ELSE @text END LeftTrimming
, CASE WHEN Right(@text, 3) = 'abc' THEN Left(@text,Len(@text)-3) ELSE @text END  RightTrimming

-- prior solution via "LIKE" for the same (SQL Server 2000+)
DECLARE @text as nvarchar(50) = '123abc123abc'
SELECT CASE WHEN @text LIKE '123%' THEN Right(@text,Len(@text)-3) ELSE @text END LeftTrimming
, CASE WHEN @text LIKE '%abc' THEN Left(@text,Len(@text)-3) ELSE @text END  RightTrimming

As can be seen, the syntax is considerably simplified. However, there are few situations when it's needed. In the past I had to write code to remove parentheses, quotes or similar characters:

-- removing parentheses
DECLARE @text as nvarchar(50) = '(testing)'
SELECT LTrim(@text , '(') LeftTrimming
, RTrim(@text , ')') RightTrimming
, RTrim(LTrim(Trim(@text), '('), ')') Trimming 

-- removing double quotes
DECLARE @text as nvarchar(50) = '"testing"'
SELECT LTrim(@text , '"') LeftTrimming
, RTrim(@text , '"') RightTrimming
, RTrim(LTrim(Trim(@text), '"'), '"') Trimming 

The Trim for the 3rd value in both queries was used to remove any leading/trailing spaces before removing the respective characters:

-- removing parentheses with leading/trailing spaces
SELECT RTrim(LTrim(Trim('   (testing)   '), '('), ')');

Then I thought, maybe I could use the same to remove the tags from an XML element. I tried the following code and unfortunately it doesn't seem to work:

-- attempting to remove the start/end tags from xml elements
DECLARE @text as nvarchar(50) = '<string>testing</string>'
SELECT LTrim(@text , '<string>') LeftTrimming
, RTrim(@text , '</string>') RightTrimming
, RTrim(LTrim(Trim(@text), '<string>'), '</string>') Trimming
LeftTrimming RightTrimming Trimming
esting</string> <string>te e

That's quite an unpleasant surprise! A look at the documentation clarifies the behavior: the second parameter is treated as a set of characters to be removed, not as an exact prefix, respectively suffix (see the example below). Instead, the variable can be declared as XML and the following code used to obtain the needed result:
-- extracting the value from a tag element
DECLARE @text XML = '<string>testing</string>'
SELECT @text.query('data(/string)') as value
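
To illustrate the set-based behavior, the order of the characters within the second parameter doesn't matter:

-- the second parameter acts as a set of characters, not as a substring
DECLARE @text as nvarchar(50) = '123abc123abc'
SELECT LTrim(@text , '123') LeftTrimming1 -- returns 'abc123abc'
, LTrim(@text , '321') LeftTrimming2 -- returns 'abc123abc' as well
, LTrim(@text , '21') LeftTrimming3 -- returns '3abc123abc'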

Happy coding!

19 October 2023

Graphical Representation: Graphics We Live By II (Discount Rates)

Graphical Representation
Graphical Representation Series

It's difficult, if not impossible, to give general rules on how data visualizations should be built. However, the data professional can use a set of principles, which are less strict than rules, and validate one's work against them. Even then, one might need to make concessions and go against the principles or common sense, though such cases should be few, at least in theory. One such important principle is reflected in Tufte's statement that "if the statistics are boring, then you've got the wrong numbers" [1].

So, the numbers we show should not be boring, but that's the case with most of the numbers we need to show, or that we consume in news and other media sources. Unfortunately, in many cases we need to go with the numbers we have and find a way to represent them, ideally by facilitating the reader to make sense of the respective data. This should be our first goal when visualizing data. Secondly, because everybody talks about insights nowadays, one should identify the opportunity for providing views into the data that go beyond the simple visualization, even if this puts more burden on the data professional's shoulders. Frankly, from making sense of a set of data to facilitating an 'Aha' moment is a long leap. Thirdly, one should find or use the proper tools and channels for communicating the findings.

A basic requirement for the data professional to be able to address these goals is to have clear definitions of the numbers, a good understanding of how the numbers reflect the reality, respectively of how the numbers can be put into the broader context. Unfortunately, all these assumptions seem to be a luxury. On the other side, the type of data we work with allows us to address at least the first goal. Upon case, our previous experience can help, though there will also be cases in which we can only try to do our best.

Let's consider a simple set of data retrieved recently from another post - Discount rates (in percentage) per State, in which the values for 5 neighboring States are considered (see the first two columns from diagram A). Without knowing the meaning of data, one could easily create a chart in Excel or any other visualization tool. Because the State has categorical values, probably some visualization tools will suggest using bar and not column charts. Either by own choice or by using the default settings, the data professional might end up with a column chart (see diagram B), which is Ok for some visualizations. 


One can start with a few related questions:
(1) Does it make sense to use a chart to represent 5 values which have small variability (the difference between the first and last value is of only 6%)? 
(2) Does it make sense to use a chart only for the sake of visualizing the data?
(3) Where is the benefit of using a chart as long as there's no information conveyed?

One can see similar examples in the media where non-aggregated values are shown in a chart just for the sake of visualizing the data. Sometimes the authors compensate for the lack of meaning with junk elements, fancy titles or other tricks. Usually, sense-making in a chart takes longer than looking at the values in a table, as there are more dimensions or elements to consider. For a table there's the title, the headers and the values, nothing more! For a chart one has in addition the axes and some visualization elements that can facilitate or complicate the visualization's decoding. Add to this that there are also many tricks to distort the data.

Tables tend to maximize the amount of digital ink used to represent the data and minimize the amount used to represent everything else not important to understanding. It's what Tufte calls the data-ink ratio (see [1]), a second important principle. This can be translated into (a) removing the border of the chart area, (b) minimizing the number of gridlines shown, (c) minimizing the number of ticks on the axis without losing information, (d) removing redundant information, (e) or information that doesn't help the reader.

However, the more data are available in the table, the more difficult it becomes to navigate the data. But again, if the chart shows the individual data without any information gained, a table might still be more effective. One shouldn't be afraid to show a table where that's the case!

(4) I have a data visualization, what's next?

Ideally, the data professional should try to obtain the maximum of effect with a minimum of elements. If this principle aims for the efficiency of design, a fourth related principle aims for the efficiency of effort - one should achieve a good enough visualization with a minimum of effort. Therefore, maybe it's enough if we settle for any of the two results above.

On the other side, maybe by investing a bit more effort certain aspects can be improved. In this area beginners start playing with the colors, formatting the different elements of the chart. Unfortunately, even if color plays a major role in the encoding and decoding of meaning, it is often misused or overused.

(5) Is there any meaning in the colors used?

In the next examples taken from the web (diagrams C and D), the author changed the color of the column with the minimal value to red to contrast it with the other values. Red is usually associated with danger, error, warning, or other similar characteristics with negative impact. The chances are high that the reader will associate the value with a negative connotation, even if red is also used for conveying important information (usually in text). Moreover, the reader will try to interpret the meaning of the other colors. In practice, the color grey has a neutral tone (and a calming effect on the mind). Therefore, it's safe to use grey in visualizations (see diagram D in contrast with diagram C). Some even advise setting grey as the default for the visualization and changing the colors as needed later.

In these charts, the author signaled in the title that red denotes the lowest value, though this only reduces the confusion. One can meet titles in which several colors are used, reminiscent of a Christmas tree. Frankly, this type of encoding is not esthetically pleasing, and it can annoy the reader.

(6) What's in a name?

The titles and, upon case, the subtitles are important elements in communicating what the data reflects. The title should in general be short and succinct in the information it conveys, having the role of introducing, respectively identifying the chart, especially when multiple charts are used. Some charts can also use a subtitle, which can be longer than the title and have more of a storytelling character by highlighting the message and/or the findings in the data. In diagrams C and D the subtitles were used as titles, which is not necessarily wrong.

In the media and presentations with influencing character, subtitles help the user understand the message or the main findings, though it's not appropriate for hardcoding the same in dynamic dashboards. Even if a logic was identified to handle the various scenarios, this shifts users' attention, and the chance is high that they'll stop further investigating the visualization. A data professional should present the facts with minimal interference in how the audience and/or users perceive the data. 

As a recommendation, one should aim for clear general titles and avoid transmitting own message in charts. As a principle this can be summarized as "aim for clarity and equidistance".

(7) What about meaning?

Until now we barely considered the meaning of data. Unfortunately, there's no information about what the Discount rate means. It could be "the minimum interest rate set by the US Federal Reserve (and some other national banks) for lending to other banks" or "a rate used for discounting bills of exchange", to use the definitions given by the Oxford dictionary. Searching on the web, the results lead to discount rates for royalty savings, resident tuitions, or retail for discount transactions. Most probably the Discount rates from the data set refer to the latter.

We need a definition of the Discount rate to understand what the values represent when they are ordered. For example, when Texas has a value of 25% (see B), does this value have a negative or a positive impact when compared with other values? It depends on how it's used in the associated formula. The last two charts consider that the minimum value has a negative impact, though without more information the encoding might be wrong! 

Important formulas and definitions should be considered as side information in the visualization, accompanying text or documentation! If further resources are required for understanding the data, then links to the required resources should be provided as well. At least this assures that the reader can acquire the right information without major overhead. 

(8) What do readers look for? 

Frankly, this should have been the first question! Readers have different expectations from data visualizations. First of all, it's the curiosity - how the data look in raw and/or aggregated form, or, in more advanced form, how they are shaped (e.g. statistical characteristics like dispersion, variance, outliers). Secondly, readers look in the first phase to understand mainly whether the "results" are good or bad, even if there are many shades of grey in between. Further on, a distinction must be made between readers who want to learn more about the data, models, and processes behind, respectively readers who just want a confirmation of their expectations, opinions and beliefs (aka bias). And, in the end, there are also people who are not interested in the data and what it tells, for whom the title and/or subtitle provide enough information.

Besides this there are further categories of readers segmented by their role in the decision making, the planning and execution of operational, tactical, or strategic activities. Each of these categories has different needs. However, this exceeds the scope of our analysis. 

Returning to our example, one can expect that the average reader will try to identify the smallest and highest Discount rates from the data set, respectively try to compare the values between the different States. Sorting the data and having the values close to each other facilitates the comparison and ranking; otherwise the reader needs to do this by himself/herself. This latter aspect, and the fact that bar charts handle the display of categorical data better, length being easy to compare, makes bar charts the tool of choice (see diagram E). So, whenever you see categorical data, consider using a bar chart!

Despite sorting the data, the reader might still need to subtract the various values to identify and compare the differences. The higher the differences between the values, the more complex these operations become. Diagram F is supposed to help in this area, the comparison to the minimal value being shown in orange. Unfortunately, small variances make numbers' display more challenging especially when the visualization tools don't offer display alternatives.

To show the data from diagram F, the third and fourth columns were added to the table (see diagram A). There's a fifth column which designates a percentage of a percentage (the percentage increase between the current and the minimal value). Even if that's mathematically possible, the gain from using such data is negligible and can create confusion. This opens the door for another principle that applies in other areas as well: "just because you can, it doesn't mean you should!". One should weigh design decisions against common sense or one's intuition on how something can be (mis)used and/or (mis)understood!
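
If the data were to be prepared with SQL, the columns behind diagrams E and F could be derived with a window function. A minimal sketch, the Discount rates below being hypothetical values chosen only for illustration:

-- deriving the difference to the minimal value (hypothetical Discount rates)
SELECT DAT.State
, DAT.DiscountRate
, DAT.DiscountRate - Min(DAT.DiscountRate) OVER() DifferenceToMinimum
FROM (VALUES
   ('Texas', 25)
 , ('Louisiana', 27)
 , ('Arkansas', 28)
 , ('Oklahoma', 30)
 , ('New Mexico', 31)
 ) DAT(State, DiscountRate)
ORDER BY DAT.DiscountRate DESC;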

The downside of diagram F is that the comparisons are made only in relation to the minimum value. The variations are small and still allow further comparisons. The higher the differences, the more challenging it becomes to make further comparisons. A matrix display (see diagram G), which compares any two values, will help if the number of points is manageable. The numbers situated on and above the main diagonal (the upper side) were grayed (and can be removed) because they are either nonmeaningful, or the negatives of the numbers found below the diagonal. Such diagrams are seldom used, though upon case they prove to be useful.

Choropleth maps (diagram H) are met almost everywhere data have a geographical dimension. Like all the other visuals they have their own advantages (e.g. relative location on the map) and disadvantages (e.g. encoding or displaying data). The diagram shows only the regions with data (remember the data-ink ratio principle).


(9) How about the shape of data?

When dealing with numerical data series, it's useful to show aggregated summaries like the average, quartiles, or standard deviation to understand how the data are shaped. Such summaries don't really make sense for our data set given the nature of the numbers (five values with small variance). One can still calculate them and show them in a box plot, though the benefit is negligible.

(10) Which chart should be used?

As mentioned above, each chart has advantages and disadvantages. Given the simplicity and the number of data points, any of the above diagrams will do. A table is simple enough despite not using any visualization effects. Also, the bar charts are simple enough to use, with a plus maybe for diagram F which shows a further dimension of the data. The choropleth map adds the geographical dimension, which could be important for some readers. The matrix table is more appropriate for technical readers and involves more effort to understand, at least at first sight, though the learning curve is small. The column charts were considered only for exemplification purposes, though they might work as well. 

In the end one should go with one's own experience and consider the audience and the communication channels used. One can also choose 2 different diagrams, especially when they are complementary and offer an additional dimension (e.g. diagrams F and H), though the context may dictate whether their use is appropriate or not. The diagrams should be simple to read and understand, but this doesn't mean that one should stick to the standard visuals. The data professional should explore other means of representing the data, a fresh view having the chance of catching the reader's attention.

As a closing remark, nowadays data visualization tools allow building such diagrams without much effort. Conversely, it takes more effort to go beyond the basic functionality and provide more value for oneself and the readers. One should be able to evaluate upfront how much time it makes sense to invest. Hopefully, the few methods, principles and recommendations presented here will help further!

Previous Post <<||>> Next Post

Resources:
[1] Edward R Tufte (1983) "The Visual Display of Quantitative Information"

14 October 2023

Azure Tools & Services: The Azure Diagrams Architecture Advisor (Test Drive)

There seems to be a new tool available for the Azure community, even if the tool doesn't belong yet to the Microsoft technology stack. I'm talking about Azure Diagrams Architecture Advisor, which enables users to "design diagrams in a collaborative manner and provides guidance on which services can be integrated and when they should be utilized in your architecture". 

I took the tool for a test drive, trying to adapt an architecture I used recently for a Synapse Data Lakehouse. I used the Data Warehouse using Synapse Analytics example as source of inspiration. (You can use the links to check the latest versions.) 

Overall, it was a good experience, the functionality being straightforward to use. I needed some time to identify what resources are available, and there are a few. Even if some of the resources were missing (e.g. Dataverse and Dynamics 365 services), I could use a custom resource to fill the gap. 

One can easily drag and drop resources (see the left panel) on the canvas and create links between them. The diagram can be saved as soon as you give it a name and sign up for an account; an email address and a password are all you need for that. The diagram can be kept hidden or made available for the whole community, respectively can be saved locally as PNG or SVG files.

Azure Diagrams
Serverless Data Lakehouse

Each standard (non-custom) resource has a link to documentation, respectively to the pricing. The arrows between resources show whether the integration is supported by default. One can create bidirectional arrows by creating a second arrow from destination to source. 

The functionality is pretty basic, though I was able to generate the diagram I wanted, and I'm satisfied with the result. Unfortunately, I couldn't modify directly the example I used for inspiration. It would be great if further functionality like the one available in Visio Online could be made available (e.g. automatic alignment, copy pasting whole diagrams or parts of them, making each value editable, grouping several resources together, adding more graphical content, etc.). 

The resources and icons available focus on the Microsoft Azure technology stack. It would be great if one could import at least new icons, though ideally further resources from other important software vendors should be supported (e.g. AWS, Oracle, Adobe, etc.).

The possibility to define further links (e.g. to the actual environments) or properties (e.g. environment names), respectively to change the views for the whole diagram could enhance the usability of such diagrams. 

Microsoft should think seriously about including this tool in their stack, if not in Visio Online at least as standalone product!

Besides the examples and the shared diagrams available you can browse also through the collection of Azure Architectures.

To summarize, here are the improvements suggested:
- save and modify directly the examples;
- align resources automatically within the canvas;
- copy-paste whole diagrams or parts of them;
- make each value editable;
- group several resources together (e.g. in a layer);
- add more graphical content;
- import new icons;
- support resources from other important software vendors (e.g. AWS, Oracle, Adobe, etc.);
- define further links (e.g. to the actual environments);
- define further properties (e.g. environment names);
- change the views for the whole diagram;
- integrate into Visio Online.

Graphical Representation: On Insights II (The Complexity Perspective)

Graphical Representation
Graphical Representation Series

Scientists attempt to discover laws and principles, and for this they conduct experiments and build theories and models rooted in the data they collect. In the business setup, data professionals analyze the data for identifying patterns, trends, outliers or anything else that can lead to new information or knowledge. On one side, scientists choose the boundaries of the systems they study, while for data professionals, even if the systems are usually given, similar choices can be made.

In theory, scientists are more flexible in what data they collect, though they might have constraints imposed by the boundaries of their experiments and the tools they use. For data professionals most of the data they need is already there, in the systems the business uses, though the constraints reside in the intrinsic and extrinsic quality of the data, whether the data are fit for the purpose. Both parties need to work around limitations, or attempt to improve the experiments, respectively the systems. 

Even if the data might have different characteristics, this doesn't mean that the methods applied by data professionals can't be used by scientists and vice-versa. The closer data professionals move from Data Analytics to Data Science, the higher the overlap between the business and scientific setup. 

Conversely, the problems data professionals meet have different characteristics. Scientists' outlook is directed mainly at the phenomena and processes occurring in nature and society, where randomness, emergence and chaos seem to feel at home. Business processes deal more with predefined controlled structures, cyclicity, higher dependency between processes, feedback and delays. Even if the problems may seem to be different, they can be modeled with system dynamics.

Returning to data visualization and the problem of insight, there are multiple questions. Can we use simple designs or characterizations to find the answer to complex problems? Which must be the characteristics of a piece of information or knowledge to generate insight? How can a simple visualization generate an insight moment? 

Appealing to complexity theory, there are several general approaches to handling complexity. One approach resides in answering complexity with complexity. This means building complex data visualizations that attempt to model the problem's complexity. For example, this could be done by building a complex model that reflects the problem studied, and building a set of complex visualizations that reflect the different important facets. Many data professionals advise against this approach as it goes against the simplicity principle. On the other hand, starting with something complex and removing the nonessential can prove to be an approachable strategy, even if it involves more effort.

Another approach resides in reducing the complexity of the problem either by relaxing the constraints, or by breaking the problem into simple problems and addressing each one of them with visualizations. Relaxing the constraints allows studying, upon case, a more general problem or a linearization of the initial problem. Breaking down the problem into problems that can be solved more easily can help to better understand the general problem, though we might lose sight of emergence and other behaviors that characterize complex systems.

Providing simple visualizations for complex problems implies a good understanding of the problem, its solution(s) and the overall context, which frankly is harder to achieve the more complex a problem is. To be understood, a problem requires a minimum of knowledge that needs to be reflected in the visualization(s). Even if some important aspects are assumed as known, they still need to be confirmed by the visualizations, otherwise any deviation from the assumptions can lead to a new problem. Therefore, it's questionable whether simple visualizations can address the complexity of problems in a general manner.

Previous Post <<||>> Next Post 

Graphical Representation: On Insights I ('Aha' Moments)

Graphical Representation
Graphical Representation Series

On one side, scientists talk about 'Insight' with a sign of reverence when referring to the processes, patterns, models, metaphors, stories and paradigms used to generate and communicate insight. Conversely, data professionals seem to regard 'Insight' as something trivial, achievable just by picking and combining the right visualizations and storytelling. Are the scientists exaggerating when talking about insight, or do the data professionals downplay the meaning and role of insight? Or maybe the scientific and business contexts have incomparable complexity, even if the same knowledge toolset is used?

One probably can't deny the potential of tools or toolsets like data visualization or data storytelling in providing new information or knowledge that leads to insights, though between potential usefulness and harnessing that potential on a general basis there's a huge difference, no matter how much people tend to idealize the process (and there's a lot of idealization going on). Moreover, sometimes the whole process seems to look like a black box in which some magic happens and insight emerges.

It's challenging to explain the gap as long as there's no generally accepted scientific definition of insights, respectively an explanation of how insights come into being. Probably, the easiest way to recognize their occurrence is when an 'Aha' moment appears, though that's the outcome of a process and gives almost no information about the process itself. Thus, insight occurs when knowledge about the business is acquired, knowledge that allows new or better understanding of the data, facts, processes or models involved. 

So, there must be new associations that are formed, either derived directly from data or brought to surface by the storytelling process. The latter aspect implies that the storyteller is already in possession of the respective insight(s) or facilitates their discovery without being aware of them. It further implies that the storyteller has a broader understanding of the business than the audience, which is seldom the case, or that the storyteller has a broader understanding of the data and the information extracted from the data, and that's a reasonable expectation.

There are two important restrictions. First, the insight moments must be associated with the business context rather than with the mere use of tools! Secondly, there should be genuine knowledge, not knowledge that the average person should know, respectively the mere confirmation of expectations or bias.

Understanding can be put in the context of decision making, respectively in the broader context of problem solving. In the latter, insight involves the transition from not knowing how to solve a problem to the state of knowing how to solve it. So, this could apply in the context of data visualization as well, though there might exist intermediary steps in between. For example, in a first step insights enable us to understand and define the right problem. A further step might involve the recognition of the fact that the problem belongs to a broader set of problems that have certain characteristics. Thus, the process might involve a succession of 'Aha' moments. Given the complexity of the problems we deal with in business or social contexts, that's more likely to happen. So, the average person might need several 'Aha' moments - leaps in understanding - before the data can make a difference!

Conversely, new knowledge and understanding obtained over successive steps might not lead to an 'Aha' moment at all. Whether such moments converge or not to 'Aha' moments may rely on the importance of the overall leap, though other factors might be involved as well. In the end, the emergence of new understanding is enough to explain what insights mean. Whether that's enough is a different discussion!

Previous Post <<||>> Next Post 

10 October 2023

Base Enums Metadata in Dynamics 365 for Finance and Operations

Lists of values that don't have their own tables are managed within the application as (Base) Enums and, naturally, only the numeric values are saved to the database. Even if this is practical for the application, it's a nightmare for the people using the data exported from the database, as the codes need to be converted to meaningful values.

Fortunately, there's a way to go around this limitation. First, there's the old obsolete AX 2012 documentation for Base Enums, which documents the Names and Descriptions for the Enum Values (e.g., search for CustVendNegInstStatus). As Microsoft stopped maintaining the documentation, the changes made in D365 are not reflected.

Secondly, the mapping between Values and Names is stored in two of the tables available in D365, and the following query allows retrieving the mapping:

-- Metadata - Base Enum Values 
SELECT EIT.Id 
, EIT.Name EnumName
, EVT.EnumValue EnumValueId
, EVT.Name EnumValueName 
, ' WHEN ' + Cast(EVT.EnumValue as nvarchar(10)) + ' THEN ''' + EVT.Name + '''' [CASE Statement]
FROM dbo.EnumValueTable EVT
     JOIN dbo.EnumIdTable EIT
       ON EVT.EnumId = EIT.Id
WHERE EIT.Name LIKE 'CustVendNegInstStatus%'
--WHERE EVT.Name LIKE '%Redrawn%'
ORDER BY EIT.Name 
, EVT.EnumValue 

The [CASE Statement] column already generates the branches for a CASE expression. One can search for the Base Enum, respectively for the value itself.

Unfortunately, the metadata are not complete and, as there's no timestamp, it's not possible to identify what and when changed. Therefore, one needs to regularly check the changes manually, or store a copy of the respective data as baseline, thus being able to compare the latest metadata against the baseline. The solution is not ideal, but it does the trick.
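
A minimal sketch for such a baseline, assuming a writable database (e.g. the DWH) and a placeholder name for the baseline table - the latest metadata can then be compared against it via EXCEPT:

-- storing a baseline of the Base Enum metadata (the target table is a placeholder)
SELECT EIT.Name EnumName
, EVT.EnumValue EnumValueId
, EVT.Name EnumValueName
INTO dbo.EnumValuesBaseline
FROM dbo.EnumValueTable EVT
     JOIN dbo.EnumIdTable EIT
       ON EVT.EnumId = EIT.Id;

-- values added or changed since the baseline was taken
SELECT EIT.Name EnumName
, EVT.EnumValue EnumValueId
, EVT.Name EnumValueName
FROM dbo.EnumValueTable EVT
     JOIN dbo.EnumIdTable EIT
       ON EVT.EnumId = EIT.Id
EXCEPT
SELECT EnumName
, EnumValueId
, EnumValueName
FROM dbo.EnumValuesBaseline;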

To identify the Base Enum corresponding to a field one can export the script behind a form when the functionality is available. Otherwise, one can use the Table Browser and generate the script from the corresponding source table. 

Occasionally, the value stored in the above tables may deviate slightly from the expectation. It's useful then to search only for a substring, use both sources and even search on the internet for more information.

When building reports or a semantic layer on top of the D365 data model, such mappings need to be retrieved for almost all important tables. It's a good idea to encapsulate the mappings in single "base views" and reuse them when needed, thus avoiding replicating the mappings for each use of the source table (see the logic for TDM.vEcoResProduct).

The data generated by the above script could be dumped to a table and the table joined in the queries when the values for a Base Enum are needed, though I don't recommend such an approach because it would increase the complexity of the queries, which are already complex in certain scenarios. Also, the misuse of views for creating pseudoconstants is not recommended.

Happy coding!

Data Warehousing and Dynamics 365 for Finance and Operations - A Few Issues to Consider I

Data Warehousing
Data Warehousing Series

Introduction

Besides the fact that data professionals don't have direct access to D365 F&O production environments (direct access is available only to sandboxes), which was from the beginning an important constraint imposed by the architecture, there are a few more challenges that need to be addressed when working with the data.

Case Sensitiveness

SQL Server is not case sensitive by default; therefore, depending on the channel through which the data came, values appear either in upper or lower case, respectively a mixture of both. Even if this isn't an issue in D365, it can become an issue when the data leave the environment. E.g., PowerQuery is case sensitive (while DAX is case insensitive), thus, if a field containing a mix of values participates in a join or aggregation, this will result in unexpected behavior (e.g., duplicates, records ignored). It's primarily the case of the Company (aka DataAreaId) field available in most of the important tables.

The ideal solution would be to make sure that the values are correct by design, however this can't always be enforced. Otherwise, when using the data outside of D365 F&O, the solution would be to transform all the values into upper case (or lower case). However, even this step might occur too late, e.g. when the data are exported to the Azure Data Lake in parquet file format.
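
A minimal sketch for normalizing the company code when exposing the data, the view name and the selected columns being only placeholders:

-- normalizing the company code in a base view (minimal sketch)
CREATE OR ALTER VIEW EDM.vSalesOrders
AS
SELECT Upper(SOH.DataAreaId) Company
, SOH.SalesId
, SOH.RecId
, SOH.Partition
FROM dbo.SalesTable SOH;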

Unique Keys

A unique record in D365 F&O was in earlier versions usually identified by the RecId and DataAreaId, while later the Partition field was added. This means that most of the joins will need to consider all 3 columns, which adds some overhead. In some environments there's only one Partition defined (and thus the field can be ignored), however this is not a guarantee.

As long as developers use SQL, there's no issue with using multiple fields in JOINs, though in PowerQuery a unique key based on the respective fields must be created so that the JOINs are possible. Actually, SQL-based JOINs would also benefit if each record were identified by one field.
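
A minimal sketch - joining on the full key in SQL, respectively deriving a single-field key that could be reused for the joins in PowerQuery (the tables used and the separator are only placeholders):

-- joining on the full key, respectively deriving a single-field key
SELECT SOH.SalesId
, Cast(SOH.RecId as nvarchar(20)) + '|' + Cast(SOH.Partition as nvarchar(20)) + '|' + SOH.DataAreaId UniqueKey
FROM dbo.SalesTable SOH
     JOIN dbo.SalesLine SOL
       ON SOH.SalesId = SOL.SalesId
      AND SOH.DataAreaId = SOL.DataAreaId
      AND SOH.Partition = SOL.Partition;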

Audit Metadata

Not all tables have fields that designate the date when a record was created or last modified, respectively the user who performed the respective action. The fields can be added manually when setting up the system, however that's seldom done. This makes it difficult to audit the records, and sometimes it's also a challenge for reporting, respectively for troubleshooting the differences between the DWH and the source system. Fortunately, the Export to Data Lake adds a timestamp reflecting the time when the record was synchronized, though it can then be used only for the records synchronized after the first load.

Tables vs. Entities

Data are modified in D365 F&O via a collection of entities, which are nothing but views that encapsulate the business logic, being based on the base tables or other views, respectively a combination of both. The Export to Data Lake (*) is based on the tables, while Link to Data Lake is based on data entities. 

Using the base tables means that the developer must reengineer the logic from the views. For some cases it might work to create the entities as views in the DWH environment, though some features might not be supported. It's the case of the serverless and dedicated SQL pools, which support only a subset of the features available under standard Azure SQL Server.

The developer can try to replicate the logic from the entities, considering only the logic needed by the business, especially when only a subset of the functionality available in the entity was used. The newly created views can thus become more readable and maintainable. On the other side, if the logic in the entity changes, the changes need to be reflected in the DWH views as well.

Using the entity-based data makes sure that the data are consistent between environments. Unfortunately, Microsoft found out that it isn't so easy to synchronize the data at entity level. Moreover, there are multiple entities based on the same table that reflect only a subset of the columns or rows. Thus, to cover all the fields from a base table, one might be forced to synchronize multiple views, which leads to data duplication.

In theory, both mechanisms can be used within the same environment, even if this approach goes against the single source of truth principle when data are duplicated.

Data Validation in the Data Lake

One scenario in which both sources are useful is when validating whether the synchronization mechanism worked as expected. Thus, one can compare the number of records and check whether there are differences that can't be mitigated. However, does it make sense to "duplicate" database objects only for this purpose?

Ideally, validating whether a record was synchronized should be done in the source environment (e.g. via a timestamp). That's difficult to achieve, especially when there's no direct access to the source database (as is the case for Production databases). Fortunately, Dataverse provides this functionality, even if it might not be bulletproof.

In extremis, the most reliable approach is to copy the production environment to a sandbox and do a count of records for each table, using the time when the refresh occurred as baseline for comparison.
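
A minimal sketch for such a count, assuming that the ModifiedDateTime audit field is available on the respective tables and using the refresh time as cutoff:

-- record counts per table as of a given cutoff (minimal sketch)
SELECT 'SalesTable' TableName
, count(*) NumberRecords
FROM dbo.SalesTable
WHERE ModifiedDateTime <= '2023-10-10 06:00'
UNION ALL
SELECT 'SalesLine' TableName
, count(*) NumberRecords
FROM dbo.SalesLine
WHERE ModifiedDateTime <= '2023-10-10 06:00';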

Base Enum Values

Lists of values that don't have their own tables are managed within the application as Base Enums and, naturally, only the numeric values are saved to the database. Even if this is practical for the application, it's a nightmare for the people using the data exported from the database, as the codes need to be converted to meaningful values. Some of the mappings between the codes and values are documented in two system tables, and even in the old language-based documentation, though both sources are far from complete. As an alternative, one can try to discover the values in the system.

Unfortunately, the mappings need to be repeated when the Enum-based attribute is used in multiple places. One can reduce the mappings' duplication by encapsulating the logic into a view (aka "base view") and reusing it accordingly (see the logic for TDM.vEcoResProduct).

Even if the values for many of the Enums are stored in the EnumValueTable table, the Enum's name being available in the EnumIdTable table, it's not a good idea to retrieve the values via a JOIN in the business logic. This would complicate the business logic unnecessarily. A CASE is more efficient, even if occasionally more difficult to maintain. Unfortunately, there's no timestamp to identify which values were added recently.

Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data Lake to Synapse Link. Microsoft advises new customers to use Synapse Link.

SSRS (& Paginated Reports): Products Master Report in Dynamics 365 for Finance & Operations via Base Tables

As mentioned in the previous post, building the Products Master paginated report via the base D365 FO tables synchronized in the Data Lake involves more effort and implies duplicating the logic from the data entity(-ies) used. One can attempt recreating the entities 1:1 as they are defined in the D365 database (an Azure SQL Server database), however the chances are high that in certain areas (e.g. financial dimensions) the features used in building the logic are not supported. Moreover, it might be the case that only a subset of the functionality was implemented, and with this, pieces of logic become obsolete.

There's also a limitation on the number of tables which can be synchronized, and therefore one needs to compromise. In the dbo.EcoResProductV2Entity there are about 20 tables involved, though probably only half of them are used. The base table dbo.EcoResProduct contains most of the fields available also in the data entity, though the names for the various attributes need to be retrieved from the various tables. 

Some of the issues met in working with data entities appear in this scenario as well: there are fields based on Enums that store only the code and the corresponding values need to be mapped, respectively further transformations are needed (e.g., converting fields to other formats). It makes sense to encapsulate the logic similarly in views, though this time one or two additional layers of logic are needed. One can choose to do this for each table in scope, or replicate the data entity's logic by focusing only on the needed parts. Both approaches will prove to be challenging in certain scenarios. Let's consider the latter:

CREATE OR ALTER VIEW TDM.vEcoResProductV2Entity
AS 
/*
name:  Products - Master (base view)
created: 01.04.2021
modified: 01.04.2021
*/
SELECT ITM.ProductType
, CASE ITM.ProductType
    WHEN 1 THEN 'Item'
	WHEN 2 THEN 'Service'
  END ProductTypeName
, CAST ((CASE WHEN ITM.InstanceRelationType = 13678 THEN 1 ELSE 2 END) AS INT) ProductSubtype
, CASE 
    WHEN ITM.InstanceRelationType = 13678 THEN 'Product'
	ELSE 'Product Master'
   END ProductSubtypeName
, ITM.DisplayProductNumber ProductNumber
, Replace(Replace(ILT.Name, char(10), ' '), char(13), ' ')  ProductName
, Replace(Replace(ILT.Description, char(10), ' '), char(13), ' ') ProductDescription
, PCA.CategoryName RetailProductCategoryName 
, PCA.CategoryCode RetailProductCategoryCode
, PDG.Name ProductDimensionGroupName 
, SDG.Name StorageDimensionGroupName 
, TDG.Name TrackingDimensionGroupName 
, ITM.RetailColorGroupId ProductColorGroupId 
, ITM.RetailSizeGroupId ProductSizeGroupId 
, ITM.RetailStyleGroupId ProductStyleGroupId
, ITM.VariantConfigurationTechnology
, CASE ITM.VariantConfigurationTechnology
	WHEN 0 THEN 'None'
	WHEN 1 THEN 'Predefined Variants'
    WHEN 2 THEN 'Dimension Based'
	WHEN 3 THEN 'RuleBased'
	WHEN 4 THEN 'Constraint Based'
  END VariantConfigurationTechnologyName
, CASE WHEN KIT.ProductMaster IS NOT NULL THEN 1 ELSE 0 END IsProductKit
, CASE WHEN KIT.ProductMaster IS NOT NULL  THEN 'Yes' ELSE 'No' END IsProductKitName
, ITM.PDSCWProduct IsCatchWeightproduct
, CASE ITM.PDSCWProduct WHEN 1 THEN 'Yes' ELSE 'No' END IsCatchWeightproductName
, ITM.IsProductVariantUnitConversionEnabled
, CASE ITM.IsProductVariantUnitConversionEnabled WHEN 1 THEN 'Yes' ELSE 'No' END IsProductVariantUnitConversionEnabledName
-- system
, PPD.ProductDimensionGroup ProductDimensionGroupRecId
, PSG.StorageDimensionGroup StorageDimensionGroupRecId
, PTD.TrackingDimensionGroup TrackingDimensionGroupRecId
, PCA.RetailCategoryRecId
, ITM.RecId
, ITM.Partition
FROM dbo.EcoResProduct ITM
     LEFT JOIN dbo.EcoResProductTranslation ILT
	   ON ITM.RecId = ILT.Product 
	  AND ITM.Partition = ILT.Partition 
	  AND ILT.LanguageId = 'en-us'
     LEFT JOIN dbo.EcoResProductDimensionGroupProduct PPD
	   ON ITM.RecId = PPD.Product
	  AND ITM.Partition = PPD.Partition
	      LEFT JOIN dbo.EcoResProductDimensionGroup PDG
		    ON PPD.ProductDimensionGroup = PDG.RecId 
		   AND PPD.Partition = PDG.Partition 
	 LEFT JOIN dbo.EcoResStorageDimensionGroupProduct PSG
	   ON ITM.RecId = PSG.Product
	  AND ITM.Partition = PSG.Partition
	      LEFT JOIN dbo.EcoResStorageDimensionGroup SDG
		    ON PSG.StorageDimensionGroup = SDG.RecId 
		   AND PSG.Partition = SDG.Partition 
	 LEFT JOIN dbo.EcoResTrackingDimensionGroupProduct PTD
	   ON ITM.RecId = PTD.Product
	  AND ITM.Partition = PTD.Partition
	      LEFT JOIN dbo.EcoResTrackingDimensionGroup TDG
		    ON PTD.TrackingDimensionGroup = TDG.RecId 
		   AND PTD.Partition = TDG.Partition 
	 LEFT JOIN (-- product retail category 
	   SELECT PCA.Product 
	   , PCA.Partition 
	   , CAT.Code CategoryCode 
	   , CAT.Name CategoryName
	   , PCA.Category RetailCategoryRecId
	   FROM dbo.EcoResProductCategory PCA
	        JOIN dbo.EcoResProductRetailCategoryHierarchy RCH
			  ON PCA.CategoryHierarchy = RCH.RetailCategoryHierarchy
			 AND PCA.Product = RCH.Product 
			 AND PCA.Partition = RCH.Partition 
			JOIN dbo.EcoResCategory CAT
			  ON PCA.Category = CAT.RecId 
			 AND PCA.Partition = CAT.Partition 
		 ) PCA
	   ON ITM.RecId = PCA.Product 
	  AND ITM.Partition = PCA.Partition 
	  LEFT JOIN dbo.RetailKit KIT
	    ON ITM.RecId = KIT.ProductMaster 
	   AND ITM.Partition = KIT.Partition
WHERE NOT(ITM.InstanceRelationType  =  4211)

As can be seen, the logic is quite complex and only half of the tables were used. There will be entities even more complex than this (the query was restructured for understandability). On the other hand, there will be environments where only half of the above tables will be used (e.g., when no inventory and/or dimensions are needed).

To test the view, one just needs to change the schema referenced in the logic. The view is created under the TDM (Table Data Model) schema, so there should be no issues as long as the schemas are used correctly. However, when duplicating logic, one should check whether the values match, respectively whether the objects have the same record count:

-- checking whether the values match for an example
SELECT * 
FROM TDM.vEcoResProductV2Entity
WHERE ProductNumber = '0169'

SELECT * 
FROM EDM.vEcoResProductV2Entity
WHERE ProductNumber = '0169'


-- checking the number of records
SELECT count(*)
FROM TDM.vEcoResProductV2Entity

SELECT count(*)
FROM EDM.vEcoResProductV2Entity

There are several optimizations or restructuring opportunities in the logic, e.g., the 'product retail category' subquery can be created as a separate view. Also, the groupings of two tables for the 'dimension groups' could be encapsulated individually in views. This can increase the number of views created considerably, though upon case the views could prove to be useful for troubleshooting or reuse. 
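
For example, the 'product retail category' subquery from above could be encapsulated as follows (a minimal sketch based on the same logic):

-- the 'product retail category' subquery encapsulated as a base view
CREATE OR ALTER VIEW TDM.vEcoResProductRetailCategory
AS
SELECT PCA.Product 
, PCA.Partition 
, CAT.Code CategoryCode 
, CAT.Name CategoryName
, PCA.Category RetailCategoryRecId
FROM dbo.EcoResProductCategory PCA
     JOIN dbo.EcoResProductRetailCategoryHierarchy RCH
       ON PCA.CategoryHierarchy = RCH.RetailCategoryHierarchy
      AND PCA.Product = RCH.Product 
      AND PCA.Partition = RCH.Partition 
     JOIN dbo.EcoResCategory CAT
       ON PCA.Category = CAT.RecId 
      AND PCA.Partition = CAT.Partition;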

Moreover, in environments with only one partition, the constraints on the respective field could be removed (an important decision though), while the value for the retail category could be hardcoded. Even if hardcoding values should in general be avoided, one needs to compromise when there are more important constraints.

One way to reduce logic's complexity is to create a "base view" for each table in which the needed transformations are made, respectively only the needed columns are used. Reducing the number of columns simplifies the identification of attributes. For example, the base view for dbo.EcoResProduct could be written as follows:

CREATE OR ALTER VIEW [TDM].[vEcoResProduct]
AS
/*
name: Products - master data (base view)
created: 01.04.2021
modified: 01.04.2021
*/
SELECT PRD.DisplayProductNumber ProductNumber
, PRD.ProductType 
, CASE PRD.ProductType
    WHEN 1 THEN 'Product'
    WHEN 2 THEN 'ProductMaster'
    WHEN 3 THEN 'ProductVariant'
  END ProductTypeName
, CASE WHEN PRD.INSTANCERELATIONTYPE = 15969 THEN 1 ELSE 2 END ProductSubtype 
, CASE 
    WHEN PRD.INSTANCERELATIONTYPE = 15969 THEN 'Item'
    ELSE 'Product master'  
   END ProductSubtypeName
, PRD.RetailSizeGroupId
, PRD.RetailColorGroupId
, PRD.RetailStyleGroupId
, PRD.RetailFlavorGroupId
, PRD.VariantConfigurationTechnology 
, CASE PRD.VariantConfigurationTechnology
    WHEN 0 THEN 'None'
    WHEN 1 THEN 'PredefinedVariants'
    WHEN 2 THEN 'DimensionBased'
    WHEN 3 THEN 'RuleBased'
    WHEN 4 THEN 'ConstraintBased'
  END VariantConfigurationTechnologyName
, PRD.IsProductVariantUnitConversionEnabled
, CASE PRD.IsProductVariantUnitConversionEnabled WHEN 1 THEN 'Yes' ELSE 'No' END IsProductVariantUnitConversionEnabledName
-- system
, PRD.RecId 
, PRD.Partition 
, PRD.ModifiedBy
FROM dbo.EcoResProduct PRD
WHERE NOT(PRD.InstanceRelationType  =  4211)
GO

A similar approach can be used for each table, or at least the tables that need further transformations. There are several choices - ideally one should find a good-enough approach and stick to it. There will be also exceptions, though the general design should prevail!

The bottom line: when using the Export to Data Lake add-in (*), as only the export of tables is supported, the logic can occasionally become complex, though still manageable (in other areas the tables are straightforward to use). When using Link to Data Lake, one can rely on data entities, as long as they are synchronized correctly. However, one still needs to rely on tables to fill the gaps.

Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data Lake to Synapse Link. Microsoft advises new customers to use Synapse Link.

Happy coding!

Previous Post <<||>> Next Post

