Graphical Representation Series |
A Software Engineer and data professional's blog on SQL, data, databases, data architectures, data management, programming, Software Engineering, Project Management, ERP implementation and other IT related topics.
Pages
- 🏠Home
- 🗃️Posts
- 🗃️Definitions
- 🏭Fabric
- ⚡Power BI
- 🔢SQL Server
- 📚Data
- 📚Engineering
- 📚Management
- 📚SQL Server
- 📚Systems Thinking
- ✂...Quotes
- 🧾D365: GL
- 💸D365: AP
- 💰D365: AR
- 👥D365: HR
- ⛓️D365: SCM
- 🔤Acronyms
- 🪢Experts
- 🗃️Quotes
- 🔠Dataviz
- 🔠D365
- 🔠Fabric
- 🔠Engineering
- 🔠Management
- 🔡Glossary
- 🌐Resources
- 🏺Dataviz
- 🗺️Social
- 📅Events
- ℹ️ About
21 October 2023
📊Graphical Representation: Overreaching in Data Visualizations
🧊💫Data Warehousing: Architecture (Part VI: Building a Data Lakehouse for Dynamics 365 Environments with Serverless SQL Pool)
Data Warehousing Series |
One of the major limitations of Microsoft Dynamics 365 is the lack of direct access to the production databases for reporting purposes via standard reporting or ETL/ELT tools. Of course, one can attempt to use OData-based solutions though they don't scale with the data volume and imply further challenges.
At the beginning, Microsoft attempted to address this limitation by allowing the export of data entities into customer's Azure SQL database, feature known as bring your own database (BYOD). Highly dependent on batch jobs, the feature doesn't support real-time synchronization and composite entities, and is dependent on the BYOD's database capacity, the scalability after a certain point becoming a bottleneck.
Then Microsoft started to work on two solutions for synchronizing the Dynamics 365 data in near-real time (cca. 10-30 minutes) to the Data Lake: the Export to Data Lake add-in (*), respectively the Azure Synapse Link for Dataverse with Azure Data Lake. The former allows the synchronization of the tables from Finance & Operations (doesn't work for CRM) to files that reflect the database model from the source. In exchange, the latter allows the synchronization of data entities to similar structures, and probably will support tables as well. Because the service works via Dataverse it supports also the synchronization of CRM data.
The below diagram depicts the flow of data from the D365 environments to the Data Lake, the arrow indicating the direction of the flow. One arrow could be drawn between Dynamics 365 Finance & Operations and the Azure Link for Datavetse service, though one may choose to use only the Export to Data Lake add-in given that a data model based on the tables offers more flexibility (in the detriment of effort though). Data from other systems can be exported via pipelines to the Data Lake to provide an integrated and complete view of the business.
Serverless Data Lakehouse |
Once the data available in the Delta Lake, it can be consumed directly by standard and Power BI paginated reports, however building the data model will involve considerable effort and logic duplication between reports. Therefore, it makes sense to prepare the data upfront in the Data Lake, when possible, and here the serverless SQL pool can help building an enterprise data model. The former approach can still be used for rapid prototyping or data discovery.
The serverless SQL Server pool is a stateless SQL-based distributed data processing query service over Azure data lake for large-scale data and computational functions. Even if it doesn't support standard tables, it allows to make the data from the Data Lakes files available for processing via external tables, a mechanism that maps files' structure to an entity that can be queried like a normal view (though it supports only read operations).
Further on, the enterprise data model can be built like in a normal Data Warehouse via the supported objects
If further performance is needed, parts of the logic or the end-result can be exported to the Data Lake and here the Medallion Architecture should be considered when appropriate. Upon case, further processing might be needed to handle the limitations of the serverless SQL pool (e.g.flattening hierarchies, handling data quality issues).
One can go around the lack of standard table support needed especially for value mappings by storing the respective data as files and/or occasionally by misusing views, respectively by generating Spark tables via the Spark pool.
Note:
(*) Existing
customers have until 1-Nov-2024 to transition from Export to Data lake to
Synapse link. Microsoft advises new customers to use Synapse Link.
Previous Post <<||>> Next Post
🧊Data Warehousing: Architecture V (Dynamics 365, the Data Lakehouse and the Medallion Architecture)
Data Warehousing Series |
An IT architecture is built and functions under a set of constraints that derive from architecture’s components. Usually, if we want flexibility or to change something in one area, this might have an impact in another area. This rule applies to the usage of the medallion architecture as well!
In Data
Warehousing the medallion architecture considers a multilayered
approach in building a single source of truth, each layer denoting the quality of
data stored in the lakehouse [1]. For the moment are defined 3 layers - bronze for
raw data, silver for validated data, and gold for enriched data. The concept seems
sound considering that a Data Lake contains all types of raw data of different quality that
needs to be validated and prepared for reporting or other purposes.
On the
other side there are systems like Dynamics 365 that synchronize the data in
near-real-time to the Data Lake through various mechanisms at table and/or data entity level (think of
data entities as views on top of other tables or views). The databases behind
are relational and in theory the data should be of proper quality as needed by business.
The greatest benefit of serverless SQL pool is that it can be used to build near-real-time data analytics solutions on top of the files existing in the Data Lake and the mechanism is quite simple. On top of such files are built external tables in serverless SQL pool, tables that reflect the data model from the source systems. The external tables can be called as any other tables from the various database objects (views, stored procedures and table-valued functions). Thus, can be built an enterprise data model with dimensions, fact-like and mart-like entities on top of the synchronized filed from the Data Lake. The Data Lakehouse (= Data Warehouse + Data Lake) thus created can be used for (enterprise) reporting and other purposes.
As long as there are no special requirements for data processing (e.g. flattening hierarchies, complex data processing, high-performance, data cleaning) this approach allows to report the data from the data sources in near-real time (10-30 minutes), which can prove to be useful for operational and tactical reporting. Tapping into this model via standard Power BI and paginated reports is quite easy.
Now, if it's to use the data medallion approach and rely on pipelines to process the data, unless one is able to process the data in near-real-time or something compared with it, a considerable delay will be introduced, delay that can span from a couple of hours to one day. It's also true that having the data prepared as needed by the reports can increase the performance considerably as compared to processing the logic at runtime. There are advantages and disadvantages to both approaches.
Probably, the most important scenario that needs to be handled is that of integrating the data from different sources. If unique mappings between values exist, unique references are available in one system to the records from the other system, respectively when a unique logic can be identified, the data integration can be handled in serverless SQL pool.
Unfortunately, when compared to on-premise or Azure SQL functionality, the serverless SQL pool has important constraints - it's not possible to use scalar UDFs, tables, recursive CTEs, etc. So, one needs to work around these limitations and in some cases use the Spark pool or pipelines. So, at least for exceptions and maybe for strategic reporting a medallion architecture can make sense and be used in parallel. However, imposing it on all the data can reduce flexibility!
Bottom line: consider the architecture against your requirements!
Previous Post <<||>>> Next Post
[1] What is
the medallion lakehouse architecture?
https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion
20 October 2023
💎🏭SQL Reloaded: Extended LTrim/RTrim in SQL Server 2022 (Before and After)
In SQL Server 2022, the behavior of LTrim (left trimming) and RTrim (right trimming) functions was extended with one more string parameter. When provided, the engine checks whether the first parameter starts (for LTrim), respectively ends (for RTrim) with the respective value and removes it, the same as the space character char(32) was removed previously:
-- prior behavior of LTrim/RTrim DECLARE @text as nvarchar(50) = ' 123 ' SELECT '(' + LTrim(@text) + ')' LeftTrimming , '(' + RTrim(@text) + ')' RightTrimming , '(' + Ltrim(RTrim(@text)) + ')' Trimming1 -- prior SQL Server 2017 , '(' + Trim(@text) + ')' Trimming2 -- starting with SQL 2017
LeftTrimming | RightTrimming | Trimming1 | Trimming2 |
(123 ) | ( 123) | (123) | (123) |
Here's the new behavior:
-- extended behavior of LTrim/LTrim (SQL Server 2022+) DECLARE @text as nvarchar(50) = '123abc123abc' SELECT LTrim(@text , '123') LeftTrimming , RTrim(@text , 'abc') RightTrimming;
LeftTrimming | RightTrimming |
abc123abc | 123abc123 |
Previously, to obtain the same result one could write something like:
-- prior solution via Left/Right for the same (SQL Server 2000+) DECLARE @text as nvarchar(50) = '123abc123abc' SELECT CASE WHEN Left(@text, 3) = '123' THEN Right(@text,Len(@text)-3) ELSE @text END LeftTrimming , CASE WHEN Right(@text, 3) = 'abc' THEN Left(@text,Len(@text)-3) ELSE @text END RightTrimming -- prior solution via "LIKE" for the same (SQL Server 2000+) DECLARE @text as nvarchar(50) = '123abc123abc' SELECT CASE WHEN @text LIKE '123%' THEN Right(@text,Len(@text)-3) ELSE @text END LeftTrimming , CASE WHEN @text LIKE '%abc' THEN Left(@text,Len(@text)-3) ELSE @text END RightTrimming
As can be seen, the syntax is considerable simplified. However, there are few the situations when is needed. In the past I had to write code to remove parenthesis, quotes or similar characters:
-- removing parantheses DECLARE @text as nvarchar(50) = '(testing)' SELECT LTrim(@text , '(') LeftTrimming , RTrim(@text , ')') RightTrimming , RTrim(LTrim(Trim(@text), '('), ')') Trimming -- removing double quotes DECLARE @text as nvarchar(50) = '"testing"' SELECT LTrim(@text , '"') LeftTrimming , RTrim(@text , '"') RightTrimming , RTrim(LTrim(Trim(@text), '"'), '"') Trimming
The Trim for the 3rd value in both queries was used to remove the eventual spaces before the character to be replaced:
-- removing paranteses with lead/end spaces SELECT RTrim(LTrim(Trim(' (testing) '), '('), ')');
Then I thought, maybe I could use the same to remove the tags from an XML element. I tried the following code and unfortunately it doesn't seem to work:
-- attempting to remove the start/end tags from xml elements DECLARE @text as nvarchar(50) = '<string>testing</string>' SELECT LTrim(@text , '<string>') LeftTrimming , RTrim(@text , '</string>') RightTrimming , RTrim(LTrim(Trim(@text), '<string>'), '</string>') Trimming
LeftTrimming | RightTrimming | Trimming |
esting</string> | <string>te | e |
That's quite an unpleasant surprise! In exchange, the value type can be defined as XML and use the following code to obtain the needed result:
-- extracting the value from a tag element DECLARE @text XML = '<string>testing</string>' SELECT @text.query('data(/string)') as value
19 October 2023
📊Graphical Representation: Graphics We Live By II (Discount Rates in MS Excel)
Graphical Representation Series |
It's difficult, if not impossible, to give general rules on how data visualizations should be built. However, the data professional can use a set of principles, which are less strict than rules, and validate one's work against them. Even then one might need to make concessions and go against the principles or common sense, though such cases should be few, at least in theory. One of such important principles is reflected in Tufte's statement that "if the statistics are boring, then you've got the wrong numbers" [1].
So, the numbers we show should not be boring, but that's the case with most of the numbers we need to show, or we consume in news and other media sources. Unfortunately, in many cases we need to go with the numbers we have and find a way to represent them, ideally by facilitating the reader to make sense of the respective data. This should be our first goal when visualizing data. Secondly, because everybody talks about insights nowadays, one should identify the opportunity for providing views into the data that go beyond the simple visualization, even if this puts more burden on data professional's shoulder. Frankly, from making sense of a set of data and facilitating an 'Aha' moment is a long leap. Thirdly, one should find or use the proper tools and channels for communicating the findings.
A basic requirement for the data professional to be able to address these goals is to have clear definitions of the numbers, have a good understanding of how the numbers reflect the reality, respectively how the numbers can be put into the broader context. Unfortunately, all these assumptions seem to be a luxury. On the other side, the type of data we work with allows us to address at least the first goal. Upon case, our previous experience can help, though there will be also cases in which we can try to do our best.
Let's consider a simple set of data retrieved recently from another post - Discount rates (in percentage) per State, in which the values for 5 neighboring States are considered (see the first two columns from diagram A). Without knowing the meaning of data, one could easily create a chart in Excel or any other visualization tool. Because the State has categorical values, probably some visualization tools will suggest using bar and not column charts. Either by own choice or by using the default settings, the data professional might end up with a column chart (see diagram B), which is Ok for some visualizations.
One can start with a few related questions:
In these charts, the author signalized in titles that red denotes the lowest value, though it just reduces the confusion. One can meet titles in which several colors are used, reminding of a Christmas tree. Frankly, this type of encoding is not esthetically pleasing, and it can annoy the reader.
(6) What's in a name?
The titles and, upon case, the subtitles are important elements in communicating what the data reflects. The title should be in general short and succinct in the information it conveys, having the role of introducing, respectively identifying the chart, especially when multiple charts are used. Some charts can also use a subtitle, which can be longer than the title and have more of a storytelling character by highlighting the message and/or the finding in the data. In diagrams C and D the subtitles were considered as tiles, which is not considerably wrong.
In the media and presentations with influencing character, subtitles help the user understand the message or the main findings, though it's not appropriate for hardcoding the same in dynamic dashboards. Even if a logic was identified to handle the various scenarios, this shifts users' attention, and the chance is high that they'll stop further investigating the visualization. A data professional should present the facts with minimal interference in how the audience and/or users perceive the data.
As a recommendation, one should aim for clear general titles and avoid transmitting own message in charts. As a principle this can be summarized as "aim for clarity and equidistance".
(7) What about meaning?
Until now we barely considered the meaning of data. Unfortunately, there's no information about what the Discount rate means. It could be "the minimum interest rate set by the US Federal Reserve (and some other national banks) for lending to other banks" or "a rate used for discounting bills of exchange", to use the definitions given by the Oxford dictionary. Searching on the web, the results lead to discount rates for royalty savings, resident tuitions, or retail for discount transactions. Most probably the Discount rates from the data set refer to the latter.
We need a definition of the Discount rate to understand what the values represent when they are ordered. For example, when Texas has a value of 25% (see B), does this value have a negative or a positive impact when compared with other values? It depends on how it's used in the associated formula. The last two charts consider that the minimum value has a negative impact, though without more information the encoding might be wrong!
Important formulas and definitions should be considered as side information in the visualization, accompanying text or documentation! If further resources are required for understanding the data, then links to the required resources should be provided as well. At least this assures that the reader can acquire the right information without major overhead.
(8) What do readers look for?
Frankly, this should have been the first question! Readers have different expectations from data visualizations. First of all, it's the curiosity - how the data look in row and/or aggregated form, or in more advanced form how are they shaped (e.g. statistical characteristics like dispersion, variance, outliers). Secondly, readers look in the first phase to understand mainly whether the "results" are good or bad, even if there are many shades of grey in between. Further on, there must be made distinction between readers who want to learn more about the data, models, and processes behind, respectively readers who just want a confirmation of their expectations, opinions and beliefs (aka bias). And, in the end, there are also people who are not interested in the data and what it tells, where the title and/or subtitle provide enough information.
Besides this there are further categories of readers segmented by their role in the decision making, the planning and execution of operational, tactical, or strategic activities. Each of these categories has different needs. However, this exceeds the scope of our analysis.
Returning to our example, one can expect that the average reader will try to identify the smallest and highest Discount rates from the data set, respectively try to compare the values between the different States. Sorting the data and having the values close to each other facilitates the comparison and ranking, otherwise the reader needing to do this by himself/herself. This latter aspect and the fact that bar charts better handle the display of categorical data such as length and number, make from bar charts the tool of choice (see diagram E). So, whenever you see categorical data, consider using a bar chart!
Despite sorting the data, the reader might still need to subtract the various values to identify and compare the differences. The higher the differences between the values, the more complex these operations become. Diagram F is supposed to help in this area, the comparison to the minimal value being shown in orange. Unfortunately, small variances make numbers' display more challenging especially when the visualization tools don't offer display alternatives.
For showing the data from Diagram F were added in the table the third and fourth columns (see diagram A). There's a fifth column which designates the percentage from a percentage (what's the increase in percentages between the current and minimal value). Even if that's mathematically possible, the gain from using such data is neglectable and can create confusion. This opens the door for another principle that applies in other areas as well: "just because you can, it doesn't mean you should!". One should weigh design decisions against common sense or one's intuition on how something can be (mis)used and/or (mis)understood!
The downside of Diagram F is that the comparisons are made only in relation to the minimum value. The variations are small and allow further comparisons. The higher the differences, the more challenging it becomes to make further comparisons. A matrix display (see diagram G) which compares any two values will help if the number of points is manageable. The upper side of the numbers situated on and above the main diagonal were grayed (and can be removed) because they are either nonmeaningful, or the negatives of the numbers found below the diagram. Such diagrams are seldom used, though upon case they prove to be useful.
Choropleth maps (diagram H) are met almost everywhere data have a geographical dimension. Like all the other visuals they have their own advantages (e.g. relative location on the map) and disadvantages (e.g. encoding or displaying data). The diagram shows only the regions with data (remember the data-to-ink ratio principle).
(9) How about the shape of data?
When dealing with numerical data series, it's useful to show aggregated summaries like the average, quartiles, or standard deviation to understand how the data are shaped. Such summaries don't really make sense for our data set given the nature of the numbers (five values with small variance). One can still calculate them and show them in a box plot, though the benefit is neglectable.
(10) Which chart should be used?
As mentioned above, each chart has advantages and disadvantages. Given the simplicity and the number of data points, any of the above diagrams will do. A table is simple enough despite not using any visualization effects. Also, the bar charts are simple enough to use, with a plus maybe for diagram F which shows a further dimension of the data. The choropleth map adds the geographical dimension, which could be important for some readers. The matrix table is more appropriate for technical readers and involves more effort to understand, at least at first sight, though the learning curve is small. The column charts were considered only for exemplification purposes, though they might work as well.
In the end one should go with own experience and consider the audience and the communication channels used. One can also choose 2 different diagrams, especially when they are complementary and offer an additional dimension (e.g. diagrams F and H), though the context may dictate whether their use is appropriate or not. The diagrams should be simple to read and understand, but this doesn't mean that one should stick to the standard visuals. The data professional should explore other means of representing the data, a fresh view having the opportunity of catching the reader's attention.
As a closing remark, nowadays data visualization tools allow building such diagrams without much effort. Conversely, it takes more effort to go beyond the basic functionality and provide more value for thyself and the readers. One should be able to evaluate upfront how much time it makes sense to invest. Hopefully, the few methods, principles and recommendations presented here will help further!
Previous Post <<||>> Next Post
Resources:
[1] Edward R Tufte (1983) "The Visual Display of Quantitative Information"
14 October 2023
💠Azure Tools & Services: The Azure Diagrams Architecture Advisor (Test Drive)
There seems to be a new tool available for the Azure community, even if the tool doesn't belong yet to the Microsoft technology stack. I'm talking about Azure Diagrams Architecture Advisor, which enables users to "design diagrams in a collaborative manner and provides guidance on which services can be integrated and when they should be utilized in your architecture".
I took the tool for a test drive, trying to adapt an architecture I used recently for a Synapse Data Lakehouse. I used the Data Warehouse using Synapse Analytics example as source of inspiration. (You can use the links to check the latest versions.)
Overall, it was a good experience, the functionality being straightforward to use. I needed some time to identify what resources are available, and there are a few. Even if some of the resources were missing (e.g. Dataverse and Dynamics 365 services), I could use a custom resource to fill the gap.
One can easily drag and drop resources (see the left panel) on the canvas and create links between them. The diagram can be saved only as soon as you give it a name and sign in for an account, the email and a password are all you need for that. The diagram can be kept hidden or made available for the whole community, respectively can be saved locally as PNG or SVG files.
Serverless Data Lakehouse |
Each standard (non-custom) resource has a link to documentation, respectively to the pricing. The arrows between resources show whether the integration is supported by default. One can create bidirectional arrows by creating a second arrow from destination to source.
The functionality is pretty basic, though I was able to generate the diagram I wanted, and I'm satisfied with the result. Unfortunately, I couldn't modify directly the example I used for inspiration. It would be great if further functionality like the one available in Visio Online could be made available (e.g. automatic alignment, copy pasting whole diagrams or parts of them, making each value editable, grouping several resources together, adding more graphical content, etc.).
The resources and icons available focus on the Microsoft Azure technology stack. It would be great if one could import at least new icons, though ideally further resources from other important software vendors should be supported (e.g. AWS, Oracle, Adobe, etc.).
The possibility to define further links (e.g. to the actual environments) or properties (e.g. environment names), respectively to change the views for the whole diagram could enhance the usability of such diagrams.
Microsoft should think seriously about including this tool in their stack, if not in Visio Online at least as standalone product!
Besides the examples and the shared diagrams available you can browse also through the collection of Azure Architectures.
To summarize, here are the improvements suggested:
- save and modify directly the examples;
- automatic alignment of resources within the canvas;
- copy pasting whole diagrams or parts of them;
- make each value editable;
- group several resources together (e.g. in a layer);
- adding more graphical content;
- import new icons;
- support resources from other important software vendors (e.g. AWS, Oracle, Adobe, etc.);
- define further links (e.g. to the actual environments);
- define further properties (e.g. environment names);
- change the views for the whole diagram;
- integration into Visio Online.
🧭Business Intelligence: Perspectives (Part VIII: Insights - The Complexity Perspective)
Scientists attempt to discover laws and principles, and for this they conduct experiments, build theories and models rooted in the data they collect. In the business setup, data professionals analyze the data for identifying patterns, trends, outliers or anything else that can lead to new information or knowledge. On one side scientists chose the boundaries of the systems they study, while for data professionals even if the systems are usually given, they can make similar choices.
In theory, scientists are more flexible in what data they collect, though they might have constraints imposed by the boundaries of their experiments and the tools they use. For data professionals most of the data they need is already there, in the systems the business uses, though the constraints reside in the intrinsic and extrinsic quality of the data, whether the data are fit for the purpose. Both parties need to work around limitations, or attempt to improve the experiments, respectively the systems.
Even if the data might have different characteristics, this doesn't mean that the methods applied by data professionals can't be used by scientists and vice-versa. The closer data professionals move from Data Analytics to Data Science, the higher the overlap between the business and scientific setup.
Conversely, the problems data professionals meet have different characteristics. Scientists outlook is directed mainly at the phenomena and processes occurring in nature and society, where randomness, emergence and chaos seem to feel at home. Business processes deal more with predefined controlled structures, cyclicity, higher dependency between processes, feedback and delays. Even if the problems may seem to be different, they can be modeled with systems dynamics.
Returning to data visualization and the problem of insight, there are multiple questions. Can we use simple designs or characterizations to find the answer to complex problems? Which must be the characteristics of a piece of information or knowledge to generate insight? How can a simple visualization generate an insight moment?
Appealing to complexity theory, there are several general approaches in handling complexity. One approach resides in answering complexity with complexity. This means building complex data visualizations that attempt to model problem's complexity. For example, this could be done by building a complex model that reflects the problem studied, and build a set of complex visualizations that reflect the different important facets. Many data professionals advise against this approach as it goes against the simplicity principle. On the other hand, starting with something complex and removing the nonessential can prove to be an approachable strategy, even if it involves more effort.
Another approach resides in reducing the complexity of the problem either by relaxing the constraints, or by breaking the problem into simple problems and addressing each one of them with visualizations. Relaxing the constraints allow studying upon case a more general problem or a linearization of the initial problem. Breaking down the problem into problems that can be easier solved, can help to better understand the general problem though we might lose the sight of emergence and other behavior that characterize complex systems.
Providing simple visualizations to complex problems implies a good understanding of the problem, its solution(s) and the overall context, which frankly is harder to achieve the more complex a problem is. For its understanding a problem requires a minimum of knowledge that needs to be reflected in the visualization(s). Even if some important aspects are assumed as known, they still need to be confirmed by the visualizations, otherwise any deviation from assumptions can lead to a new problem. Therefore, its questionable that simple visualizations can address the complexity of the problems in a general manner.
Previous Post <<||>> Next Post
🧭Business Intelligence: Perspectives (Part VII: Insights - Aha' Moments)
Business Intelligence Series |
One probably can't deny the potentiality of tools or toolsets like data visualization or data storytelling in providing new information or knowledge that leads to insights, though between potential usefulness and harnessing that potential on a general basis there's a huge difference, no matter how much people tend to idealize the process (and there's lot of idealization going on). Moreover, sometimes the whole process seems to look like a black box in which some magic happens and insight happens.
It's challenging to explain the gap as long as there's no generally accepted scientific definition of insights, respectively an explanation of how insights come into being. Probably, the easiest way to recognize their occurrence is when an 'Aha' moment appears, though that's the outcome of a process and gives almost no information about the process itself. Thus, insight occurs when knowledge about the business is acquired, knowledge that allows new or better understanding of the data, facts, processes or models involved.
So, there must be new associations that are formed, either derived directly from data or brought to surface by the storytelling process. The latter aspect implies that the storyteller is already in possession of the respective insight(s) or facilitates their discovery without being aware of them. It further implies that the storyteller has a broader understanding of the business than the audience, which is seldom the case, or that the storyteller has a broader understanding of the data and the information extracted from the data, and that's a reasonable expectation.
There're two important restrictions. First, the insight moments must be associated with the business context rather than with the mere use of tools! Secondly, there should be genuine knowledge, not knowledge that the average person should know, respectively the mere confirmation of expectations or bias.
Understanding can be put in the context of decision making, respectively in the broader context of problem solving. In the latter, insight involves the transition from not knowing how to solve a problem to the state of knowing how to solve it. So, this could apply in the context of data visualization as well, though there might exist intermediary steps in between. For example, in a first step insights enable us to understand and define the right problem. A further step might involve the recognition of the fact the problem belongs to a broader set of problems that have certain characteristics. Thus, the process might involve a succession of 'Aha' moments. Given the complexity of the problems we deal with in business or social contexts, that's more likely to happen. So, the average person might need several 'Aha' moments - leaps in understanding - before the data can make a difference!
Conversely, new knowledge and understanding obtained over successive steps might not lead to an 'Aha' moment at all. Whether such moments converge or not to 'Aha' moments may rely on the importance of the overall leap, though other factors might be involved as well. In the end, the emergence of new understanding is enough to explain what insights mean. Whether that's enough is a different discussion!
Previous Post <<||>> Next Post
10 October 2023
💫Base Enums Metadata in Dynamics 365 for Finance and Operations
The list of values that don't have their own tables are managed within the application as (Base) Enums and, naturally, only the numeric values being saved to the database. Even if this is practical for the application, it's a nightmare for the people using the data exported from database as is needed to convert the codes to meaningful values.
Fortunately, there's a way to go around this limitation. First, there's the old obsolete AX 2012 documentation for Base Enums, which documents the Names and Descriptions for the Enum Values (e.g., search for CustVendNegInstStatus). As Microsoft stopped maintaining the documentation, the changes made in D365 are not reflected.
Secondly, the mapping between Values and Names is stored in two of the tables available in D365, the following query allowing to retrieve the mapping:
-- Metadata - Base Enum Values SELECT EIT.Id , EIT.Name EnumName , EVT.EnumValue EnumValueId , EVT.Name EnumValueName , ' WHEN ' + Cast(EVT.EnumValue as nvarchar(10)) + ' THEN ''' + EVT.Name + '''' [CASE Statement] FROM dbo.EnumValueTable EVT JOIN dbo.EnumIdTable EIT ON EVT.EnumId = EIT.Id WHERE EIT.Name LIKE 'CustVendNegInstStatus%' --WHERE EVT.Name LIKE '%Redrawn%' ORDER BY EIT.Name , EVT.EnumValue
The [Case Statement] column generates already the statement for the CASE. One can search for the Base Enum, respectively for the value itself.
Unfortunately, the metadata are not complete and, as there's no timestamp, it's not possible to identify what and when changed. Therefore, one needs to regularly check the changes manually, or store a copy of the respective data as baseline, thus being able to compare the latest metadata against the baseline. The solution is not ideal, but it does the trick.
To identify the Base Enum corresponding to a field one can export the script behind a form when the functionality is available. Otherwise, one can use the Table Browser and generate the script from the corresponding source table.
Occasionally, the value stored in the above tables may deviate slightly from the expectation. It's useful then to search only for a substring, use both sources and even search on the internet for more information.
When building reports or a semantic layer on top of D365 data model, for almost all important tables is needed to retrieve such mappings. It's a good idea to encapsulate the mappings in single "base views" and reuse it when needed, avoiding thus to replicate the mappings for each use of the source table (see the logic for TDM.vEcoResProduct).
The data generated by the above script could be dumped to a table and thus the table joined in the queries when the values for a Base Enum are needed, though I don't recommend such an approach because it would increase the complexity of the queries, already complex in certain scenarios. Also the misuse of views for creating pseudoconstants is not recommended.
Happy coding!
💫Data Warehousing and Dynamics 365 for Finance and Operation - A Few Issues to Consider I
Data Warehousing Series |
Introduction
Besides the fact that data professionals don't have direct access to D365 F&O production environments (direct access is available only to sandboxes), which was from the beginning an important constraint imposed by the architecture, there are a few more challenges that need to be addressed when working with the data.
Case Sensitiveness
SQL Server is not case sensitive, therefore, depending on the channel though which the data came, values appear either in upper or lower case, respectively a mixture of both. Even if this isn't an issue in D365, it can become an issue when the data leave the environment. E.g., PowerQuery is case sensitive (while DAX is case insensitive), thus, if a field containing a mix of values participate in a join or aggregation, this will result in unexpected behavior (e.g., duplicates, records ignored). It's primarily the case of the Company (aka DataAreaId) field available in most of the important tables.
The ideal solution would be to make sure that the values are correct by design, however this can't be always enforced. Otherwise, when using the data outside of D365 F&O the solution would be to transform all the values in upper case (or lower case). However, also this step might occur too late. E.g., when the data are exported to the Azure Data Lake in parquet file format.
Unique Keys
A unique record in D365 F&O was in earlier versions usually identified by the RecId and DataAreaId, while later the Partition field was added. This means that most of the joins will need to consider all 3 columns, which adds some overhead. In some environments there's only a Partition defined (and thus the field can be ignored), however this is not a warranty.
As long developers use SQL there's no issue of using multiple fields in JOINs, though in PowerQuery there must be created a unique key based on the respective records so the JOINs are possible. Actually, also SQL-based JOINs would benefit if each record would be identified by one field.
Audit Metadata
Not all tables have fields that designate the date when a record was created or last modified, respectively the user who performed the respective action. The fields can be added manually when setting up the system, however that's seldom done. This makes it difficult to audit the records and sometimes it's a challenge also for reporting, respectively for troubleshooting the differences between DWH and source system. Fortunately, the Export to Data Lake adds a timestamp reflecting the time when the record was synchronized, though it can be used then only for the records synchronized after the first load.
Tables vs. Entities
Data are modified in D365 F&O via a collection of entities, which are nothing but views that encapsulate the business logic, being based on the base tables or other views, respectively a combination of both. The Export to Data Lake (*) is based on the tables, while Link to Data Lake is based on data entities.
Using the base tables means that the developer must reengineer the logic from the views. For some cases it might work to create the entities as views in the DWH environment though some features might not be supported. It's the case of serverless and dedicated SQL pools, that support only a subset from the features available under standard Azure SQL Server.
The developer can try to replicate the logic from entities, considering only the logic needed by the business, especially when only a subset from the functionality available in the entity was used. The newly created views can become thus more readable and maintainable. On the other side, if the logic in entity changed, the changes need to be reflected also in the DWH views.
Using the entity-based data makes sure that the data are consistent between environments. Unfortunately, Microsoft found out that isn't so easy to synchronize the data at entity level. Moreover, there are multiple entities based on the same table that reflect only a subset of the columns or rows. Thus, to cover all the fields from a base table, one might be forced to synchronize multiple views, leading thus to data duplication.
In theory, both mechanisms can be used within the same environment, even if this approach is against the unique source of truth principle, when data are duplicated.
Data Validation in the Data Lake
One scenario in which both sources are useful is when validating whether the synchronization mechanism worked as expected. Thus, one can compare the number of records and check whether there are differences that can't be mitigated. However, does it make sense to "duplicate" database objects only for this purpose?
Ideally, to validate whether a record was synchronized should be done in the source environment (e.g. via a timestamp). That's difficult to achieve, especially when there's no direct access to the source database (as is the case for Production databases). Fortunately, Dataverse provides this functionality, even if might not be bullet proof.
In extremis, the most reliable approach is to copy the production environment on a sandbox and do a count of records for each table, using as baseline for comparison the time when the refresh occurred.
Base Enum Values
The list of values that don't have their own tables are managed within the application as Base Enums and, naturally, only the numeric values being saved to the database. Even if this is practical for the application, it's a nightmare for the people using the data exported from database as is needed to convert the codes to meaningful values. Some of the mappings between the codes and values are documented in two system tables, and even in old language-based documentation, though both sources are far from complete. As alternative, one can try to discover the values in the system.
Unfortunately, the mappings need to be repeated when the Enum-based attributed is used in multiple places. One can reduce mapping's duplication by encapsulating the logic into a view (aka "base view") and reused accordingly (see the logic for TDM.vEcoResProduct).
Even if the values for many of the Enums are stored into the EnumValueTable table, Enum's name being available in EnumIdTable table, it's not a good idea to retrieve the values via a JOIN in the business logic. This would complicate the business logic unnecessarily. A CASE is more efficient even if occasionally more difficult to maintain. Unfortunately, there's no timestamp to identify which values were added lately.
Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data lake to Synapse link. Microsoft advises new customers to use Synapse Link.
🪄💫SSRS (& Paginated Reports): Dynamics 365 F&O (Part II: Products Master Report via Base Tables)
As mentioned in the previous post, building the Products Master paginated report via the base D365 FO tables synchronized in Data Lake involves more effort and implies duplicating the logic from the used data entity(-ies). One can attempt recreating the entities 1:1 as they are defined in the D365 database (an SQL Server Azure database), however there are high the chances that in certain areas (e.g. financial dimensions) the features used in building the logic are not supported. Moreover, it might be the case that only a subset of the functionality was implemented, and with this, pieces of logic becoming obsolete.
There's also a limitation on the number of tables which can be synchronized, and therefore one needs to compromise. In the dbo.EcoResProductV2Entity there are about 20 tables involved, though probably only half of them are used. The base table dbo.EcoResProduct contains most of the fields available also in the data entity, though the names for the various attributes need to be retrieved from the various tables.
Some of the issues met in working with data entities appear in this scenario as well: (2) there are fields based on Enums that store only the code and it's needed to map the corresponding values, (4) further transformations are needed (e.g., converting fields to formats). It makes sense to encapsulate similarly the logic in views, though this time one or two layers of logic are needed in addition. One can choose to do this for each table in scope, or replicate data entity's logic by focusing only on the needed parts. Both approaches will prove to be challenging in certain scenarios. Let's consider the latter:
CREATE OR ALTER VIEW TDM.vEcoResProductV2Entity AS /* name: Products - Master (base view) created: 01.04.2021 modified: 01.04.2021 */ SELECT ITM.ProductType , CASE ITM.ProductType WHEN 1 THEN 'Item' WHEN 2 THEN 'Service' END ProductTypeName , CAST ((CASE WHEN ITM.InstanceRelationType = 13678 THEN 1 ELSE 2 END) AS INT) ProductSubtype , CASE WHEN ITM.InstanceRelationType = 13678 THEN 'Product' ELSE 'Product Master' END ProductSubtypeName , ITM.DisplayProductNumber ProductNumber , Replace(Replace(ILT.Name, char(10), ' '), char(13), ' ') ProductName , Replace(Replace(ILT.Description, char(10), ' '), char(13), ' ') ProductDescription , PCA.CategoryName RetailProductCategoryName , PCA.CategoryCode RetailProductCategoryCode , PDG.Name ProductDimensionGroupName , SDG.Name StorageDimensionGroupName , TDG.Name TrackingDimensionGroupName , ITM.RetailColorGroupId ProductColorGroupId , ITM.RetailSizeGroupId ProductSizeGroupId , ITM.RetailStyleGroupId ProductStyleGroupId , ITM.VariantConfigurationTechnology , CASE ITM.VariantConfigurationTechnology WHEN 0 THEN 'None' WHEN 1 THEN 'Predefined Variants' WHEN 2 THEN 'Dimension Based' WHEN 3 THEN 'RuleBased' WHEN 4 THEN 'Constraint Based' END VariantConfigurationTechnologyName , CASE WHEN KIT.ProductMaster IS NOT NULL THEN 1 ELSE 0 END IsProductKit , CASE WHEN KIT.ProductMaster IS NOT NULL THEN 'Yes' ELSE 'No' END IsProductKitName , ITM.PDSCWProduct IsCatchWeightproduct , CASE ITM.PDSCWProduct WHEN 1 THEN 'Yes' ELSE 'No' END IsCatchWeightproductName , ITM.IsProductVariantUnitConversionEnabled , CASE ITM.IsProductVariantUnitConversionEnabled WHEN 1 THEN 'Yes' ELSE 'No' END IsProductVariantUnitConversionEnabledName -- system , PPD.ProductDimensionGroup ProductDimensionGroupRecId , PSG.StorageDimensionGroup StorageDimensionGroupRecId , PTD.TrackingDimensionGroup TrackingDimensionGroupRecId , PCA.RetailCategoryRecId , ITM.RecId , ITM.Partition FROM dbo.EcoResProduct ITM LEFT JOIN dbo.EcoResProductTranslation ILT ON ITM.RecId = ILT.Product AND ITM.Partition = ILT.Partition AND ILT.LanguageId = 'en-us' LEFT JOIN dbo.EcoResProductDimensionGroupProduct PPD ON ITM.RecId = PPD.Product AND ITM.Partition = PPD.Partition LEFT JOIN dbo.EcoResProductDimensionGroup PDG ON PPD.ProductDimensionGroup = PDG.RecId AND PPD.Partition = PDG.Partition LEFT JOIN dbo.EcoResStorageDimensionGroupProduct PSG ON ITM.RecId = PSG.Product AND ITM.Partition = PSG.Partition LEFT JOIN dbo.EcoResStorageDimensionGroup SDG ON PSG.StorageDimensionGroup = SDG.RecId AND PSG.Partition = SDG.Partition LEFT JOIN dbo.EcoResTrackingDimensionGroupProduct PTD ON ITM.RecId = PTD.Product AND ITM.Partition = PTD.Partition LEFT JOIN dbo.EcoResTrackingDimensionGroup TDG ON PTD.TrackingDimensionGroup = TDG.RecId AND PTD.Partition = TDG.Partition LEFT JOIN (-- product retail category SELECT PCA.Product , PCA.Partition , CAT.Code CategoryCode , CAT.Name CategoryName , PCA.Category RetailCategoryRecId FROM dbo.EcoResProductCategory PCA JOIN dbo.EcoResProductRetailCategoryHierarchy RCH ON PCA.CategoryHierarchy = RCH.RetailCategoryHierarchy AND PCA.Product = RCH.Product AND PCA.Partition = RCH.Partition JOIN dbo.EcoResCategory CAT ON PCA.Category = CAT.RecId AND PCA.Partition = CAT.Partition ) PCA ON ITM.RecId = PCA.Product AND ITM.Partition = PCA.Partition LEFT JOIN dbo.RetailKit KIT ON ITM.RecId = KIT.ProductMaster AND ITM.Partition = KIT.Partition WHERE NOT(ITM.InstanceRelationType = 4211)
As can be seen, the logic is quite complex and only half of the tables were used. There will be entities even more complex than this (the query was restructured for understandability). On the other hand, there will be environments where only half from the above tables will be used (e.g., when no inventory and/or dimensions are needed).
To test the view, one just needs to change the schema referenced in the logic. The view is created under the TDM (Table Data Model) schema, so there should be no issues as long the schemas are used correctly. However, when duplicating logic, one should check whether the values match, respectively whether the objects have the same record count:
-- checking whether the values match for an example SELECT * FROM TDM.vEcoResProductV2Entity WHERE ProductNumber = '0169' SELECT * FROM EDM.vEcoResProductV2Entity WHERE ProductNumber = '0169' -- checking the number of records SELECT count(*) FROM TDM.vEcoResProductV2Entity SELECT count(*) FROM EDM.vEcoResProductV2Entity
There are several optimizations or restructuring opportunities in the logic, e.g., the 'product retail category' subquery can be created as a separate view. Also, the groupings of two tables for the 'dimension groups' could be encapsulated individually in views. This can increase the number of views created considerably, though upon case the views could prove to be useful for troubleshooting or reuse.
Moreover, in environments with only a partition the constraints on the respective field could be removed (important decision though), while the value for the retail category could be hardcoded. Even if hardcoding values should be in general avoided, it's needed to compromise when there are more important constraints.
One way to reduce logic's complexity is to create a "base view" for each table in which the needed transformations are made, respectively only the needed columns are used. Reducing the number of columns simplifies the identification of attributes. For example, the base view for dbo.EcoResProduct could be written as follows:
CREATE OR ALTER VIEW [TDM].[vEcoResProduct] AS /* name: Products - master data (base view) created: 01.04.2021 modified: 01.04.2021 */ SELECT PRD.DisplayProductNumber ProductNumber , PRD.ProductType , CASE PRD.ProductType WHEN 1 THEN 'Product' WHEN 2 THEN 'ProductMaster' WHEN 3 THEN 'ProductVariant' END ProductTypeName , CASE WHEN PRD.INSTANCERELATIONTYPE = 15969 THEN 1 ELSE 2 END ProductSubtype , CASE WHEN PRD.INSTANCERELATIONTYPE = 15969 THEN 'Item' ELSE 'Product master' END ProductSubtypeName , PRD.RetailSizeGroupId , PRD.RetailColorGroupId , PRD.RetailStyleGroupId , PRD.RetailFlavorGroupId , PRD.VariantConfigurationTechnology , CASE PRD.VariantConfigurationTechnology WHEN 0 THEN 'None' WHEN 1 THEN 'PredefinedVariants' WHEN 2 THEN 'DimensionBased' WHEN 3 THEN 'RuleBased' WHEN 4 THEN 'ConstraintBased' END VariantConfigurationTechnologyName , PRD.IsProductVariantUnitConversionEnabled , CASE PRD.IsProductVariantUnitConversionEnabled WHEN 1 THEN 'Yes' ELSE 'No' END IsProductVariantUnitConversionEnabledName -- system , PRD.RecId , PRD.Partition , PRD.ModifiedBy FROM dbo.EcoResProduct PRD WHERE NOT(PRD.InstanceRelationType = 4211) GO
A similar approach can be used for each table, or at least the tables that need further transformations. There are several choices - ideally one should find a good-enough approach and stick to it. There will be also exceptions, though the general design should prevail!
The bottom line, when using the Export to Data Lake add-in (*), as only the export of tables is supported, the logic can become occasionally complex, though still manageable (in other areas the tables are straightforward to use). When using Link to Data Lake, one can rely on data entities, as long they are synchronized correctly. However, one still needs to rely on tables to fill the gaps.
Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data lake to Synapse link. Microsoft advises new customers to use Synapse Link.
Happy coding!
Previous Post << ||>> Next Post
About Me
- Adrian
- Koeln, NRW, Germany
- IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.