
07 August 2024

Business Intelligence: Data Modeling (Part II: From Data to Data Models)

Business Intelligence Series

A data model can be defined as an abstract, self-contained, logical definition of the data structures available in a database or similar repositories. It's typically an abstraction of the data structures underpinning a set of processes, procedures and business logic used for a predefined purpose. A data model can also be formed of unrelated micromodels, each depicting a different aspect of a business. 

The association between data and data models is bidirectional: given a set of data, a data model can be built to underpin the respective data; conversely, one can create or generate data based on a data model. In business setups such a bidirectional relationship between data and the data model(s) underpinning them is the more realistic picture as the business evolves. In extremis, the data model can be used to reflect a business' needs, at least when the respective needs are addressed by extending the data model(s) accordingly.

Given a set of data (e.g. the data stored in one or more spreadsheets or other types of files), multiple data models can in theory be defined to reflect the respective data. Within a data model, the fields (aka attributes) are partitioned into a set of data entities, where a data entity is a nonunique grouping of attributes that together attempt to define one unitary aspect of the world. Customers, Vendors, Products, Invoices or Sales Orders are examples of such data entities, though entities can have a broader granularity (e.g. Customers can be modeled over several tables like Entity, Addresses, Contact information, etc.). 

From an operational database's perspective, a data entity is based on one or more tables, though several entities can share some of the tables. From a BI artifact's perspective, an entity should be easy to create from the underlying tables, with a minimal set of transformations. Ideally, the BI data model should be as close as possible to the entity needed for reporting; however, an optimal solution usually lies somewhere in between. In this resides the complexity of modeling BI solutions - providing an optimal data model which can be easily built on the source tables, and which allows addressing all or at least most of the BI requirements.
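
To illustrate, here's a minimal sketch of such an entity as a view, assuming hypothetical base tables (dbo.Entity, dbo.Addresses, dbo.ContactInformation) along the lines of the Customers example above:

-- Customers entity built over several base tables (illustrative names)
CREATE VIEW dbo.vCustomers
AS
SELECT ENT.EntityId CustomerId
, ENT.Name CustomerName
, ADR.City
, ADR.CountryCode
, CNT.Email
, CNT.Phone
FROM dbo.Entity ENT
     LEFT JOIN dbo.Addresses ADR
       ON ENT.EntityId = ADR.EntityId
      AND ADR.IsPrimary = 1
     LEFT JOIN dbo.ContactInformation CNT
       ON ENT.EntityId = CNT.EntityId
      AND CNT.IsPrimary = 1;

The operational model keeps the three tables separate for fast writes, while the BI model exposes them as a single Customers entity for reads.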

In other words, we deal with two optimization problems of two distinct data models. On one side the business data model must be flexible enough to provide fast read/write operations while keeping the referential data’s granularity efficient. Conversely, a BI data model needs to abstract these entities and provide a fast way of processing the data, while making data reads extremely efficient. These perspectives must apply when we move to Microsoft Fabric too. 

The operational data layer must provide this abstraction, and in this resides the complexity of building optimal BI solutions. This is the layer at which the modeling problems need to be tackled. The challenge of BI and Analytics resides in finding an optimal data model that allows us to address most or ideally all the BI requirements. Several overlapping layers of abstraction may be built in the process.

Looking at the data modeling techniques used in notebooks and other similar solutions, data modeling risks becoming a redundant practice prone to errors. Moreover, data models have a tendency of being multilayered and of being based on certain perspectives into the processes they model. Providing reliable, flexible models involves finding the right view into the data for modeling aspects of the business. Database views allow us to easily model such perspectives, often in a unique way. Moving away from them just shifts the burden onto the multiple solutions built around the base data, which can create other important challenges. 

10 April 2024

Business Intelligence: Data Modeling (Part I: Ways of Thinking about Data)

Business Intelligence Series

I observed in several cases the tendency of data professionals to move from a business problem directly to data and data modeling without trying to understand the processes behind the data. One could say that the behavior is driven by the eagerness to explore the data, though even later questions about the processes themselves are seldom considered. One can argue that maybe the processes are self-explanatory, though that's seldom the case. 

Conversely, looking at the datasets available on the web, usually there's a fact table and the associated dimensions, the data describing only one process. It's natural to presume that there are data professionals who don't think much about processes or, better said, in terms of processes. A similar big jump can be observed in blog posts on dashboards and/or reports, bloggers moving from the data directly to the data model. 

In the world of complex systems like Enterprise Resource Planning (ERP) systems thinking in terms of processes is mandatory because a fact table can hold the data for different processes, while processes can span over multiple fact-like tables, and have thus multiple levels of detail. Moreover, processes are broken down into sub-processes and procedures that have a counterpart in the data as well. 

Moreover, within a process there can be multiple perspectives that are usually module or role dependent. A perspective is a role's orientation to the world to which the data belongs, and it's slightly different from what the data professional considers as a view: the perspective is a projection over a set of processes within the data, while a view is a projection of the perspectives into the data structure. 

For example, considering the order-to-cash process there are several sub-processes like order fulfillment, invoicing, and payment collection, though there can be several other processes involved like credit management or production and manufacturing. Creating, respectively updating, or canceling an order can be examples of procedures. 

The sales representative, the shop worker and the accountant will have different perspectives projected into the data, focusing on the projection of the data on the modules they work with. Thinking in terms of modules is probably the easiest way to identify the boundaries of the perspectives, though the rules are occasionally more complex than this.

When defining and/or attempting to understand a problem it’s important to understand which perspective needs to be considered. For example, the sales volume can be projected based on Sales orders or on invoiced Sales orders, respectively on the General ledger postings, and the three views can result in different numbers. Moreover, there are partitions within these perspectives based on business rules that determine what to include or exclude from the logic. 

One can define a business rule as a set of conditional logic that constrains some part of the data in the data structures by specifying what is allowed or not, though usually we refer to a special type called selection business rule that determines what data are selected (e.g. open Purchase orders, Products with Inventory, etc.). When building the data model we need to consider such business rules as well, though we might also need to check whether they are enforced. 
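
For illustration, a selection business rule can be encapsulated as a simple filter or view; the sketch below assumes a hypothetical dbo.PurchaseOrderHeaders table and illustrative status values:

-- "open Purchase orders" as a selection business rule (illustrative names and values)
CREATE VIEW dbo.vOpenPurchaseOrders
AS
SELECT POH.PurchaseOrderNumber
, POH.VendorAccount
, POH.OrderDate
, POH.PurchaseStatus
FROM dbo.PurchaseOrderHeaders POH
WHERE POH.PurchaseStatus NOT IN ('Invoiced', 'Canceled');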

Moreover, it's useful to think also in terms of (data) entities and sub-entities, in which the data entity is an abstraction from the physical implementation of database tables. A data entity encapsulates (hides internal details) a business concept and/or perspective into an abstraction (simplified representation) that makes development, integration, and data processing easier. In certain systems like Dynamics 365 it's important to think at this level because data entities can simplify data modeling considerably.

17 February 2024

Business Intelligence: Microsoft Fabric's Notebooks

Business Intelligence Series

When several technologies make their entrance in a data-related field like Data Warehousing, Data Analytics or Data Science, one is forced to understand how the respective technologies can be used or misused, respectively what their place is in the bigger picture. Microsoft Fabric introduces several important technologies that will change the way data are stored, processed and consumed. 

The first important technology is the notebook - a web document-like, cell-based container for writing and executing code in a collaborative manner. The concept is not new, Jupyter notebooks have been around for almost a decade. In Microsoft Fabric, notebooks support multiple languages, of which a default one applies to the whole notebook, while at cell level any of the supported languages can be used. 

One can execute a single cell, multiple cells or the entire notebook in a sequential manner, mix languages for the various operations - load, transform, save, and visualize data when needed. Notebooks can be parametrized and run via the homonymous activity in Data Factory pipelines, automating thus data processing. Probably more functionality is to come. 

Data engineers seem to have great flexibility, though usually flexibility implies constraints and/or mischief in other areas. I see, for example, in presentations the overuse of temporary data objects (mainly views) in Spark SQL as part of complex logic. That's acceptable during prototyping, though such code becomes a danger as soon as the logic is deployed into production. Data objects should be created outside of the logic that uses them and should be treated as artifacts, with version control and proper documentation. It's maybe true that temporary objects reduce the volume of objects in the metastore, though is this the way to go?
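
As a rough sketch of the difference (the table and column names are illustrative), a temporary view lives only in the current Spark session, while a permanent view is registered in the metastore and can be treated as a versioned, documented artifact:

-- temporary view: exists only for the current Spark session, easily duplicated across notebooks
CREATE OR REPLACE TEMPORARY VIEW vAssets_InScope
AS
SELECT Id, Vendor, Asset
FROM assets
WHERE Vendor = 'Microsoft';

-- permanent view: registered in the metastore, can be documented and versioned as an artifact
CREATE OR REPLACE VIEW vAssetsInScope
COMMENT 'Assets in scope (managed artifact)'
AS
SELECT Id, Vendor, Asset
FROM assets
WHERE Vendor = 'Microsoft';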

Temporary objects tend to lead to reinventing the wheel, or they get duplicated across multiple notebooks, which can easily create a maintenance nightmare. One needs to consider that the business logic changes a lot, the requirements and the data sources change, and in the long term the cost of maintaining the code can easily outweigh the benefits. 

Notebooks remind me of the beginnings of web programming, when HTML was mixed with client-side scripting languages like VBScript or JavaScript, CSS, respectively server-side scripting languages. It was kind of spaghetti code, modified repeatedly by multiple programmers, endlessly duplicated, and through a miracle it worked, until it stopped working unexpectedly in the strangest situations. The strangest part was when removing commented code from a section made the code run again. 

The debugging of another person's code was a nightmare. Code developed by two people for similar purposes looked unrecognizably different in terms of structure, programming techniques and layout. The technical debt was high, increasing exponentially. One was aware that the code needed refactoring, though there were more important things to do or no time allocated for it.

In the meantime the maturity of programming languages, frameworks, methodologies, best practices, and hopefully of programmers improved the overall quality of software (at least on average). Thinking of software from an Engineer's perspective improved the efficiency and effectiveness of a programmer's endeavor. The average programmer is able to write quality code, though there's a considerable minimum of "engineering" knowledge involved beside the mere knowledge of languages and tools. 

Notebooks are good up to a point, beyond which one needs to take a step back, restructure, move the code where it belongs, take a few more steps back and review the good practices and their application, disseminate the knowledge inside the team and use it in the next iterations, respectively refactor the code when needed! Hopefully, people learned from the mistakes of the past. 

Resources:
[1] Microsoft Learn (2023) How to use Microsoft Fabric notebooks (link)

07 February 2024

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Views and other Data Objects

One reads in the training material that the SQL Endpoint provides a read-only experience [1], meaning that no data can be written back to the delta lake tables. Playing with the metadata available in Spark SQL via Notebooks and the SQL Endpoint (see post), I realized that there is more to the statement! Even if one can query via the SQL Endpoint only delta tables, this doesn't mean that one can't build a semantic model on top of it, much like one was able to do via the Serverless SQL pool in Azure Synapse.

In Spark one can create views and functions via SQL, PySpark and the other supported languages, though they will not be available to the SQL Endpoint! To use the data generated in the process, the respective data needs to be saved to delta tables. Conversely, one can still create views, functions and stored procedures via the SQL Endpoint, though the objects won't be available in Spark SQL! 

This has important implications, though in this post let's focus on the syntax and create several objects for testing purposes in the two environments. I'll use the Assets delta table created in a previous post. The Spark SQL code should be run in a notebook (e.g. one cell per group of statements), while the code for the SQL Endpoint should be run in SQL Server Management Studio.

Views

/* test view's creation in the SQL Endpoint */

-- drop the test view 
DROP VIEW IF EXISTS dbo.vAssets_Microsoft2;
GO

-- create the test view
CREATE VIEW dbo.vAssets_Microsoft2
AS
--Microsoft assets
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets
WHERE Vendor = 'Microsoft';
GO

-- test the view
SELECT *
FROM dbo.vAssets_Microsoft2;

/* test view's creation in Spark SQL */

-- drop test view 
DROP VIEW IF EXISTS vAssets_Microsoft;

-- create test view
CREATE VIEW vAssets_Microsoft COMMENT 'Microsoft assets in scope (view)'
AS
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM assets
WHERE Vendor = 'Microsoft';

-- review data
SELECT *
FROM vAssets_Microsoft;

Table-Valued Functions

/* test function's creation in the SQL Endpoint */

-- drop the test function 
DROP FUNCTION IF EXISTS dbo.fAssets_Microsoft2;
GO

-- create the test function
CREATE FUNCTION dbo.fAssets_Microsoft2(
    @Vendor nvarchar(max))
RETURNS TABLE 
AS 
RETURN (
    SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
    FROM dbo.Assets
    WHERE Vendor = @Vendor
);
GO

-- test the function
SELECT *
FROM dbo.fAssets_Microsoft2('Microsoft');

SELECT *
FROM dbo.fAssets_Microsoft2('Dell');

Unfortunately, the Spark SQL code doesn't seem to work, its execution returning a PARSE_SYNTAX_ERROR error no matter how simple the code was (see also [2]).
 
/* test function's creation in Spark SQL */

-- drop test function 
DROP FUNCTION IF EXISTS fAssets_Microsoft;

-- create test function
CREATE FUNCTION fAssets_Microsoft(
    pVendor string)
RETURNS TABLE
AS 
RETURN 
    SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
    FROM assets
    WHERE Vendor = pVendor;

-- review data
SELECT *
FROM fAssets_Microsoft('Microsoft');

Stored Procedure

Stored procedures aren't available in Spark SQL, though this doesn't mean that we can't test the code in the SQL Endpoint:

/* test procedure's creation in the SQL Endpoint */

-- drop the test procedure 
DROP PROCEDURE IF EXISTS dbo.spAssets_Microsoft2;
GO

-- create the test procedure
CREATE PROCEDURE dbo.spAssets_Microsoft2(
@Vendor nvarchar(max) = NULL)
AS
--Microsoft assets
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets
WHERE Vendor = IsNull(@Vendor, Vendor);
GO

-- test the procedure
EXEC dbo.spAssets_Microsoft2 'Microsoft';
EXEC dbo.spAssets_Microsoft2 'Dell';
EXEC dbo.spAssets_Microsoft2;

Notes:
1) I observed in documentation and some presentations that the common practice of prefixing data objects based on their type is seldom considered. I still find it useful when building solutions, even if the object's type can be derived from the context and/or metadata. 
2) The examples were chosen to test the minimal functionality so that the differences between the two platforms are minimal - using the dbo schema and the GO command in the SQL Endpoint, COMMENT in Spark SQL. However, as soon as specific functionality is used, extra code is needed to mitigate the differences.
3) The names between environments were kept different, just in case one needs to test objects' availability between platforms.

Happy coding!

Resources:
[1] Microsoft Learn (2023) Work with Delta Lake tables in Microsoft Fabric (link)
[2] Databricks (2023) CREATE FUNCTION (SQL and Python) (link)

06 February 2024

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Data Objects Metadata

There seem to be four main sources for learning about the functionality available in the Delta Lake, especially in what concerns the SQL dialect used by Spark SQL: Databricks [1], Delta Lake [2], Azure Databricks [3], respectively the Data Engineering documentation in Microsoft Fabric [4] and the related certification material. Unfortunately, the latter focuses more on PySpark. So, until Microsoft addresses the gap, one can consult the other sources, check what's working and thus build the required knowledge for handling the various tasks. 

First of all, it's important to understand which data objects are available in Microsoft Fabric. Based on [5] I could identify the following hierarchy: metastore > catalog > database (aka schema) > tables, views and functions.

According to the same source [5] the metastore contains all of the metadata that defines data objects in the lakehouse, while the catalog is the highest abstraction in the lakehouse. The database, also called a schema, keeps its standard definition - a collection of data objects, such as tables or views (aka relations) and functions. The tables, views and functions keep their standard definitions. Except for the metastore, these are also the (securable) objects on which permissions can be set. 

One can explore the structure in Spark SQL by using the SHOW command:

-- explore the data objects from the lakehouse
SHOW CATALOGS;

SHOW DATABASES;

SHOW SCHEMAS;

SHOW VIEWS;

SHOW TABLES;

SHOW FUNCTIONS;

Moreover, one can list only the objects belonging to a given parent object (e.g. the tables existing in a database):
 
-- all tables from a database
SHOW TABLES FROM Testing;

-- all tables from a database matching a pattern
SHOW TABLES FROM Testing LIKE 'Asset*';

-- all tables from a database matching multiple patterns
SHOW TABLES FROM Testing LIKE 'Asset*|cit*';

Notes:
1) Same syntax applies for views and functions, respectively for the other objects in respect to their parents (from the hierarchy). 
2) Instead of FROM one can use the IN keyword, though I'm not sure what the difference is, given that they seem to return the same results (see the example after these notes).
3) For databases and tables one can also use the SQL Endpoint to export the related metadata. Unfortunately, it's not the case for views and functions because when the respective objects are created in Spark SQL, they aren't visible over the SQL Endpoint, and vice versa.

4) Given that there are multiple environments that deal with delta tables, to minimize the confusion that might result from their use, it makes sense to use database as term instead of schema. 
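
For example, both forms below are expected to return the same list of tables:

-- FROM vs. IN (same results expected)
SHOW TABLES FROM Testing;
SHOW TABLES IN Testing;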

References:
[1] Databricks (2024) SQL language reference (link)
[2] Delta Lake (2023) Delta Lake documentation (link)
[3] Microsoft Learn (2023) Azure Databricks documentation (link)
[4] Microsoft Learn (2023) Data Engineering documentation in Microsoft Fabric (link)
[5] Databricks (2023) Data objects in the Databricks lakehouse (link)

10 October 2023

SSRS (& Paginated Reports): Products Master Report in Dynamics 365 for Finance & Operations via Base Tables

As mentioned in the previous post, building the Products Master paginated report via the base D365 FO tables synchronized in the Data Lake involves more effort and implies duplicating the logic from the used data entity(-ies). One can attempt recreating the entities 1:1 as they are defined in the D365 database (an Azure SQL Server database), however there are high chances that in certain areas (e.g. financial dimensions) the features used in building the logic are not supported. Moreover, it might be the case that only a subset of the functionality was implemented, with pieces of logic thus becoming obsolete. 

There's also a limitation on the number of tables which can be synchronized, and therefore one needs to compromise. In the dbo.EcoResProductV2Entity there are about 20 tables involved, though probably only half of them are used. The base table dbo.EcoResProduct contains most of the fields available also in the data entity, though the names for the various attributes need to be retrieved from the various tables. 

Some of the issues met in working with data entities appear in this scenario as well: (2) there are fields based on Enums that store only the code and it's needed to map the corresponding values, (4) further transformations are needed (e.g., converting fields to formats). It makes sense to similarly encapsulate the logic in views, though this time one or two additional layers of logic are needed. One can choose to do this for each table in scope, or replicate the data entity's logic by focusing only on the needed parts. Both approaches will prove to be challenging in certain scenarios. Let's consider the latter:

CREATE OR ALTER VIEW TDM.vEcoResProductV2Entity
AS 
/*
name:  Products - Master (base view)
created: 01.04.2021
modified: 01.04.2021
*/
SELECT ITM.ProductType
, CASE ITM.ProductType
    WHEN 1 THEN 'Item'
	WHEN 2 THEN 'Service'
  END ProductTypeName
, CAST ((CASE WHEN ITM.InstanceRelationType = 13678 THEN 1 ELSE 2 END) AS INT) ProductSubtype
, CASE 
    WHEN ITM.InstanceRelationType = 13678 THEN 'Product'
	ELSE 'Product Master'
   END ProductSubtypeName
, ITM.DisplayProductNumber ProductNumber
, Replace(Replace(ILT.Name, char(10), ' '), char(13), ' ')  ProductName
, Replace(Replace(ILT.Description, char(10), ' '), char(13), ' ') ProductDescription
, PCA.CategoryName RetailProductCategoryName 
, PCA.CategoryCode RetailProductCategoryCode
, PDG.Name ProductDimensionGroupName 
, SDG.Name StorageDimensionGroupName 
, TDG.Name TrackingDimensionGroupName 
, ITM.RetailColorGroupId ProductColorGroupId 
, ITM.RetailSizeGroupId ProductSizeGroupId 
, ITM.RetailStyleGroupId ProductStyleGroupId
, ITM.VariantConfigurationTechnology
, CASE ITM.VariantConfigurationTechnology
	WHEN 0 THEN 'None'
	WHEN 1 THEN 'Predefined Variants'
    WHEN 2 THEN 'Dimension Based'
	WHEN 3 THEN 'RuleBased'
	WHEN 4 THEN 'Constraint Based'
  END VariantConfigurationTechnologyName
, CASE WHEN KIT.ProductMaster IS NOT NULL THEN 1 ELSE 0 END IsProductKit
, CASE WHEN KIT.ProductMaster IS NOT NULL  THEN 'Yes' ELSE 'No' END IsProductKitName
, ITM.PDSCWProduct IsCatchWeightproduct
, CASE ITM.PDSCWProduct WHEN 1 THEN 'Yes' ELSE 'No' END IsCatchWeightproductName
, ITM.IsProductVariantUnitConversionEnabled
, CASE ITM.IsProductVariantUnitConversionEnabled WHEN 1 THEN 'Yes' ELSE 'No' END IsProductVariantUnitConversionEnabledName
-- system
, PPD.ProductDimensionGroup ProductDimensionGroupRecId
, PSG.StorageDimensionGroup StorageDimensionGroupRecId
, PTD.TrackingDimensionGroup TrackingDimensionGroupRecId
, PCA.RetailCategoryRecId
, ITM.RecId
, ITM.Partition
FROM dbo.EcoResProduct ITM
     LEFT JOIN dbo.EcoResProductTranslation ILT
	   ON ITM.RecId = ILT.Product 
	  AND ITM.Partition = ILT.Partition 
	  AND ILT.LanguageId = 'en-us'
     LEFT JOIN dbo.EcoResProductDimensionGroupProduct PPD
	   ON ITM.RecId = PPD.Product
	  AND ITM.Partition = PPD.Partition
	      LEFT JOIN dbo.EcoResProductDimensionGroup PDG
		    ON PPD.ProductDimensionGroup = PDG.RecId 
		   AND PPD.Partition = PDG.Partition 
	 LEFT JOIN dbo.EcoResStorageDimensionGroupProduct PSG
	   ON ITM.RecId = PSG.Product
	  AND ITM.Partition = PSG.Partition
	      LEFT JOIN dbo.EcoResStorageDimensionGroup SDG
		    ON PSG.StorageDimensionGroup = SDG.RecId 
		   AND PSG.Partition = SDG.Partition 
	 LEFT JOIN dbo.EcoResTrackingDimensionGroupProduct PTD
	   ON ITM.RecId = PTD.Product
	  AND ITM.Partition = PTD.Partition
	      LEFT JOIN dbo.EcoResTrackingDimensionGroup TDG
		    ON PTD.TrackingDimensionGroup = TDG.RecId 
		   AND PTD.Partition = TDG.Partition 
	 LEFT JOIN (-- product retail category 
	   SELECT PCA.Product 
	   , PCA.Partition 
	   , CAT.Code CategoryCode 
	   , CAT.Name CategoryName
	   , PCA.Category RetailCategoryRecId
	   FROM dbo.EcoResProductCategory PCA
	        JOIN dbo.EcoResProductRetailCategoryHierarchy RCH
			  ON PCA.CategoryHierarchy = RCH.RetailCategoryHierarchy
			 AND PCA.Product = RCH.Product 
			 AND PCA.Partition = RCH.Partition 
			JOIN dbo.EcoResCategory CAT
			  ON PCA.Category = CAT.RecId 
			 AND PCA.Partition = CAT.Partition 
		 ) PCA
	   ON ITM.RecId = PCA.Product 
	  AND ITM.Partition = PCA.Partition 
	  LEFT JOIN dbo.RetailKit KIT
	    ON ITM.RecId = KIT.ProductMaster 
	   AND ITM.Partition = KIT.Partition
WHERE NOT(ITM.InstanceRelationType  =  4211)

As can be seen, the logic is quite complex and only half of the tables were used. There will be entities even more complex than this (the query was restructured for understandability). On the other hand, there will be environments where only half of the above tables will be used (e.g., when no inventory and/or dimensions are needed). 

To test the view, one just needs to change the schema referenced in the logic. The view is created under the TDM (Table Data Model) schema, so there should be no issues as long as the schemas are used correctly. However, when duplicating logic, one should check whether the values match, respectively whether the objects have the same record count:

-- checking whether the values match for an example
SELECT * 
FROM TDM.vEcoResProductV2Entity
WHERE ProductNumber = '0169'

SELECT * 
FROM EDM.vEcoResProductV2Entity
WHERE ProductNumber = '0169'


-- checking the number of records
SELECT count(*)
FROM TDM.vEcoResProductV2Entity

SELECT count(*)
FROM EDM.vEcoResProductV2Entity

There are several optimization or restructuring opportunities in the logic, e.g., the 'product retail category' subquery can be created as a separate view. Also, the groupings of two tables for the 'dimension groups' could be encapsulated individually in views. This can increase the number of views created considerably, though in certain cases the views could prove to be useful for troubleshooting or reuse. 
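
For example, the 'product retail category' subquery could be encapsulated as a helper view along the following lines (the view name is illustrative):

-- product retail category as a separate helper view
CREATE OR ALTER VIEW TDM.vEcoResProductRetailCategory
AS
SELECT PCA.Product 
, PCA.Partition 
, CAT.Code CategoryCode 
, CAT.Name CategoryName
, PCA.Category RetailCategoryRecId
FROM dbo.EcoResProductCategory PCA
     JOIN dbo.EcoResProductRetailCategoryHierarchy RCH
       ON PCA.CategoryHierarchy = RCH.RetailCategoryHierarchy
      AND PCA.Product = RCH.Product 
      AND PCA.Partition = RCH.Partition 
     JOIN dbo.EcoResCategory CAT
       ON PCA.Category = CAT.RecId 
      AND PCA.Partition = CAT.Partition;

The main view could then left join this helper view instead of embedding the subquery.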

Moreover, in environments with only one partition, the constraints on the respective field could be removed (an important decision though), while the value for the retail category could be hardcoded. Even if hardcoding values should in general be avoided, it's needed to compromise when there are more important constraints. 

One way to reduce logic's complexity is to create a "base view" for each table in which the needed transformations are made, respectively only the needed columns are used. Reducing the number of columns simplifies the identification of attributes. For example, the base view for dbo.EcoResProduct could be written as follows:

CREATE OR ALTER VIEW [TDM].[vEcoResProduct]
AS
/*
name: Products - master data (base view)
created: 01.04.2021
modified: 01.04.2021
*/
SELECT PRD.DisplayProductNumber ProductNumber
, PRD.ProductType 
, CASE PRD.ProductType
    WHEN 1 THEN 'Product'
    WHEN 2 THEN 'ProductMaster'
    WHEN 3 THEN 'ProductVariant'
  END ProductTypeName
, CASE WHEN PRD.INSTANCERELATIONTYPE = 15969 THEN 1 ELSE 2 END ProductSubtype 
, CASE 
    WHEN PRD.INSTANCERELATIONTYPE = 15969 THEN 'Item'
    ELSE 'Product master'  
   END ProductSubtypeName
, PRD.RetailSizeGroupId
, PRD.RetailColorGroupId
, PRD.RetailStyleGroupId
, PRD.RetailFlavorGroupId
, PRD.VariantConfigurationTechnology 
, CASE PRD.VariantConfigurationTechnology
    WHEN 0 THEN 'None'
    WHEN 1 THEN 'PredefinedVariants'
    WHEN 2 THEN 'DimensionBased'
    WHEN 3 THEN 'RuleBased'
    WHEN 4 THEN 'ConstraintBased'
  END VariantConfigurationTechnologyName
, PRD.IsProductVariantUnitConversionEnabled
, CASE PRD.IsProductVariantUnitConversionEnabled WHEN 1 THEN 'Yes' ELSE 'No' END IsProductVariantUnitConversionEnabledName
-- system
, PRD.RecId 
, PRD.Partition 
, PRD.ModifiedBy
FROM dbo.EcoResProduct PRD
WHERE NOT(PRD.InstanceRelationType  =  4211)
GO

A similar approach can be used for each table, or at least the tables that need further transformations. There are several choices - ideally one should find a good-enough approach and stick to it. There will be also exceptions, though the general design should prevail!

The bottom line: when using the Export to Data Lake add-in (*), as only the export of tables is supported, the logic can occasionally become complex, though still manageable (in other areas the tables are straightforward to use). When using Link to Data Lake, one can rely on data entities, as long as they are synchronized correctly. However, one still needs to rely on tables to fill the gaps. 

Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data lake to Synapse link. Microsoft advises new customers to use Synapse Link.

Happy coding!

09 October 2023

SSRS (& Paginated Reports): Products Master Report in Dynamics 365 for Finance & Operations via Data Entities

With the Export to Data Lake add-in (*), respectively with Azure Synapse Link for Dataverse with Azure Data Lake, it's possible now to create Power BI paginated reports based directly on the Dynamics 365 Finance & Operations (D365 FO) data model as available in the Data Lake. With the add-in it's possible to export the tables from D365 FO, while with the second option it's possible to export data entities and hopefully soon base tables as well. 

The two features can be used in parallel in the same environment and even if they are based on different mechanisms, they synchronize D365 FO data to the Data Lake in standard data files that can be further consumed via a serverless/dedicated SQL server pool by reporting tools. Thus, one can build a data lakehouse/warehouse on top of the data for reporting. At least this was the approach before Microsoft Fabric. 

So, in the end a developer will have the option to choose between base tables and data entities, as synchronized to the Data Lake. Data entities encapsulate the logic as used by the D365 FO application and can thus allow developing a report with a minimum of overhead, at least in theory. Unfortunately, there are other downsides that make the use of base tables a better approach, at least for certain scenarios (e.g. data entities are not available for synchronization or don't include all needed fields). 

For the developer of reports the synchronization mechanism is less important. A set of database objects will be available, and reports can be built on top of them. Ideally, the base objects will have the same or similar names as in D365 FO, otherwise the differences between the two data models need to be translated in each piece of logic, which adds more complexity and overhead to the development.

Thus, to build a report to show the Product master data the developer can use the dbo.EcoResProductV2Entity data entity. Unfortunately, there are several important issues: (1) the name of the source entity can change as versions change, (2) there are fields based on Enums that store only the code and it's needed to map the corresponding values, (3) some fields from the base table need to be made available, (4) further transformations are needed (e.g., converting fields to formats). Thus, it's useful to encapsulate the logic into a view, when creating views is possible:

CREATE OR ALTER VIEW EDM.vEcoResProductV2Entity
AS 
/*
name:  Products - Master (base entity)
created: 01.04.2021
modified: 01.04.2021
*/
SELECT ITM.ProductType
, CASE ITM.ProductType
    WHEN 1 THEN 'Item'
    WHEN 2 THEN 'Service'
  END ProductTypeName
, ITM.ProductSubtype
, CASE ITM.ProductSubtype
    WHEN 1 THEN 'Product'
    WHEN 2 THEN 'Product Master'
    WHEN 3 THEN 'Product Variant'
   END ProductSubtypeName
, ITM.ProductNumber
, Replace(Replace(ITM.ProductName, char(10), ' '), char(13), ' ')  ProductName
, Replace(Replace(ITM.ProductDescription, char(10), ' '), char(13), ' ') ProductDescription
, ITM.RetailProductCategoryName 
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.ProductColorGroupId 
, ITM.ProductSizeGroupId 
, ITM.ProductStyleGroupId
, ITM.VariantConfigurationTechnology
, CASE ITM.VariantConfigurationTechnology
    WHEN 0 THEN 'None'
    WHEN 1 THEN 'Predefined Variants'
    WHEN 2 THEN 'Dimension Based'
    WHEN 3 THEN 'Rule Based'
    WHEN 4 THEN 'Constraint Based'
  END VariantConfigurationTechnologyName
, ITM.IsProductKit
, CASE ITM.IsProductKit WHEN 1 THEN 'Yes' ELSE 'No' END IsProductKitName
, ITM.IsCatchWeightproduct
, CASE ITM.IsCatchWeightproduct WHEN 1 THEN 'Yes' ELSE 'No' END IsCatchWeightproductName
, ITM.IsProductVariantUnitConversionEnabled
, CASE ITM.IsProductVariantUnitConversionEnabled WHEN 1 THEN 'Yes' ELSE 'No' END IsProductVariantUnitConversionEnabledName
-- system
, ITM.ProductDimensionGroupRecId
, ITM.StorageDimensionGroupRecId
, ITM.TrackingDimensionGroupRecId
, ITM.RetailCategoryRecId
, ITM.RecId
, ITM.Partition
FROM dbo.EcoResProductV2Entity ITM

As can be seen, the view was created in the EDM (Entity Data Model) schema and has the "v" prefix added to the original name. The EDM schema was created to store the objects based on data coming via data entities.

With this, the view's data can be consumed in the paginated report:

--Products - Master
SELECT ITM.ProductTypeName
, ITM.ProductSubtypeName
, ITM.ProductNumber
, ITM.ProductName
, ITM.RetailProductCategoryName 
, ITM.ProductDimensionGroupName 
, ITM.StorageDimensionGroupName 
, ITM.TrackingDimensionGroupName 
, ITM.ProductColorGroupId 
, ITM.ProductSizeGroupId 
, ITM.ProductStyleGroupId
, ITM.VariantConfigurationTechnologyName
, ITM.IsProductKitName
, ITM.IsCatchWeightproductName
FROM EDM.vEcoResProductV2Entity ITM
WHERE ITM.ProductName LIKE 'Mens%'
ORDER BY ITM.ProductNumber

One can use this query directly to generate the report, though in a second step, once the report was created, one might prefer to provide the query as an expression to the corresponding dataset, as this allows better handling of the parameters and thus creating the query dynamically as fit:

= "--Products - Master" & vbCrLf 
& "SELECT ITM.ProductTypeName" & vbCrLf 
& ", ITM.ProductSubtypeName" & vbCrLf 
& ", ITM.ProductNumber" & vbCrLf 
& ", ITM.ProductName" & vbCrLf 
& ", ITM.RetailProductCategoryName " & vbCrLf 
& ", ITM.ProductDimensionGroupName " & vbCrLf 
& ", ITM.StorageDimensionGroupName " & vbCrLf 
& ", ITM.TrackingDimensionGroupName " & vbCrLf 
& ", ITM.ProductColorGroupId " & vbCrLf 
& ", ITM.ProductSizeGroupId " & vbCrLf 
& ", ITM.ProductStyleGroupId" & vbCrLf 
& ", ITM.VariantConfigurationTechnologyName" & vbCrLf 
& ", ITM.IsProductKitName" & vbCrLf 
& ", ITM.IsCatchWeightproductName" & vbCrLf 
& ", ITM.IsProductVariantUnitConversionEnabledName" & vbCrLf 
& "FROM EDM.vEcoResProductV2Entity ITM" & vbCrLf 
& "WHERE 0=0 " & vbCrLf 
& IIf(Parameters!ProductName.Value<> "", " AND ITM.ProductName LIKE @ProductName ", "") & vbcrlf 
& IIf(Parameters!ProductNumber.Value<> "", " AND ITM.ProductNumber LIKE @ProductNumber", "") & vbcrlf 
& IIf(Parameters!ProductType.Value<> "", " 	AND ITM.ProductType = @ProductType", "") & vbcrlf 
& IIf(Parameters!ProductSubtype.Value<> "", " AND ITM.ProductSubtype = @ProductSubtype ", "") & vbcrlf 
& "ORDER BY ITM.ProductNumber" & vbCrLf

The expression can be created using an Excel formula (observe that the empty space after vbCrLf needs to be kept, otherwise the report will fail):
="& """ & A1 & """ & vbCrLf 

The four parameters created for exemplification need to be defined in the dataset accordingly. And here's the report's output:

For Product Number and Name the user can use wildcards - that's what the "*" beside the parameter's Prompt signifies.

Unfortunately, the logic needs to be extended accordingly as soon as further attributes not available in the data entities are needed. In this scenario and several others, using directly the base tables allows more flexibility even if the logic behind the data entity needs to be duplicated.

Notes:
(1) When building and testing the report use only a small subset of the data. 
(2) For the dropdown parameters the "(all)" label was added, which has an empty value. In the expression it allows ignoring the parameter when selected. 
(3) In text values with a high probability of coming from Excel sheets (e.g. Product Name or Description) it might be a good idea to replace char(10) (line feed), char(13) (carriage return), and even char(9) (horizontal tab) values, otherwise the data might not be exported correctly to Excel (see the snippet after these notes).
(4) If the definitions for Enums change, the mappings need to be changed accordingly. 
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data lake to Synapse link. Microsoft advises new customers to use Synapse Link.
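
As a small sketch for note (3), the Replace calls used in the view above can be extended to also handle horizontal tabs:

-- replacing line feed, carriage return and tab characters (based on the entity used above)
SELECT Replace(Replace(Replace(ITM.ProductName, char(10), ' '), char(13), ' '), char(9), ' ') ProductName
FROM dbo.EcoResProductV2Entity ITM;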

Happy coding!

03 March 2023

Data Warehousing: Building a Modern Data Warehouse with Azure Synapse

Data Warehousing

Introduction

When building a data warehouse (DWH) several keywords or derivatives of them appear in requirements: secure, flexible, simple, scalable, reliable, performant, non-redundant, modern, automated, real-time, etc. As practice proves, all these requirements are sometimes challenging to address given the increased complexity of the chosen architecture. There are so many technologies on the DWH market promising all these at low cost, low effort and high ROI, though DWH projects continue to fail to address the business and technical requirements.

On a basic level, building a DWH requires a data storage layer and an ETL (Extract, Transform, Load) tool responsible for the data movement between the various source systems and the DWH, and eventually within the DWH itself. After that, each technology added to the landscape tends to increase the overall complexity (and should be regarded with a critical eye in what concerns its advantages and disadvantages).

Data Warehouse Architecture (on-premise)

A Reference Architecture

When building a DWH or a data migration solution, which has many of the characteristics of a DWH, from the many designs I prefer to keep things as simple as possible. An approach based on a performant database engine like SQL Server as storage layer and SSIS (SQL Server Integration Services) as ETL proved to be the best choice until now, allowing most of the technical requirements to be addressed by design. Then come the choices on how and where to import and transform the data, at what level of granularity, on how the semantic layer is built, how the data are accessed, etc.

Being able to pull (see extract subprocess) the data from the data sources as needed offers the most flexible approach, however there are cases in which direct access to source data is not possible, having to rely on a push approach, where data are dumped regularly to a given location (e.g. an FTP folder structure), to be picked up later as needed. It's actually a hybrid between push and pull, because a fully push approach would mean pushing the data directly to the DWH, which can also be acceptable, though it might offer less control over the data's movement and involve a few other challenges (e.g. permissions, concurrency). 

Data can be prepared for the DWH in the source systems (e.g. exposed via data objects or API calls), anywhere in between via ETL-based transformations (see transform subprocess) or directly in the DWH. I prefer importing the data (see load subprocess) 1:1 without any transformations from the various sources via SSIS (or similar technologies) into a set of tables that designate the staging area. It's true that in this way the ETL technology is used to a minimum, though unless there's a major benefit to use it for data transformations, using the DWH's capabilities and SQL for data processing can provide better performance and flexibility.

Besides the selection of the columns in scope (typically columns with meaningful values), it's important not to do any transformations in the extraction layer because the data is imported faster (eventually using fast load options as in SSIS) and it assures a basis for troubleshooting (as the data don't change between loads). Some filters can be applied only when the volume of data is high, and the subset of the data could be identified clearly (e.g. when data are partitioned based on a key like business unit, legal entity or creation date).

For better traceability, the staging schemas can reflect the systems they come from, while the tables and the columns should have the same names, respectively the same data types. On such tables no constraints are applied and no indexes are needed. They can, however, be constructed on the production tables (aka base tables) - the copies of the tables kept in the production schema. 
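
As a minimal sketch (the source system, table and column names are illustrative), a staging table mirrors the source 1:1 and carries no constraints or indexes, while its production counterpart can be indexed:

-- staging table: schema named after the source system, same names and data types, no constraints/indexes
CREATE TABLE crm.Customers (
    CustomerId int NULL
  , CustomerName nvarchar(100) NULL
  , CountryCode nvarchar(10) NULL
  , ModifiedDate datetime NULL
);
GO

-- production (base) table: same structure, indexed according to the queries run on it
CREATE TABLE dbo.Customers (
    CustomerId int NOT NULL
  , CustomerName nvarchar(100) NULL
  , CountryCode nvarchar(10) NULL
  , ModifiedDate datetime NULL
);
GO

CREATE CLUSTERED INDEX IX_Customers_CustomerId ON dbo.Customers (CustomerId);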

Some DWH architects try replicating the constraints from the source systems and/or add more constraints on top to define the various business rules. Rigor is good in some scenarios, though it can involve a considerable effort and it might be challenging to keep over time, especially when considering the impact of big data on DWH architectures. Instead of using constraints, building a set of SQL scripts that pinpoint the issues as reports allows more flexibility, with the risk of having inconsistencies running wild through the reports. The data should be cleaned in the source system and, where that's not possible, properly addressed in the DWH. Applying constraints will make the data unavailable for reporting until the data are corrected, while being more permissive would allow dirty data. Thus, either case has advantages and disadvantages, though the latter seems to be more appropriate. 

Indexes on the production schema should reflect the characteristics of the queries run on the data and shouldn't replicate the indexes from the source environments, even if some overlaps might exist. In practice, dropping the non-clustered indexes on the production tables before loading the data from staging, and recreating them afterwards proves to provide faster loading (see load optimization techniques). 
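
A minimal sketch of this load pattern, reusing the illustrative tables from above and assuming a non-clustered index IX_Customers_CountryCode:

-- drop the non-clustered indexes before the load
DROP INDEX IF EXISTS IX_Customers_CountryCode ON dbo.Customers;
GO

-- reload the production table from staging
TRUNCATE TABLE dbo.Customers;

INSERT INTO dbo.Customers (CustomerId, CustomerName, CountryCode, ModifiedDate)
SELECT CustomerId, CustomerName, CountryCode, ModifiedDate
FROM crm.Customers;
GO

-- recreate the indexes afterwards
CREATE NONCLUSTERED INDEX IX_Customers_CountryCode ON dbo.Customers (CountryCode);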

The production tables are used for building a "semantic" data model or something similar. Several levels of views, table-valued functions and/or indexed/materialized views allow building the dimension and fact tables, the latter incorporating the business logic needed by the reports. Depending on the case, stored procedures, physical or temporary tables, or table variables can be used to prepare the data, though they tend to break the "free" flow of data as the in-between steps need to be run. On the other side, in certain scenarios their use is unavoidable. 

The first level of views (aka base views) is based on the base tables without any joins, though they include only the fields in use (needed by the business), ordered and "grouped" together based on their importance or certain characteristics. The views can include conversions of data types, translations of codes into meaningful values, and quite seldom filters on the data. Based on these "base" views the second level is built, which attempts to define the dimension and fact tables at the lowest granularity. These views include joins between tables coming from the same or different systems, respectively mappings of values defined in tables, and whatever it takes to build such entities. However, transformations on individual fields are pushed, when possible, to the lower level to minimize logic redundancy. For similar reasons, the logic could be broken down over two or more "helper" views when visible benefits could be obtained from it (e.g. troubleshooting, reuse, maintenance). It's important to balance between creating too many helper views and encapsulating too much logic in a view. 
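
A minimal sketch of the two levels, again with illustrative names: the "base" view keeps only the needed columns and applies conversions and code translations, while the second-level view joins base views to define a dimension (vRegions is another base view, assumed to exist):

-- first level: "base" view over the production table (conversions and code translations only)
CREATE OR ALTER VIEW dbo.vCustomers
AS
SELECT CST.CustomerId
, CST.CustomerName
, CST.CountryCode
, CASE CST.CountryCode
    WHEN 'US' THEN 'United States'
    WHEN 'DE' THEN 'Germany'
    ELSE CST.CountryCode
  END CountryName
, CAST(CST.ModifiedDate AS date) ModifiedDate
FROM dbo.Customers CST;
GO

-- second level: the Customer dimension at the lowest granularity
CREATE OR ALTER VIEW dbo.vDimCustomers
AS
SELECT CST.CustomerId
, CST.CustomerName
, CST.CountryName
, RGN.RegionName
FROM dbo.vCustomers CST
     LEFT JOIN dbo.vRegions RGN
       ON CST.CountryCode = RGN.CountryCode;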

One of the design principles used in building the entities is to minimize the redundancy of the fields used, ideally without having columns duplicated between entities at this level. This facilitates the traceability of columns to the source tables within the "semantic" layer (typically at the cost of a few more joins). In practice, one is forced to replicate some columns to simplify some parts of the logic. 

Further views can be built based on the dimension and fact entities to define the logic needed by the reports. Only these objects are used and no direct references to the "base" tables or views are made. Moreover, to offer better performance, the views can be materialized or, when there's an important benefit, physically saved as tables (e.g. having multiple indexes for different scenarios). It's the case of entities with considerable data volume called over and over. 
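
As a sketch of the materialization option in SQL Server (names are illustrative; indexed views come with several restrictions, e.g. schema binding and referencing only base tables):

-- schema-bound view aggregating a large fact-like table
CREATE VIEW dbo.vSalesByCustomer
WITH SCHEMABINDING
AS
SELECT CustomerId
, SUM(SalesAmount) SalesAmount
, COUNT_BIG(*) RecordCount -- required when GROUP BY is used
FROM dbo.SalesOrderLines
GROUP BY CustomerId;
GO

-- the unique clustered index materializes the view
CREATE UNIQUE CLUSTERED INDEX IX_vSalesByCustomer
ON dbo.vSalesByCustomer (CustomerId);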

This approach of building the entities is usually flexible enough to address most of the reporting requirements, independently of whether the technical solution has the characteristics of a DWH, data mart or data migration layer. Moreover, the overall architectural approach can be used on-premise as well as in cloud architectures, where Azure SQL Server and ADF (Azure Data Factory) provide similar capabilities. Compared with standard SQL Server, some features might not be available, while other features might bring further benefits, though the gaps should be negligible.

Data Management topics like Master Data Management (MDM), Data Quality Management (DQM) and/or Metadata Management can be addressed as well by using third-party tools or tools from the Microsoft stack - Master Data Services (MDS) and Data Quality Services (DQS) in combination with SSIS help addressing a wide range of scenarios - however these are optional. 

Moving to the Cloud

Within the context of big data, characterized by (high/variable) volume, value, variety, velocity, veracity, and further less important V's, the technical requirements mentioned before still apply, however within a cloud environment the overall architecture becomes more complex. Each component becomes a service. There are thus various services for data ingestion, storage, processing, sharing, collaboration, etc. The way data are processed involves also several important transformations: ETL becomes ELT, FTP and local storage are replaced by Data Lakes, data packages by data pipelines, stateful by stateless, SMP (Symmetric Multi-Processing) by MPP (Massively Parallel Processing), and so on.

As file storage is less expensive than database storage, there's an increasing trend of dumping business critical data into the Data Lake via data pipelines or features like Link to Data Lake or Export to Data Lake (*), which synchronize the data between source systems and Data Lake in near real-time at table or entity level. Either saved as csv, parquet, delta lake or any other standard file format, in single files or partitions, the data can be used directly or indirectly for analytics.

Cloud-native warehouses allow addressing topics like scalability, elasticity, fault-tolerance and performance by design, though further challenges appear as compute needs to be decoupled from storage, the workloads need to be estimated for assuring the performance, data may be distributed across data centers spanning geographies, the infrastructure is exposed to attacks, etc. 

Azure Synapse

If one wants to take advantage of the MPP architecture's power, Microsoft provides an analytical architecture based on Azure Synapse, an analytics service that brings together data integration, enterprise DWH, and big data analytics. Besides two types of SQL-based data processing services (dedicated vs serverless SQL pools), it also comes with a Spark pool for in-memory cluster computing.

A DWH based on Azure Synapse is not that different from the reference architecture described above for an on-premise solution. Actually, a DWH based on a dedicated SQL pool (aka a physical data warehouse) involves the same steps mentioned above. 

Data Warehouse Architecture with Dedicated SQL Pool

The data can be imported via ETL/ELT pipelines into the DWH, though there are also mechanisms for consuming the data directly from the files stored in the Data Lake or Azure storage. CETAS (aka Create External Table as Select) can be defined on top of the data files, the external tables acting as "staging" or "base" tables in the architecture described above. When using a dedicated SQL pool it makes sense to use the CETAS as "staging" tables, the processed data then being dumped to "optimized" physical tables for consumption and refreshed periodically. However, when this happens the near real-time character of the data is lost. Using the CETAS as base tables would keep this characteristic as long as the data isn't saved physically in tables or files, maybe at the cost of performance.
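
A hedged sketch of this pattern in a dedicated SQL pool (the external data source, file format, location and table definitions are illustrative and assumed to exist already):

-- external table defined on top of the files in the Data Lake, acting as "staging"
CREATE EXTERNAL TABLE stg.Assets (
    Id int
  , Vendor nvarchar(50)
  , Asset nvarchar(100)
)
WITH (
    LOCATION = '/assets/'
  , DATA_SOURCE = DataLakeStorage      -- external data source, assumed to exist
  , FILE_FORMAT = ParquetFileFormat    -- external file format, assumed to exist
);
GO

-- processed data dumped periodically into an "optimized" physical table for consumption
CREATE TABLE dbo.Assets
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT Id, Vendor, Asset
FROM stg.Assets;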

Using a dedicated SQL pool for direct reporting can become expensive as the pool needs to be available at least during business hours for incoming user requests, or at least for importing the data and refreshing the datasets. When using the CETAS as base tables, a serverless (aka on-demand) SQL pool, which uses a pay-per-use billing model, could prove to be more cost-effective and flexible in many scenarios. By design, it helps to keep the near real-time character of the data. Moreover, even if the data are actually moved from the source tables into the Data Lake, this architecture has the characteristics of a logical data warehouse:

Data Warehouse Architecture with Serverless SQL Pool
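
In a serverless SQL pool the "base" objects can be sketched, for example, as views over the files in the Data Lake (the storage path below is illustrative):

-- "base" view over the parquet files in the Data Lake (serverless SQL pool)
CREATE VIEW dbo.vAssets
AS
SELECT *
FROM OPENROWSET(
    BULK 'https://contosolake.dfs.core.windows.net/files/assets/*.parquet',
    FORMAT = 'PARQUET'
) AS ASD;

Built like this, the view always reflects the latest files, which preserves the near real-time character mentioned above.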

Unfortunately, unless one uses Spark tables, misuses views or adds an Azure SQL database to the architecture, there are no physical tables or materialized views in a serverless SQL pool. There's still the option to use data pipelines for regularly exporting intermediary data to files (incl. over partitions or folders), even if this involves more overhead, as it's not possible to export data over SQL syntax to files more than once (though this might change in the future). For certain scenarios it could be useful to store data in an Azure SQL Server or similar database, including a dedicated SQL pool. 

Choosing between a serverless and a dedicated SQL pool is not an exclusive choice; both or all three types of pools (if we consider also the Spark pool) can be used in the architecture for addressing specific challenges, especially when we consider that there are important differences between the features available in each of the pools. Moreover, one can start the PoC based on the serverless SQL pool and, when the solution becomes mature enough and is used across the enterprise, parts of the logic or all of it can be migrated to a dedicated SQL pool. This would allow saving costs at the beginning at the cost of further effort later. 

Talking about the physical storage, data engineers recommend defining within a Data Lake several layers (aka regions, zones) labeled as bronze, silver and gold (and probably platinum will join the club anytime soon). The bronze layer refers to the raw data available in the Data Lake, including the files on which the initial CETAS are defined. The silver layer refers to transformed, cleaned, enriched and integrated data, the data resulting from the second layer of views described above. The gold layer refers to the data to which business logic was applied and which were prepared for consumption, the data resulting from the final layer of views. Of course, data pipelines can be used to prepare the data at these stages, though a view-based approach offers more flexibility and is easier to troubleshoot, manage and reuse than data pipelines.

Ideally the gold data should involve no or minimal further transformation before reaching the users, though that's not realistic. Building a DWH takes a considerable time and the business can't usually wait until everything is in place. Therefore, reports based on DWH will continue to coexist with reports directly accessing the source data, which will lead to controversies. Enforcing a single source of truth will help to minimize the gap, though will not eliminate it completely. 

Closing Notes

These are just the outlines of a minimal reference architecture. There's more to consider, as there are several alternatives (see [1] [2] [3] [4]) for each of the steps considered here, with each technology, new feature or mechanism opening new opportunities. The advantages and disadvantages should always be considered against the business needs and requirements. One approach, even if recommended, might not work for all, though unless there's an important requirement or an opportunity associated with an additional technology, deviating from reference architectures might not be such a good idea after all.

Note:
(*) Existing customers have until 1-Nov-2024 to transition from Export to Data lake to Synapse link. Microsoft advises new customers to use Synapse Link. 


Resources:
[1] Microsoft Learn (2022) Modern data warehouse for small and medium business (link)
[2] Microsoft Learn (2022) Data warehousing and analytics (link)
[3] Microsoft Learn (2022) Enterprise business intelligence (link)
[4] Microsoft Learn (2022) Serverless Modern Data Warehouse Sample using Azure Synapse Analytics and Power BI (link)
[5] Coursera (2023) Data Warehousing with Microsoft Azure Synapse Analytics (link) [course, free to audit]
[6] SQLBits (2020) Mahesh Balija's Building Modern Data Warehouse with Azure Synapse Analytics (link)
[7] Matt How (2020) The Modern Data Warehouse in Azure: Building with Speed and Agility on Microsoft’s Cloud Platform (Amazon)
[8] James Serra's blog (2022) Data lake architecture (link)
[9] SQL Stijn (2022) SQL Building a Modern Lakehouse Data Warehouse with Azure Synapse Analytics: Moving your Database to the lake (link)
[10] Solliance (2022) Azure Synapse Analytics Workshop 400 (link) [GitHub repository]