26 April 2025

🏭🗒️Microsoft Fabric: Power BI Environments [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 26-Apr-2025

Enterprise Content Publishing [2]

[Microsoft Fabric] Power BI Environments

  • {def} structured spaces within Microsoft Fabric that help organizations manage Power BI assets through their entire lifecycle
  • {environment} development 
    • used to develop the solution
    • accessible only to the development team 
      • via Contributor access
    • {recommendation} use Power BI Desktop as local development environment
      • {benefit} allows trying, exploring, and reviewing updates to reports and datasets
        • once the work is done, upload the new version to the development stage
      • {benefit} enables collaborating and changing dashboards
      • {benefit} avoids duplication 
        • making online changes, downloading the .pbix file, and then uploading it again creates duplicate reports and datasets
    • {recommendation} use version control to keep the .pbix files up to date
      • [OneDrive] use Power BI's autosync
        • {alternative} SharePoint Online with folder synchronization
        • {alternative} GitHub and/or VSTS with local repository & folder synchronization
    • [enterprise scale deployments] 
      • {recommendation} separate dataset development from report and dashboard development
        • use the deployment pipelines selective deploy option [22]
        • create separate .pbix files for datasets and reports [22]
          • create a dataset .pbix file and upload it to the development stage (see shared datasets) [22]
          • create a .pbix file only for the report, and connect it to the published dataset using a live connection [22]
        • {benefit} allows different creators to separately work on modeling and visualizations, and deploy them to production independently
      • {recommendation} separate data model from report and dashboard development
        • allows using advanced capabilities 
          • e.g. source control, merging diff changes, automated processes
        • separate the development from test data sources [1]
          • the development database should be relatively small [1]
    • {recommendation} use only a subset of the data [1] (see the sketch after this list)
      • ⇐ otherwise the data volume can slow down the development [1]
  • {environment} user acceptance testing (UAT)
    • a test environment that sits between development and production in the deployment lifecycle
      • it's not necessary for all Power BI solutions [3]
      • allows testing the solution before deploying it into production
        • all testers must have 
          • View access for testing
          • Contributor access for report authoring
      • involves business users who are SMEs
        • provide approval that the content 
          • is accurate
          • meets requirements
          • can be deployed for wider consumption
    • {recommendation} check the report's load and interactions to find out whether changes impact performance [1]
    • {recommendation} monitor the load on the capacity to catch extreme loads before they reach production [1]
    • {recommendation} test data refresh in the Power BI service regularly during development [20]
  • {environment} production
    • {concept} staged deployment
      • {goal} help minimize risk, user disruption, or address other concerns [3]
        • the deployment involves a smaller group of pilot users who provide feedback [3]
    • {recommendation} set production deployment rules for data sources and parameters defined in the dataset [1]
      • allows ensuring the data in production is always connected and available to users [1]
    • {recommendation} don’t upload a new .pbix version directly to the production stage
      •  ⇐ without going through testing
  • {feature|preview} deployment pipelines 
    • enable creators to develop and test content in the service before it reaches the users [5]
  • {recommendation} build separate databases for development and testing 
    • helps protect production data [1]
  • {recommendation} make sure that the test and production environment have similar characteristics [1]
    • e.g. data volume, usage volume, similar capacity 
    • {warning} testing in production can make production unstable [1]
    • {recommendation} use Azure A capacities [22]
  • {recommendation} for formal projects, consider creating an environment for each phase
  • {recommendation} enable users to connect to published datasets to create their own reports
  • {recommendation} use parameters to store connection details 
    • e.g. instance names, database names
    • ⇐  deployment pipelines allow configuring parameter rules to set specific values for the development, test, and production stages
      • alternatively data source rules can be used to specify a connection string for a given dataset
        • {restriction} in deployment pipelines, this isn't supported for all data sources
  • {recommendation} keep the data in blob storage under 50k blobs and 5 GB of data in total to prevent timeouts [29]
  • {recommendation} provide data to self-service authors from a centralized data warehouse [20]
    • minimizes the amount of work that self-service authors need to take on [20]
  • {recommendation} minimize the use of Excel, csv, and text files as sources when practical [20]
  • {recommendation} store source files in a central location accessible by all coauthors of the Power BI solution [20]
  • {recommendation} be aware of API connectivity issues and limits [20]
  • {recommendation} know how to support SaaS solutions from AppSource and expect further data integration requests [20]
  • {recommendation} minimize the query load on source systems [20]
    • use incremental refresh in Power BI for the dataset(s)
    • use a Power BI dataflow that extracts the data from the source on a schedule
    • reduce the dataset size by only extracting the needed amount of data 
  • {recommendation} expect data refresh operations to take some time [20]
  • {recommendation} use relational database sources when practical [20]
  • {recommendation} make the data easily accessible [20]
  • [knowledge area] knowledge transfer
    • {recommendation} maintain a list of best practices and review it regularly [24]
    • {recommendation} develop a training plan for the various types of users [24]
      • usability training for read-only report/app users [24]
      • self-service reporting for report authors & data analysts [24]
      • more elaborated training for advanced analysts & developers [24]
  • [knowledge area] lifecycle management
    • consists of the processes and practices used to handle content from its creation to its eventual retirement [6]
    • {recommendation} suffix file names with a 3-part version number in the development stage [24]
      • remove the version number when publishing files in UAT and production 
    • {recommendation} backup files for archive 
    • {recommendation} track version history 
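
    As a rough illustration of the recommendations above on limiting the data volume during development, the development data source can expose only a filtered subset of the data, e.g. via a dedicated view in the source database. The sketch below is only an assumption-based example (the dbo.vSalesOrders_Dev view and the Sales.Orders table are hypothetical); the dataset in the development stage would point to such a view, while a deployment pipeline data source or parameter rule switches to the full data in the test and production stages:

    -- hypothetical development view exposing only a subset of the data
    CREATE OR ALTER VIEW dbo.vSalesOrders_Dev
    AS
    SELECT SalesOrderID
    , OrderDate
    , CustomerID
    , TotalDue
    FROM Sales.Orders
    WHERE OrderDate >= DATEADD(year, -2, GETDATE()); -- keep only the last two years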

    References:
    [1] Microsoft Learn (2021) Fabric: Deployment pipelines best practices [link]
    [2] Microsoft Learn (2024) Power BI: Power BI usage scenarios: Enterprise content publishing [link]
    [3] Microsoft Learn (2024) Deploy to Power BI [link]
    [4] Microsoft Learn (2024) Power BI implementation planning: Content lifecycle management [link]
    [5] Microsoft Learn (2024) Introduction to deployment pipelines [link]
    [6] Microsoft Learn (2024) Power BI implementation planning: Content lifecycle management [link]
    [20] Microsoft (2020) Planning a Power BI  Enterprise Deployment [White paper] [link]
    [22] Power BI Docs (2021) Create Power BI Embedded capacity in the Azure portal [link]
    [24] Paul Turley (2019)  A Best Practice Guide and Checklist for Power BI Projects

    Resources:

    Acronyms:
    API - Application Programming Interface
    CLM - Content Lifecycle Management
    COE - Center of Excellence
    SaaS - Software-as-a-Service
    SME - Subject Matter Expert
    UAT - User Acceptance Testing
    VSTS - Visual Studio Team System

    25 April 2025

    🏭🗒️Microsoft Fabric: Dataflows Gen2's Incremental Refresh [Notes] 🆕

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 25-Apr-2025

    [Microsoft Fabric] Incremental Refresh in Dataflows Gen2

    • {feature} enables incrementally extracting data from data sources, applying Power Query transformations, and loading the data into various output destinations [5]
      • designed to reduce the amount of data that needs to be processed and retrieved from the source system [8]
      • configurable directly in the dataflow editor [8]
      • there's no need to specify the historical data range [8]
        • ⇐ the dataflow doesn't remove any data from the destination that's outside the bucket range [8]
      • there's no need to specify the parameters for the incremental refresh [8]
        • the filters and parameters are automatically added as the last step in the query [8]
    • {prerequisite} the data source 
      • supports folding [8]
      • needs to contain a Date/DateTime column that can be used to filter the data [8]
    • {prerequisite} the data destination supports incremental refresh [8]
      • available destinations
        • Fabric Warehouse
        • Azure SQL Database
        • Azure Synapse Analytics
        • Fabric Lakehouse [preview]
      • other destinations can be used in combination with incremental refresh by using a second query that references the staged data to update the data destination [8]
        • allows using incremental refresh to reduce the amount of data that needs to be processed and retrieved from the source system [8]
          • a full refresh from the staged data to the data destination is still needed [8]
    • works by dividing the data into buckets based on a DateTime column [8] (see the sketch after this list)
      • each bucket contains the data that changed since the last refresh [8]
        • the dataflow knows what changed by checking the maximum value in the specified column 
          • if the maximum value changed for that bucket, the dataflow retrieves the whole bucket and replaces the data in the destination [8]
          • if the maximum value didn't change, the dataflow doesn't retrieve any data [8]
    • {limitation} the data destination must be set to a fixed schema [8]
      • ⇒ the table's schema in the data destination must be fixed and can't change [8]
        • ⇒ a dynamic schema must be changed to a fixed schema before configuring incremental refresh [8]
    • {limitation} the only supported update method in the data destination: replace
      • ⇒ the dataflow replaces the data for each bucket in the data destination with the new data [8]
        • data that is outside the bucket range isn't affected [8]
    • {limitation} maximum number of buckets
      • single query: 50
        • {workaround} increase the bucket size or reduce the bucket range to lower the number of buckets [8]
      • whole dataflow: 150
        • {workaround} reduce the number of incremental refresh queries or increase the bucket size [8]
    • {downside} the dataflow may take longer to refresh after enabling incremental refresh [8]
      • because the additional overhead of checking if data changed and processing the buckets is higher than the time saved by processing less data [8]
      • {recommendation} review the settings for incremental refresh and adjust them to better fit the scenario
        • {option} increase the bucket size to reduce the number of buckets and the overhead of processing them [8]
        • {option} reduce the number of buckets by increasing the bucket size [8]
        • {option} disable incremental refresh [8]
    • {recommendation} don't use the column for detecting changes also for filtering [8]
      • because this can lead to unexpected results [8]
    • {setting} limit the number of concurrent evaluations
      • setting the value to a lower number reduces the number of requests sent to the source system [8]
      • via global settings >> Scale tab >> maximum number of parallel query evaluations
      • {recommendation} don't enable this limit unless there are issues with the source system [8]
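
    To make the bucketing mechanism described above more tangible, the queries folded to the source conceptually resemble the following rough SQL sketch (the dbo.Orders table, the ModifiedDate column and the bucket boundaries are placeholder assumptions; the actual folded queries are generated by the dataflow engine):

    -- change detection for one bucket: compare the maximum value
    -- of the DateTime column with the one from the previous refresh
    SELECT MAX(ModifiedDate) MaxModifiedDate
    FROM dbo.Orders
    WHERE ModifiedDate >= '2025-03-01'
      AND ModifiedDate < '2025-04-01';

    -- if the maximum changed, the whole bucket is retrieved again
    -- and replaces the corresponding data in the destination
    SELECT *
    FROM dbo.Orders
    WHERE ModifiedDate >= '2025-03-01'
      AND ModifiedDate < '2025-04-01';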

    References:
    [5] Microsoft Learn (2023) Fabric: Save a draft of your dataflow [link]
    [8] Microsoft Learn (2025) Fabric: Incremental refresh in Dataflow Gen2 [link]

    Resources:


    💫🗒️ERP Systems: Microsoft Dynamics 365's Business Process Catalog (BPC) [Notes]

    Disclaimer: This is work in progress intended to consolidate information from the various sources and not to provide a complete overview of all the features. Please refer to the documentation for a complete overview!

    Last updated: 25-Apr-2025

    Business Process Catalog - End-to-End Scenarios

    [Dynamics 365] Business Process Catalog (BPC)

    • {def} lists of end-to-end processes that are commonly used to manage or support work within an organization [1]
      • agnostic catalog of business processes contained within the entire D365 solution space [3]
        • {benefit} efficiency and time savings [3]
        • {benefit} best practices [3]
        • {benefit} reduced risk [3]
        • {benefit} technology alignment [3]
        • {benefit} scalability [3]
        • {benefit} cross-industry applicability [3]
      • stored in an Excel workbook
        • used to organize and prioritize the work on the business process documentation [1]
        • {recommendation} check the latest versions (see [R1])
      • assigns unique IDs to 
        • {concept} end-to-end scenario
          • describe in business terms 
            • not in terms of software technology
          • includes the high-level products and features that map to the process [3]
          • covers two or more business process areas
          • {purpose} map products and features to benefits that can be understood in business contexts [3]
        • {concept} business process areas
          • combination of business language and basic D365 terminology [3]
          • groups business processes for easier searching and navigation [1]
          • separated by major job functions or departments in an organization [1]
          • {purpose} map concepts to benefits that can be understood in business context [3]
          • more than 90 business process areas defined [1]
        • {concept} business processes
          • a series of structured activities and tasks that organizations use to achieve specific goals and objectives [3]
            • efficiency and productivity
            • consistency and quality
            • cost reduction
            • risk management
            • scalability
            • data-driven decision-making
          • a set of tasks in a sequence that is completed to achieve a specific objective [5]
            • define when each step is done in the implementation [5] [3]
            • define how many are needed [5] [3]
          • covers a wide range of structured, often sequenced, activities or tasks to achieve a predetermined organizational goal
          • can refer to the cumulative effects of all steps progressing toward a business goal
          • describes a function or process that D365 supports
            • more than 700 business processes identified
            • {goal} provide a single entry point with links to relevant product-specific content [1]
          • {concept} business process guide
            • provides documentation on the structure and patterns of the process along with guidance on how to use them in a process-oriented implementation [3]
            • based on a catalog of business process supported by D365 [3]
          • {concept} process steps 
            • represented sequentially, top to bottom
              • can include hyperlinks to the product documentation [5] 
              • {recommendation} avoid back and forth in the steps as much as possible [5]
            • can be
              • forms used in D365 [5]
              • steps completed in LCS, PPAC, Azure or other Microsoft products [5]
              • steps that are done outside the system (incl. third-party system) [5]
              • steps that are done manually [5]
            • are not 
              • product documentation [5]
              • a list of each click to perform a task [5]
          • {concept} process states
            • include
              • project phase 
                • e.g. strategize, initialize, develop, prepare, operate
              • configuration 
                • e.g. base, foundation, optional
              • process type
                • e.g. configuration, operational
        • {concept} patterns
          • repeatable configurations that support a specific business process [1]
            • specific way of setting up D365 to achieve an objective [1]
            • address specific challenges in implementations and are based on a specific scenario or best practice [6]
            • the solution is embedded into the application [6]
            • includes high-level process steps [6]
          • include the most common use cases, scenarios, and industries [1]
          • {goal} provide a baseline for implementations
            • more than 2000 patterns, and we expect that number to grow significantly over time [1]
          • {activity} naming a new pattern
            • starts with a verb
            • describes a process
            • includes product names
            • indicates the industry
            • indicates AppSource products
        • {concept} reference architecture 
          • acts as a core architecture with a common solution that applies to many scenarios [6]
          • typically used for integrations to external solutions [6]
          • must include an architecture diagram [6]
      • {concept} process governance
        • {benefit} improved quality
        • {benefit} enhanced decision making
        • {benefit} agility and adaptability
        • {benefit} Sbd alignment
        • {goal} enhance efficiency 
        • {goal} ensure compliance 
        • {goal} facilitate accountability 
        • {concept} policy
        • {concept} procedure
        • {concept} control
      • {concept} scope definition
        • {recommendation} avoid replicating current processes without considering future needs [4]
          • {risk} replicating processes in the new system without re-evaluating and optimizing [4] 
          • {impact} missed opportunities for process improvement [4]
        • {recommendation} align processes with overarching business goals rather than the limitations of the current system [4]
      • {concept} guidance hub
        • a central landing spot for D365 guidance and tools
        • contains cross-application documentation
    • {purpose} provide considerations and best practices for implementation [6]
    • {purpose} provide technical information for implementation [6]
    • {purpose} provide link to product documentation to achieve the tasks in scope [6]

    References:
    [1] Microsoft Learn (2024) Dynamics 365: Overview of end-to-end scenarios and business processes in Dynamics 365 [link]
    [2] Microsoft Dynamics 365 Community (2023) Business Process Guides - Business Process Guides [link]
    [3] Microsoft Dynamics 365 Community (2024) Business Process Catalog and Guidance - Part 2 Introduction to Business Processes [link]
    [4] Microsoft Dynamics 365 Community (2024) Business Process Catalog and Guidance - Part 3: Using the Business Process Catalog to Manage Project Scope and Estimation [link]
    [5] Microsoft Dynamics 365 Community (2024) Business Process Catalog and Guidance - Part 4: Authoring Business Processes [link]
    [6] Microsoft Dynamics 365 Community (2024) Business Process Catalog and Guidance - Part 5:  Authoring Business Processes Patterns and Use Cases [link]
    [7] Microsoft Dynamics 365 Community (2024) Business Process Catalog and Guidance  - Part 6: Conducting Process-Centric Discovery [link]
    [8] Microsoft Dynamics 365 Community (2024) Business Process Catalog and Guidance  - Part 7: Introduction to Process Governance [link]

    Resources:
    [R1] GitHub (2024) Business Process Catalog [link]
    [R2] Microsoft Learn (2024) Dynamics 365 guidance documentation and other resources [link]
    [R3] Dynamics 365 Blog (2025) Process, meet product: The business process catalog for Dynamics 365 [link]

    Acronyms:
    3T - Tools, Techniques, Tips
    ADO - 
    BPC - Business Process Catalog
    D365 - Dynamics 365
    LCS - Lifecycle Services
    PPAC - Power Platform admin center
    RFI - Request for Information
    RFP - Request for Proposal

    24 April 2025

    🧭Business Intelligence: Perspectives (Part XXX: The Data Science Connection)

    Business Intelligence Series

    Data Science is a collection of quantitative and qualitative methods, respectively techniques, algorithms, principles, processes and technologies used to analyze and process amounts of raw and aggregated data in order to extract the information or knowledge they contain. Its theoretical basis is rooted within mathematics, mainly statistics, computer science and domain expertise, though it can include further aspects related to communication, management, sociology, ecology, cybernetics, and probably many other fields, as there's enough space for experimentation and translation of knowledge from one field to another.  

    The aim of Data Science is to extract valuable insights from data to support decision-making and problem-solving, and to drive innovation - and it can probably achieve even more in time. Reading between the lines, Data Science sounds like a superhero that can solve all the problems existing out there, which frankly is too beautiful to be true! In theory everything is possible, while in practice there are many hard limitations! Given any amount of data, the knowledge that can be obtained from it can be limited by many factors: the degree to which the data, processes and models built reflect reality (and there can be many levels of approximation), respectively the degree to which such data can be collected consistently. 

    Moreover, even if the theoretical basis seems sound, the data, information or knowledge which is not available can be the important missing link in making any sensible progress toward the goals set in Data Science projects. In some cases, one might be aware of what's missing, though for a data scientist who lacks the required domain knowledge, this can be a hard limit! This gap can probably be bridged with sensemaking, exploration and experimentation approaches, especially by applying models from other domains, though there are no guarantees ahead!

    AI can help in this direction by utilizing its capacity to explore ideas or models fast. However, it's questionable how much the models built with AI can be further used if one can't build mechanistic mental models of the processes reflected in the data. It's like devising an algorithm for winning small amounts at the lottery: investing more money in the algorithm doesn't automatically imply greater wins. Even if occasionally the performance is improved, it's questionable how much it can be leveraged each time it's used. Statistics has its utility when one studies data in aggregation and can predict average behavior; it can't be used to predict the occurrence of individual events with high precision. Think how hard the prediction of earthquakes or extreme weather is by just looking at a pile of data reflecting what's happening only in a certain zone!

    In theory, the more data one has from different geographical areas or organizations, the more robust the models can become. However, no two geographies, respectively no two organizations are alike: business models, the people, the events and other aspects make global models less applicable to a local context. Frankly, one has more chances of progress if a model is built with a local scope and then leveraged for a broader scope. Even then, there can be differences between the behavior or phenomena at the micro, respectively the macro level (see the laws of physics). 

    This doesn’t mean that Data Science or AI related knowledge is useless. The knowledge accumulated by applying various techniques, models and programming languages in problem-solving can be more valuable than the results obtained! Experimentation is a must for organizations to innovate, to extend their knowledge base. It’s also questionable how much of the respective knowledge can be retained and put to good use. In the end, each organization must determine this by itself!

    💎🏭SQL Reloaded: Microsoft Fabric's Lakehouses at Work (Part I: Proof-of-Concept)


    Introduction

    One way to work with the data files existing in an organization is to import them into a lakehouse and build a data model based on them that can be reused in the various solutions (incl. Power BI). As a reminder, a lakehouse is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location.

    The tutorials available on lakehouses are pretty useful for getting an idea of how to start, though probably each seasoned professional has his/her own way of doing things, at least for testing the capabilities before building a proper solution. The target is thus to create the minimum needed for testing the capabilities within a proof-of-concept solution. 

    The files used in this post are available on Microsoft's GitHub. Download the files and store them to be easily available for the next steps. The following files were considered for the current post: customers.csv, orders.csv and products.csv.

    Create the Lakehouse

    It's recommended to have a naming convention in place for the various items created in a workspace, e.g. a combination between item type (lakehouse, warehouse), system type (Prod, UAT, Dev, PoC) and possibly the department (e.g. FIN, SCM, HR, etc.). One should try to balance between meaning and usefulness. Ideally, one should use 2 to maximum 4 letters for each part encoded in the name. For example, the following scripts were created in the LH_SCM_PoC lakehouse. 

    More complex naming conventions can include the system (e.g. D365, CRM, EBS) or the company. The target is to easily identify the systems, independently of how complex the rules are. Given that it can become challenging to rename the schemas afterwards, the naming convention should ideally be available from the early stages. 

    Create the Schema

    A lakehouse comes with a dbo schema available by default, though it's recommended to create meaningful schema(s) as needed. The schemas should ideally reflect the domain of the data (e.g. departments or other key areas) and shouldn't change when the objects are deployed between the different environments. Depending on the case, one should consider creating multiple schemas that reflect the model's key areas. The names should be simple and suggestive.

    -- create schema
    CREATE SCHEMA Orders;
    

    Create a Staging Area

    The next step is to create a staging area where the files in scope can be made available and then further loaded into the lakehouse. One needs to compromise between creating a deep hierarchical structure that reflects the business structure and the need to easily identify, respectively manage the files. A hierarchical structure with 1-2 levels could provide the needed compromise, though each additional level tends to increase the complexity. 

    One should also consider rules for archiving or even deleting the files.

    Upload the Files

    Microsoft Fabric allows users to upload multiple files together in a single step. Ideally the files should have proper names for each column, otherwise the overhead deriving from this may appear later in the process. 

    When the files are available in multiple folders in a complex hierarchical structure, a set of shortcuts could help in their management.

    Load the Data

    A file's data can be loaded into the lakehouse on the fly by providing a valid table name:
    Files >> SCM_Orders >> (select file) >> Load to Tables >> new table >> Load file to new table >> (provide information) >> Load



    Of course, the table's name must be unique within the schema and the further properties must match the file's structure. 

    One should consider loading first a couple of tables, performing a rough validation of the data imported, and only after that importing the remaining tables. This allows identifying the issues that typically lead to reimports of the data (wrong formatting, invalid column names, duplicated files, etc.) or to rework.
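
    For the rough validation mentioned above, even a few simple record counts run against the SQL analytics endpoint can show whether the files were loaded as expected. A minimal sketch, assuming the three files were loaded as customers, orders and products into the Orders schema (as used in the queries below):

    -- rough validation: compare the record counts against the source files
    SELECT 'customers' TableName, count(*) RecordCount FROM orders.customers
    UNION ALL
    SELECT 'orders', count(*) FROM orders.orders
    UNION ALL
    SELECT 'products', count(*) FROM orders.products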

    If the files have different characteristics (e.g. delimiters, number of attributes/records, special data types), one should consider this in the initial scope and have at least one example from each category. 

    Review the Metadata

    Once the files are made available, there's a tendency to start directly with the development without analyzing the data or, equally important, the metadata available. To review the metadata of the newly created tables, one can use the objects from the standard INFORMATION_SCHEMA (see post):

    -- retrieve the list of tables
    SELECT *
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_SCHEMA = 'orders'
    ORDER BY TABLE_SCHEMA

    Further on, one can review columns' definition:
     
    -- retrieve column metadata
    SELECT TABLE_CATALOG
    , TABLE_SCHEMA
    , TABLE_NAME
    , COLUMN_NAME
    , ORDINAL_POSITION
    , DATA_TYPE
    , CHARACTER_MAXIMUM_LENGTH
    , NUMERIC_PRECISION
    , NUMERIC_SCALE
    , DATETIME_PRECISION
    , CHARACTER_SET_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = 'orders'
    ORDER BY ORDINAL_POSITION

    It's a good idea to save the metadata to a file and use it later for reviews, respectively for metadata management, when no other solution is in place for the same (e.g. Purview). That's useful also for the people with limited or no access to the workspace. 

    Alternatively, one can use a notebook with the following SQL-based commands: 

    %%sql
    
    DESCRIBE TABLE LH_SCM_PoC.orders.sales;
    
    DESCRIBE TABLE EXTENDED LH_SCM_PoC.orders.sales;
    

    One can even provide meaningful descriptions for each table and its attributes via scripts like the ones below, however it might be a good idea to do this in the later phases of the PoC, when the logic becomes stable:

    %%sql
    
    -- modify a table's COMMENT
    COMMENT ON TABLE LH_SCM_PoC.orders.sales IS 'Customer sales orders';
    
    -- modify columns' COMMENT for an existing table
    ALTER TABLE LH_SCM_PoC.orders.sales
    ALTER COLUMN SalesOrderNumber COMMENT 'Sales Order Number';
    

    Data Validation

    Before diving into building any business logic, besides identifying the primary keys, foreign keys and further attributes used in bringing the data together, it's recommended to get an overview of the data's intrinsic and extrinsic characteristics relevant to the analysis. Some of the rules typically used for studying the quality of data apply here to some extent as well, though one needs to prioritize accordingly, otherwise one replicates the effort that's typically part of the Data Management initiatives. 

    In addition, it's important to check how much the identified issues impact the business logic, respectively whether the issues can be corrected to match the expectations. Often, no logic can compensate for major data quality issues, and this can also affect the PoC's results as soon as the outcomes are validated against the expectations! 
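
    A minimal sketch for such intrinsic checks, looking at the completeness (non-null counts) and cardinality (distinct counts) of the attributes expected to act as keys (the columns reflect the orders.csv file used further below; which columns and thresholds matter depends on the requirements):

    -- completeness and cardinality of the attributes used as keys
    SELECT count(*) RecordCount
    , count(SalesOrderID) SalesOrderIDCount
    , count(DISTINCT SalesOrderID) SalesOrderIDDistinctCount
    , count(CustomerID) CustomerIDCount
    , count(DISTINCT CustomerID) CustomerIDDistinctCount
    , count(ProductID) ProductIDCount
    , count(DISTINCT ProductID) ProductIDDistinctCount
    FROM orders.orders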

    Data Understanding 

    Further on, it makes sense to get a high-level understanding of the data by looking at the distribution of values, respectively at the records participating in the joins. Of course, more similar queries can be built, though again, one should try to focus on the most important aspects!

    The analysis could for example consider the following points:

    /* validation of Products */
    
    -- review duplicated product names (should be 0 records)
    SELECT ProductName
    , count(*) RecordCount
    FROM orders.products
    GROUP BY ProductName
    HAVING count(*)>1
    
    -- review most (in)expensive products
    SELECT top 100 ProductID
    , ProductName
    , Category
    , ListPrice 
    FROM orders.products
    ORDER BY ListPrice DESC --ASC
    
    -- review category distribution
    SELECT Category
    , count(*) RecordCount 
    FROM orders.products
    GROUP BY Category
    ORDER BY RecordCount DESC
    
    -- review price ranges (by number of digits)
    SELECT Len(floor(ListPrice)) RangeCount
    , count(*) RecordCount 
    FROM orders.products
    GROUP BY Len(floor(ListPrice)) 
    ORDER BY RangeCount DESC
    
    /* validation of Customers */
    
    -- duplicated email address 
    SELECT CST.CustomerID
    , CST.FirstName
    , CST.LastName 
    , CST.EmailAddress 
    , DUP.RecordCount
    FROM (-- duplicates
    	SELECT EmailAddress
    	, count(*) RecordCount 
    	FROM orders.customers 
    	GROUP BY EmailAddress 
    	HAVING count(*)>1
    	) DUP
    	JOIN orders.customers CST
    	   ON DUP.EmailAddress = CST.EmailAddress
    ORDER BY DUP.RecordCount DESC
    , DUP.EmailAddress 
    
    -- duplicated Customer names (not necessarily duplicates)
    SELECT CST.CustomerID
    , CST.FirstName
    , CST.LastName 
    , CST.EmailAddress 
    , DUP.RecordCount
    FROM (-- duplicates
    	SELECT FirstName
    	, LastName
    	, count(*) RecordCount 
    	FROM orders.customers 
    	GROUP BY FirstName
    	, LastName 
    	HAVING count(*)>1
    	) DUP
    	JOIN orders.customers CST
    	   ON DUP.FirstName = CST.FirstName
          AND DUP.LastName = CST.LastName
    ORDER BY DUP.RecordCount DESC
    , DUP.FirstName
    , DUP.LastName
    
    /* validation of Orders */
    
    -- review a typical order
    SELECT SalesOrderID
    , OrderDate
    , CustomerID
    , LineItem
    , ProductID
    , OrderQty
    , LineItemTotal
    FROM orders.orders
    WHERE SalesOrderID = 71780
    ORDER BY SalesOrderID 
    , LineItem
    
    -- review orders' distribution by month
    SELECT Year(OrderDate) Year
    , Month(OrderDate) Month
    , count(*) RecordCount
    FROM orders.orders
    GROUP BY Year(OrderDate) 
    , Month(OrderDate) 
    ORDER BY Year
    , Month
    
    -- checking for duplicates
    SELECT SalesOrderID
    , LineItem
    , count(*) RecordCount
    FROM orders.orders ord 
    GROUP BY SalesOrderID
    , LineItem
    HAVING count(*)>1
    
    -- checking for biggest orders
    SELECT SalesOrderID
    , count(*) RecordCount
    FROM orders.orders ord 
    GROUP BY SalesOrderID
    HAVING count(*) > 10
    ORDER BY RecordCount DESC
    
    -- checking for most purchased products
    SELECT ProductID
    , count(*) NoRecords
    FROM orders.orders ord 
    GROUP BY ProductID
    HAVING count(*) > 8
    ORDER BY NoRecords DESC
    
    -- checking for most active customers
    SELECT CustomerID
    , count(*) RecordCount
    FROM orders.orders ord 
    GROUP BY CustomerID
    HAVING count(*) > 10
    ORDER BY RecordCount DESC
    
    /* join checks */
    
    -- Orders without Product (should be 0)
    SELECT count(*) RecordCount
    FROM orders.orders ord 
    	 LEFT JOIN orders.products prd
    	   ON ord.ProductID = prd.ProductID
    WHERE prd.ProductID IS NULL
    
    -- Orders without Customer (should be 0)
    SELECT count(*) RecordCount
    FROM orders.orders ORD 
    	 LEFT JOIN orders.customers CST
    	   ON ORD.CustomerID = CST.CustomerID
    WHERE CST.CustomerID IS NULL
    
    -- Products without Orders (153 records)
    SELECT count(*) RecordCount
    FROM orders.products prd
    	 LEFT JOIN orders.orders ord 
    	   ON prd.ProductID = ord.ProductID 
    WHERE ord.ProductID IS NULL
    
    
    -- Customers without Orders (815 records)
    SELECT count(*) RecordCount
    FROM orders.customers CST
    	 LEFT JOIN orders.orders ORD
    	   ON ORD.CustomerID = CST.CustomerID
    WHERE ORD.CustomerID IS NULL
    

    The more tables are involved, the more complex the validation logic can become. One should focus on the most important aspects.

    Building the Logic

    Once one has an acceptable understanding of the data entities involved and the relation between them, it's time to build the needed business logic by joining the various tables at the various levels of detail. One can focus on the minimum required, respectively attempt to build a general model that can address a broader set of requirements. For the PoC it's usually recommended to start small by addressing the immediate requirements, though some flexibility might be needed for exploring the data and preparing the logic for a broader set of requirements. Independently of the scope, one should consider a set of validations. 

    Usually, it makes sense to encapsulate the logic in several views or table-valued functions that reflect the logic for the main purposes and which allow a high degree of reuse (see [1]). Of course, one can use the standard approach for modelling the bronze, silver, respectively the gold layers adopted by many professionals. For a PoC, even if  that's not mandatory, it might still be a good idea to make steps in the respective direction. 

    In this case, dealing with only three tables - a fact table and two dimension tables - there are several perspectives that can be built:

    a) all records from fact table + dimension records

    The following view provides the lowest level of detail for the fact table, allowing thus to look at the data from different perspectives, as long as the focus is only on the values used in the Sales Orders:

    -- create the view
    CREATE OR ALTER VIEW orders.vSalesOrders
    -- Sales Orders with Product & Customer information
    AS
    SELECT ORD.SalesOrderID
    , ORD.OrderDate
    , ORD.CustomerID
    , CST.FirstName 
    , CST.LastName
    , CST.EmailAddress
    , ORD.LineItem
    , ORD.ProductID
    , PRD.ProductName 
    , PRD.Category
    , ORD.OrderQty
    , ORD.LineItemTotal
    , PRD.ListPrice 
    , ORD.OrderQty * PRD.ListPrice ListPriceTotal
    FROM orders.orders ORD 
    	 JOIN orders.products PRD
    	   ON ORD.ProductID = PRD.ProductID
    	 JOIN orders.customers CST
    	   ON ORD.CustomerID = CST.CustomerID
    
    -- test the view   
    SELECT *
    FROM orders.vSalesOrders
    WHERE SalesOrderID = 71780
    

    One can use inner joins as long as the referenced dimension records are always available; otherwise, outer joins should be considered.  
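
    As a sketch, the same view rewritten with outer joins keeps the order lines even when a matching product or customer record is missing (only a few attributes are shown; the dimension attributes would then need to be handled for NULLs):

    -- outer join variant (keeps order lines without a matching Product or Customer)
    SELECT ORD.SalesOrderID
    , ORD.OrderDate
    , ORD.CustomerID
    , CST.EmailAddress
    , ORD.ProductID
    , PRD.ProductName
    , ORD.OrderQty
    , ORD.LineItemTotal
    FROM orders.orders ORD
         LEFT JOIN orders.products PRD
           ON ORD.ProductID = PRD.ProductID
         LEFT JOIN orders.customers CST
           ON ORD.CustomerID = CST.CustomerID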

    b) aggregated data for all dimension combinations

    The previous view allows aggregating the data at various levels of detail:

    -- Sales volume by Customer & Product
    SELECT ORD.EmailAddress
    , ORD.ProductName 
    , ORD.Category
    , SUM(ORD.OrderQty) OrderQty
    , SUM(ORD.LineItemTotal) LineItemTotal
    FROM orders.vSalesOrders ORD 
    WHERE ORD.OrderDate >= '2022-06-01'
      AND ORD.OrderDate < '2022-07-01'
    GROUP BY ORD.EmailAddress
    , ORD.ProductName 
    , ORD.Category
    ORDER BY ORD.EmailAddress
    , ORD.ProductName 
    

    One can comment out the dimensions not needed. The query can be included in a view as well. 
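
    Similarly, the data can be aggregated over time, e.g. by month; a small sketch based on the same view:

    -- Sales volume by month
    SELECT Year(ORD.OrderDate) Year
    , Month(ORD.OrderDate) Month
    , SUM(ORD.OrderQty) OrderQty
    , SUM(ORD.LineItemTotal) LineItemTotal
    FROM orders.vSalesOrders ORD
    GROUP BY Year(ORD.OrderDate)
    , Month(ORD.OrderDate)
    ORDER BY Year
    , Month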

    c) all records from each dimension table + aggregated fact records

    Sometimes, it's useful to look at the data from a dimension's perspective, though it might be needed to create such an object for each dimension, as in the examples below. For maximum flexibility, the logic can be included in a table-valued function:

    -- create the user-defined function
    CREATE OR ALTER FUNCTION orders.tvfProductsSalesVolume(
        @StartDate date = NULL,
        @EndDate date = NULL
    )
    RETURNS TABLE
    -- Sales volume by Product
    AS
    RETURN (
    SELECT PRD.ProductID
    , PRD.ProductName 
    , PRD.Category
    , ORD.FirstOrderDate
    , ORD.LastOrderDate 
    , IsNull(ORD.TotalSalesQty, 0) TotalSalesQty 
    , IsNull(ORD.TotalSalesValue, 0) TotalSalesValue
    , IsNull(ORD.OrderCount, 0) OrderCount
    , IsNull(ORD.LineCount, 0) LineCount
    FROM orders.products PRD
         OUTER APPLY (
    		SELECT Min(ORD.OrderDate) FirstOrderDate
    		, Max(ORD.OrderDate) LastOrderDate 
    		, SUM(ORD.OrderQty) TotalSalesQty
    		, SUM(ORD.LineItemTotal) TotalSalesValue
    		, count(DISTINCT SalesOrderID) OrderCount
    		, count(*) LineCount
    		FROM orders.orders ORD 
    		WHERE ORD.ProductID = PRD.ProductID
    		  AND ORD.OrderDate >= @StartDate 
    		  AND ORD.OrderDate < @EndDate 
    	 ) ORD
    );
    
    -- test the user-defined function
    SELECT *
    FROM orders.tvfProductsSalesVolume('2022-06-01','2022-07-01') PRD
    WHERE TotalSalesValue <> 0
    ORDER BY TotalSalesValue DESC
    , LastOrderDate DESC
    
    
    -- create the user-defined function
    CREATE OR ALTER FUNCTION orders.tvfCustomersSalesVolume(
        @StartDate date = NULL,
        @EndDate date = NULL
    )
    RETURNS TABLE
    -- Sales volume by Customer
    AS
    RETURN (
    SELECT CST.CustomerID
    , CST.FirstName 
    , CST.LastName
    , CST.EmailAddress
    , ORD.FirstOrderDate
    , ORD.LastOrderDate 
    , IsNull(ORD.TotalSalesValue, 0) TotalSalesValue
    , IsNull(ORD.OrderCount, 0) OrderCount
    , IsNull(ORD.LineCount, 0) LineCount
    FROM orders.customers CST
         OUTER APPLY (
    		SELECT Min(ORD.OrderDate) FirstOrderDate
    		, Max(ORD.OrderDate) LastOrderDate 
    		, SUM(ORD.LineItemTotal) TotalSalesValue
    		, count(DISTINCT SalesOrderID) OrderCount
    		, count(*) LineCount
    		FROM orders.orders ORD 
    		WHERE ORD.CustomerID = CST.CustomerID
    		  AND ORD.OrderDate >= @StartDate 
    		  AND ORD.OrderDate < @EndDate 
    	 ) ORD
    );
    
    -- test the user-defined function
    SELECT *
    FROM orders.tvfCustomersSalesVolume('2022-06-01','2022-07-01') PRD
    WHERE TotalSalesValue <> 0
    ORDER BY TotalSalesValue DESC
    , LastOrderDate DESC
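
    Such table-valued functions can then be reused in further queries; e.g. a quick sketch for the top 10 customers by sales value in the same period:

    -- top 10 customers by sales value for the given period
    SELECT TOP 10 CST.CustomerID
    , CST.FirstName
    , CST.LastName
    , CST.TotalSalesValue
    , CST.OrderCount
    FROM orders.tvfCustomersSalesVolume('2022-06-01','2022-07-01') CST
    ORDER BY CST.TotalSalesValue DESC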
    

    When restructuring the queries in similar ways, there's always a compromise between the various factors: (re)usability, performance or completeness. 

    Further Comments

    The above database objects should allow users to address most of the requirements, though, as usual, there can also be exceptions, especially when the data needs to be aggregated at a different level of detail that requires the query to be structured differently.

    The number of perspectives can also increase with the number of fact tables used to model a certain entity (e.g. Sales order headers vs. lines).

    In theory, one can also find ways to automate the process of creating database objects, though one must choose the relevant attributes, respectively include logic that makes sense only within a certain perspective. 

    No matter the data, respectively the systems used as source, expect surprises and test your assumptions! For example, in the file used to create the orders.customers table, there seem to be duplicated entities with the same name and email address. One must clarify how such entities must be handled in data analysis, respectively in data modeling. For example, a person can appear twice because of the roles associated with the name, or for other legitimate reasons. 

    The files in scope of this post are small compared with the files existing in organizations. In many scenarios, file sizes can range from GBs to TBs, thus requiring partitioning and various other strategies. 


    References
    [1] sql-troubles (2023) Architecture Part IV: Building a Modern Data Warehouse with Azure Synapse [link]

    Resources
    [1] Microsoft Learn (2024) Fabric: Lakehouse and Delta Lake tables [link]
