Showing posts with label Microsoft Fabric. Show all posts
Showing posts with label Microsoft Fabric. Show all posts

07 August 2024

Business Intelligence: Data Modeling (Part II: From Data to Data Models)

Business Intelligence Series
Business Intelligence Series

A data model can be defined as an abstract, self-contained, logical definition of the data structures available in a database or similar repositories. It’s typically an abstraction of the data structures underpinning a set of processes, procedures and business logic used for a predefined purpose. A data model can be formed also of unrelated micromodels, depicting thus various aspects of a business. 

The association between data and data models is bidirectional. Given a set of data, a data model can be built to underpin the respective data. Conversely, one can create or generate data based on a data model. However, in business setups a bidirectional relationship between data and the data model(s) underpinning them is more realistic as the business evolves. In extremis, the data model can be used to reflect a business’ needs, at least when the respective needs are addressed accordingly by extending the data model(s).

Given a set of data (e.g. the data stored in one or more spreadsheets or other type of files) there can be defined in theory multiple data models to reflect the respective data. Within a data model, the fields (aka attributes) are partitioned into a set of data entities, where a data entity is thus a nonunique grouping of attributes that attempt to define together one unitary aspect of the world. Customers, Vendors, Products, Invoices or Sales Orders are examples of such data entities, though entities can have a broader granularity (e.g. Customers can be modeled over several tables like Entity, Addresses, Contact information, etc.). 

From an operational database’s perspective, a data entity is based on one or more tables, though several entities can share some of the tables. From a BI artifact’s perspective, an entity should be easy to create from the underlying tables, with a minimal set of transformations. Ideally, the BI data model should be as close as possible to the needed entity for reporting, however an optimal solution lies usually somewhere in between. In this resides the complexity of modeling BI solutions – providing an optimal data model which can be easily built on the source tables, and which allows addressing all or at least most of the BI requirements.

In other words, we deal with two optimization problems of two distinct data models. On one side the business data model must be flexible enough to provide fast read/write operations while keeping the referential data’s granularity efficient. Conversely, a BI data model needs to abstract these entities and provide a fast way of processing the data, while making data reads extremely efficient. These perspectives must apply when we move to Microsoft Fabric too. 

The operational data layer must provide this abstraction, and in this resides the complexity of building optimal BI solutions. This is the layer at which the modeling problems need to be tackled. The challenge of BI and Analytics resides in finding an optimal data model that allows us to address most or ideally all the BI requirements. Several overlapping layers of abstraction may be built in the process.

Looking at the data modeling techniques used in notebooks and other similar solutions, data modeling has the chance of becoming a redundant practice prone to errors. Moreover, data models have a tendency of being multilayered and of being based on certain perspectives into the processes they model. Providing reliable flexible models involves finding the right view into the data for modeling aspects of the business. Database views allow us to easily model such perspectives, often in a unique way. Moving away from them just shifts the burden on the multiple solutions built around the base data, which can create other important challenges. 

Previous Post <<||>> Next Post

13 June 2024

Business Intelligence: One Person Can’t Learn or Do Everything in Microsoft Fabric

Business Intelligence Series
Business Intelligence Series

Today’s Explicit Measures webcast [1] considered an article written by Kurt Buhler (The Data Goblins): [Microsoft] "Fabric is a Team Sport: One Person Can’t Learn or Do Everything" [2]. It’s a well-written article that deserves some thought as there are several important points made. I can’t say I agree with the full extent of some statements, even if some disagreements are probably just a matter of semantics.

My main disagreement starts with the title “One Person Can’t Learn or Do Everything”. As clarified in webcast's chat, the author defines “everything" as an umbrella for “all the capabilities and experiences that comprise Fabric including both technical (like Power BI) or non-technical (like adoption data literacy) and everything in between” [1].

For me “everything” is relative and considers a domain's core set of knowledge, while "expertise" (≠ "mastery") refers to the degree to which a person can use the respective knowledge to build back-to-back solutions for a given area. I’d say that it becomes more and more challenging for beginners or average data professionals to cover the core features. Moreover, I’d separate the non-technical skills because then one will also need to consider topics like Data, Project, Information or Knowledge Management.

There are different levels of expertise, and they can vary in depth (specialization) or breadth (covering multiple areas), respectively depend on previous experience (whether one worked with similar technologies). Usually, there’s a minimum of requirements that need to be covered for being considered as expert (e.g. certification, building a solution from beginning to the end, troubleshooting, performance optimization, etc.). It’s also challenging to roughly define when one’s expertise starts (or ends), as there are different perspectives on the topics. 

Conversely, the term expert is in general misused extensively, sometimes even with a mischievous intent. As “expert” is usually considered an external consultant or a person who got certified in an area, even if the person may not be able to build solutions that address a customer’s needs. 

Even data professionals with many years of experience can be overwhelmed by the volume of knowledge, especially when one considers the different experiences available in MF, respectively the volume of new features released monthly. Conversely, expertise can be considered in respect to only one or more MF experiences or for one area within a certain layer. Lot of the knowledge can be transported from other areas – writing SQL and complex database objects, modelling (enterprise) semantic layers, programming in Python, R or Power Query, building data pipelines, managing SQL databases, etc. 

Besides the standard documentation, training sessions, and some reference architectures, Microsoft made available also some labs and other material, which helps discovering the features available, though it doesn’t teach people how to build complete solutions. I find more important than declaring explicitly the role-based audience, the creation of learning paths for the various roles.

During the past 6-7 months I've spent on average 2 days per week learning MF topics. My problem is not the documentation but the lack of maturity of some features, the gaps in functionality, identifying the respective gaps, knowing what and when new features will be made available. The fact that features are made available or changed while learning makes the process more challenging. 

My goal is to be able to provide back-to-back solutions and I believe that’s possible, even if I might not consider all the experiences available. During the past 22 years, at least until MF, I could build complete BI solutions starting from requirements elicitation, data extraction, modeling and processing for data consumption, respectively data consumption for the various purposes. At least this was the journey of a Software Engineer into the world of data. 

References:
[1] Explicit Measures (2024) Power BI tips Ep.328: Microsoft Fabric is a Team Sport (link)
[2] Data Goblins (2024) Fabric is a Team Sport: One Person Can’t Learn or Do Everything (link)

11 June 2024

Business Intelligence: Is Microsoft Fabric Ready? (Part I: Some Random Thoughts)

Business Intelligence Series
Business Intelligence Series

When writing a Business Case, besides the problem and solution(s) high-level descriptions, is important to rougly estimate how much it costs, how long it takes, respectively how many resources are needed and for what activities. A proof-of-concept (PoC) might not need an explicit business case, though the same high-level information is needed at least for the planning of resources and a formal approval.

Given that there are several analytical experiences in Microsoft Fabric (MF), it’s clear that can’t be anymore a reference architecture that can be recommended for customers. Frankly, that ship has sailed even since the introduction of Microsoft Synapse, if not earlier, with the movement to the cloud. Also, there’s no one size fits all as certain building blocks make sense only in certain scenarios (e.g. organization scale, data volume or source’s type). Moreover, even if MF has been generally available for quite some time, customers and service providers ask themselves whether the available features are enough for building analytics solutions based on it. 

“Is Fabric Ready?” was the topic of today’s Explicit Measures webcast [1]. Probably the answer is as usual “it depends” and the general recommendation is to do a PoC to check solution's feasibility. Conversely, MF may be the best approach to consider if integration with other systems (e.g. Dynamics 365, Dataverse) is needed. 

What the customers need are some rough realistic estimates they can base any planning upon (at least for a PoC if not for the whole project) in terms of making the data available into OneLake, building a semantic model, respectively processing and making the data available for consumption. Ideally, one needs a translation of the various steps as done earlier. For example, how long it takes to make the data available in OneLake, how long it takes to move the data physically or logically though the various layers, to build semantic models, etc. 

Probably, some things can be achieved in a matter of days, at least if one knows what one’s doing. However, we are talking here about a new architecture that may resemble for some of an unknow territory. Even if old and new techniques can be mixed, there are further implications or improvements that can be considered. There are many webcasts, blog posts and other material on how to do things, on what’s possible, though building a functioning solution from beginning to the end, even as PoC, requires more than putting all this together. 

Just making the data flow from point A to B or C is not enough - data security, data governance and a few other topics like scalability and availability need to be considered as well. Security and governance are also the areas in which probably more features must be considered. For many customers starting now with MF, the hope is that most of these features will be available during the time the solutions are ready for production.

From a cost perspective, there’s the cost of data at rest, in transit, the licensing for MF and the other components involved. Ideally, one should start small and increase capacities as needed, though small can vary from case to case, while it’s important to find out the optimum. Starting in the middle could be an alternative approach even if may involve higher costs. If one starts small, the costs for PoC can be neglectable, though sooner or later a compromise is needed to provide an acceptable performance. 

In terms of human resources, the topic is more complex (see [2]), and it depends largely on the nature of the project. The pool of skillsets is the most important constraint or enabler such projects can have.

Previous Post <<||>> Next Post

References:
[1] Explicit Measures (2024) Power BI tips Ep.327: Is Fabric Ready? (link)
[2] Explicit Measures (2024) Power BI tips Ep.321: Building and BI Team (link)

23 May 2024

Microsoft Fabric: Domains (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!
Last updated: 29-May-2024

Domains & Entities

Domains

  • {definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]
    • associated with workspaces 
      • {benefit} allows to group data into business domains [1]
      • all the items in the workspace are then associated with the domain, and they receive a domain attribute as part of their metadata [1]
      • {benefit} enables a better consumption experience [1]
        • simplify discovery and consumption 
  • provide a management boundary between tenant and workspace enabling domain admins to have more granular control over multiple workspaces [6]
    • some tenant-level settings for managing and governing data can be delegated to the domain level [2]
  • [security] domain roles
    • Fabric admins (or higher)
      • can create and edit domains
      • can specify domain admins and contributors
      • can associate workspaces with domains [4]
      • can see, edit, and delete all domains in the admin portal [4]
      • domain admins
    • business owners or experts of a domain
      • can update the domain description
      • can define contributors
      • can associate workspaces with the domain [4]
      • can define and update the domain image
      • can override tenant settings for any specific settings the tenant admin has delegated to the domain level [4]
      • can't delete the domain, change the domain name, or add/delete other domain admins
      • can only see and edit the domains they're admins of.
    • domain contributors
      • ⇐ must be a workspace admin
      • can associate their workspaces with a domain or change the current domain association
      • don’t have access to the Domains page in the admin portal
    • domain users
      • can share lakehouse with other domain users without giving access to workspace and other artifacts [4]
  • {concept} default domain
    • a domain that has been specified as the default domain for specific users and/or security groups [3]
      • ⇒ when these users/security groups create/update a new/unassigned workspace, that workspace will automatically be assigned to that domain [3]
      • ⇒ generally automatically become domain contributors of the workspaces that are assigned in this manner [3]
  • {feature} subdomains
    • a way for fine tuning the logical grouping data under a domain [1]
      • subdivisions of a domain
      • only one level is supported in the hierarchy
    • visible as part of the domains filter and as part of the item location path
    • no setup available
      • [planned] some domain settings will be added to subdomains as well

Acronyms:
MF - Microsoft Fabric

Resources:
[1] Microsoft Learn (2023) Administer Microsoft Fabric (link)
[2] Microsoft Learn - Fabric (2024) Governance overview and guidance (link)
[3] Microsoft Learn: Fabric (2023) Fabric domains (link)
[4] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam (link
[5] Microsoft Fabric Updates Blog (2024) Easily implement data mesh architecture with domains in Fabric, by Naama Tsafrir 
(link)
[6] Microsoft (2024) Microsoft Fabric Domains – Data Mesh [with Naama Tsafrir & Assaf Shemesh]


07 May 2024

Microsoft Fabric: The Metrics Layer (Notes) [new feature]

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 07-May-2024

The Metrics Layer in Microsoft Fabric (adapted diagram)
The Metrics Layer in Microsoft Fabric (adapted diagram)

[new feature] Metrics Layer (Metrics Store)

  • {definition}an abstraction layer available between the data store(s) and end users which allows organizations to create standardized business metrics, that are rooted in measures and are discoverable and intended for reuse
    • ⇐ {important} feature still in private preview 
  • {goal} extend existing infrastructure 
    • {benefit} leverages and extends existing features
  • {goal} provide consistent definitions and descriptions [1]
    • consistent definitions that include besides business logic additional dimensions and filters [1]
    • ⇒ {benefit} allows to standardize the metrics across the organization
    • ⇒ {benefit} enforce to enforce a SSoT
  • {goal} easy management 
    • via management views 
    • [feature] lineage 
    • [feature] source control
    • [feature] duplicate identification
    • [feature] push updates to downstream uses of the metrics 
  • {goal}searchable and discoverable metrics 
    • {feature} integration
      • based on Sempy fabric package
        • ⇐ a dataframe for storage and propagation of Power BI metadata which is part of the python-based semantic Link in Fabric
  • {goal}trust
    • [feature] trust indicators
    • {benefit} facilitates report's adoption
  • {feature} metric set 
    • {definition} a Fabric item that groups together a set of metrics into a mini-model
    • {benefit} allows to reduce the overall complexity of semantic models, while being easy to evolve and consume
    • associated with a single domain
      • ⇒ supports the data mesh architecture
    • shareable 
      • can be shared with other users
    • {action} create metric set
      • creates the actual artifact, to which metrics can be added 
  • {feature} metric
    •  {definition} a way to elevate the measures from the various semantic models existing in the organization
    • tied to the original semantic model
      • ⇒ {benefit} allows to see how a metric is used across the solutions 
    • reusable
      • can be reused in other fabric artifacts
        • new reports on the Power BI service
        • notebooks 
          • by copying the code
      • can be reused in Power BI
        • via OneLake data hub menu element
      • can be chained 
        • changes are propagated downstream 
    • materializable 
      • its output can be persisted to OneLake by saving it a delta table into a lakehouse
      • {misuse} data is persisted unnecessarily
    • {action} elevate metric
      • copies measure's definition and description
      • ⇒  implies restructuring, refactoring, moving, and testing a lot of code in the process
      • {misuse} data professionals build everything as metrics
    • {action} update metric
    • {action} add filters to metric 
    • {action} add dimensions to metric
    • {action} materialize metric 

Acronyms:
SSoT - single source of truth ()

References:
[1] Power BI Tips (2024) Explicit Measures Ep. 236: Metrics Hub, Hot New Feature with Carly Newsome (link)
[2] Power BI Tips (2024) Introducing Fabric Metrics Layer / Power Metrics Hub [with Carly Newsome] (link)

06 May 2024

Microsoft Fabric: The Metrics Layer [new feature]

Introduction

One of the announcements of this year's Microsoft Fabric Community first conference was the introduction of a metrics layer in Fabric which "allows organizations to create standardized business metrics, that are rooted in measures and are discoverable and intended for reuse" [1]. As it seems, the information content provided at the conference was kept to a minimum given that the feature is still in private preview, though several webcasts start to catch up on the topic (see [2], [4]). Moreover, as part of their show, the Explicit Measures (@PowerBITips) hosts had Carly Newsome as invitee, the manager of the project, who unveiled more details about the project and the feature, details which became the main source for the information below. 

The idea of a metric layer or metric store is not new, data professionals occasionally refer to their structure(s) of metrics as such. The terms gained weight in their modern conception relatively recently in 2021-2022 (see [5], [6], [7], [8], [10]). Within the modern data stack, a metrics layer or metric store is an abstraction layer available between the data store(s) and end users. It allows to centrally define, store, and manage business metrics. Thus, it allows us to standardize and enforce a single source of truth (SSoT), respectively solve several issues existing in the data stacks. As Benn Stancil earlier remarked, the metrics layer is one of the missing pieces from the modern data stack (see [10]).

Microsoft's Solution

Microsoft's business case for metrics layer's implementation is based on three main ideas (1) duplicate measures contribute to poor data quality, (2) complex data models hinder self-service, (3) reduce data silos in Power BI. In Microsoft's conception the metric layer provides several benefits: consistent definitions and descriptions, easy management via management views, searchable and discoverable metrics, respectively assure trust through indicators. 

For this feature's implementation Microsoft introduces a new Fabric Item called a metric set that allows to group several (business) metrics together as part of a mini-model that can be tailored to the needs of a subset of end-users and accessed by them via the standard tools already available. The metric set becomes thus a mini-model. Such mini-models allow to break down and reduce the overall complexity of semantic models, while being easy to evolve and consume. The challenge will become then on how to break down existing and future semantic models into nonoverlapping mini-models, creating in extremis a partition (see the Lego metaphor for data products). The idea of mini-models is not new, [12] advocating the idea of using a Master Model, a technique for creating derivative tabular models based on a single tabular solution.

A (business) metric is a way to elevate the measures from the various semantic models existing in the organization within the mini-model defined by the metric set. A metric can be reused in other fabric artifacts - currently in new reports on the Power BI service, respectively in notebooks by copying the code. Reusing metrics in other measures can mean that one can chain metrics and the changes made will be further propagated downstream. 

The Metrics Layer in Microsoft Fabric (adapted diagram)
The Metrics Layer in Microsoft Fabric (adapted diagram)

Every metric is tied to the original semantic model which allows thus to track how a metric is used across the solutions and, looking forward to Purview, to identify data's lineage. A measure is related to a "table", the source from which the measure came from.

Users' Perspective

The Metrics Layer feature is available in Microsoft Fabric service for Power BI within the Metrics menu element next to Scorecards. One starts by creating a metric set in an existing workspace, an operation which creates the actual artifact, to which the individual metrics are added. To create a metric, a user with build permissions can navigate through the semantic models across different workspaces he/she has access to, pick a measure from one of them and elevate it to a metric, copying in the process its measure's definition and description. In this way the metric will always point back to the measure from the semantic model, while the metrics thus created are considered as a related collection and can be shared around accordingly. 

Once a metric is added to the metric set, one can add in edit mode dimensions to it (e.g. Date, Category, Product Id, etc.). One can then further explore a metric's output and add filters (e.g. concentrate on only one product or category) point from which one can slice-and-dice the data as needed.

There is a panel where one can see where the metric has been used (e.g. in reports, scorecards, and other integrations), when was last time refreshed, respectively how many times was used. Thus, one has the most important information in one place, which is great for developers as well as for the users. Probably, other metadata will be added, such as whether an increase in the metric would be favorable or unfavorable (like in Tableau Pulse, see [13]) or maybe levels of criticality, an unit of measure, or maybe its type - simple metric, performance indicator (PI), result indicator (RI), KPI, KRI etc.

Metrics can be persisted to the OneLake by saving their output to a delta table into the lakehouse. As demonstrated in the presentation(s), with just a copy-paste and a small piece of code one can materialize the data into a lakehouse delta table, from where the data can be reused as needed. Hopefully, the process will be further automated. 

One can consume metrics and metrics sets also in Power BI Desktop, where a new menu element called Metric sets was added under the OneLake data hub, which can be used to connect to a metric set from a Semantic model and select the metrics needed for the project. 

Tapping into the available Power BI solutions is done via an integration feature based on Sempy fabric package, a dataframe for storage and propagation of Power BI metadata which is part of the python-based semantic Link in Fabric [11].

Further Thoughts

When dealing with a new feature, a natural idea comes to mind: what challenges does the feature involve, respectively how can it be misused? Given that the metrics layer can be built within a workspace and that it can tap into the existing measures, this means that one can built on the existing infrastructure. However, this can imply restructuring, refactoring, moving, and testing a lot of code in the process, hopefully with minimal implications for the solutions already available. Whether the process is as simple as imagined is another story. As misusage, in extremis, data professionals might start building everything as metrics, though the danger might come when the data is persisted unnecessarily. 

From a data mesh's perspective, a metric set is associated with a domain, though there will be metrics and data common to multiple domains. Moreover, a mini-model has the potential of becoming a data product. Distributing the logic across multiple workspaces and domains can add further challenges, especially in what concerns the synchronization and implemented of requirements in a way that doesn't lead to bottlenecks. But this is a general challenge for the development team(s). 

The feature will probably suffer further changes until is released in public review (probably by September or the end of the year). I subscribe to other data professionals' opinion that the feature was for long needed and that can have an important impact on the solutions built. 

Previous Post <<||>> Next Post

Resources:
[1] Microsoft Fabric Blog (2024) Announcements from the Microsoft Fabric Community Conference (link)
[2] Power BI Tips (2024) Explicit Measures Ep. 236: Metrics Hub, Hot New Feature with Carly Newsome (link)
[3] Power BI Tips (2024) Introducing Fabric Metrics Layer / Power Metrics Hub [with Carly Newsome] (link)
[4] KratosBI (2024) Fabric Fridays: Metrics Layer Conspiracy Theories #40 (link)
[5] Chris Webb's BI Blog (2022) Is Power BI A Semantic Layer? (link)
[6] The Data Stack Show (2022) TDSS 95: How the Metrics Layer Bridges the Gap Between Data & Business with Nick Handel of Transform (link)
[7] Sundeep Teki (2022) The Metric Layer & how it fits into the Modern Data Stack (link)
[8] Nick Handel (2021) A brief history of the metrics store (link)
[9] Aurimas (2022) The Jungle of Metrics Layers and its Invisible Elephant (link)
[10] Benn Stancil (2021) The missing piece of the modern data stack (link)
[11] Microsoft Learn (2024) Sempy fabric Package (link)
[12] Michael Kovalsky (2019) Master Model: Creating Derivative Tabular Models (link)
[13] Christina Obry (2023) The Power of a Metrics Layer - and How Your Organization Can Benefit From It (link

18 April 2024

Data Warehousing: Data(base) Mirroring in Microsoft Fabric (New feature)

Data Warehousing
Data Warehousing Series

Microsoft recently announced [4] the preview of a new Fabric feature called Mirroring, a low-cost, low-latency fully managed service that allows to replicate data from various systems together into OneLake [1]. Currently only Azure SQL Database, Azure Cosmos DB, and Snowflake are supported, though probably more database vendors will be targeted soon. 

For Microsoft Fabric's data engineers, data scientists and data warehouse professionals this feature is huge as importance because they don't need to care anymore about making the data available in Microsoft Fabric, which involves a considerable amount of work. 

Usually, at least for flexibility, transparence, performance and standardization, data professionals prefer to extract the data 1:1 from the source systems into a landing zone in the data warehouse or data/delta lake from where the data are further processed as needed. One data pipeline is thus built for every table in scope, which sometimes is a 10–15-minute effort per table, when the process is standardized, though upon case the effort is much higher if troubleshooting (e.g. data type incompatibility or support) or further logic changes are involved. Maintaining such data pipelines can prove to be costly over time, especially when periodic changes are needed. 

Microsoft lists other downsides of the ETL approach - restricted access to data changes, friction between people, processes, and technology, respectively the effort needed to create the pipelines, and the time needed for importing the data [1]. There's some truth is each of these points, though everything is relative. For big tables, however, refreshing all the data overnight can prove to be time-consuming and costly, especially when the data don't lie within the same region, respectively data center. Unless the data can be refreshed incrementally, the night runs can extend into the day, will all the implications that derive from this - not having actual data, which decreases the trust in reports, etc. There are tricks to speed up the process, though there are limits to what can be done. 

With mirroring, the replication of data between data sources and the analytics platform is handled in the background, after an initial replication, the changes in the source systems being reflected with a near real-time latency into OneLake, which is amazing! This allows building near real-time reporting solutions which can help the business in many ways - reviewing (and correcting in the data source) records en masse, faster overview of what's happening in the organizations, faster basis for decision-making, etc. Moreover, the mechanism is fully managed by Microsoft, which is thus responsible for making sure that the data are correctly synchronized. Only from this perspective 10-20% from the effort of building an analytics solution is probably reduced.

Mirroring in Microsoft Fabric
Mirroring in Microsoft Fabric (adapted after [2])

According to the documentation, one can replicate a whole database or choose individual regular tables (currently views aren't supported [3]), stop, restart, or remove a table from a mirroring. Moreover, through sharing, users can grant to other users or groups of users access to a mirrored database without giving access to the workspace and the rest of its items [1]. 

The data professionals and citizens can write then cross-database queries against the mirrored databases, warehouses, and the SQL analytics endpoints of lakehouses, combining data from all these sources into a single T-SQL query, which opens lot of opportunities especially in what concerns the creation of an enterprise semantic model, which should be differentiated from the semantic model created by default by the mirroring together with the SQL analytics endpoint.

Considering that the data is replicated into delta tables, one can take advantage of all the capabilities available with such tables - data versioning, time travel, interoperability and/or performance, respectively direct consumption in Power BI.

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn - Microsoft Fabric (2024) What is Mirroring in Fabric? (link)
[2] Microsoft Learn - Microsoft Fabric (2024) Mirroring Azure SQL Database [Preview] (link)
[3] Microsoft Learn - Microsoft Fabric (2024) Frequently asked questions for Mirroring Azure SQL Database in Microsoft Fabric [Preview] (link)
[4] Microsoft Fabric Updates Blog (2024) Announcing the Public Preview of Mirroring in Microsoft Fabric, by Charles Webb (link)

Microsoft Fabric: Mirroring (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources.
Last updated: 18-Apr-2024

Mirroring in Microsoft Fabric
Mirroring in Microsoft Fabric (adapted after [2])

Mirroring

  • low-cost, low-latency fully managed service that allows to replicate data from various systems together into OneLake [1]
    • supported in Azure SQL Database, Azure Cosmos DB, Snowflake [1]
    • ⇒ replaces ETL pipelines
    • ⇐ manages the replication of data into OneLake and conversion to Parquet [1]
    • changes have a near real-time latency [3]
    • uses the source database’s CDC feature [4]
  • {benefit} allows users to use a highly integrated, end-to-end, and easy-to-use service [1]
    • the mirrored data can be used in Power BI
      • ⇐ since tables are all v-ordered delta tables [3]
  • {benefit} the delta tables can then be used in every Fabric experience, allowing users to accelerate their journey into Fabric [1]
    • {restriction} replication of views is currently not supported [3]
  • {benefit} allows to write cross-database queries against the mirrored databases, warehouses, and the SQL analytics endpoints of Lakehouses in a single T-SQL query [1]
  • landing zone in OneLake 
    • stores both the snapshot and change data [3]
      • ⇐ improves performance when converting files into delta verti-parquet [3]
  • {operation} enable the mirroring 
    • by creating a secure connection to the operational data source
    • an entire database or individual tables can be replicated 
    • creates three items in the targeted workspace:
      • the mirrored database
      • a SQL analytics endpoint
      • a Default semantic model
    • the same source database can be mirrored multiple times
      • ⇐ though usually not needed
        • scenario: replicating data to different types of environments
  • {operation} stopping the mirroring
    • stops the replication in the source database, but a copy of the tables is kept in OneLake [3]
  • {operation} restarting the mirroring 
    • results in all data being replicated from the start [3]
  • {operation} remove a table from mirroring
    • the table is no longer replicated and its data is deleted from OneLake [3]
  • {operation} sharing
    • users grant other users or groups of users access to a mirrored database without giving access to the workspace and the rest of its items [1]
      • access is also granted to the SQL analytics endpoint and associated default semantic model [1]
      • shared mirrored databases can be found through either
        • Data Hub
        • Shared with Me section in Microsoft Fabric
    • triggers an initial replication
  • {requirement} [licensing] requires Power BI Premium, Fabric Capacity, or Trial Capacity
  • {feature} monitoring
    • {benefit} allows to gain insights into mirroring operations and when the replica in Fabric OneLake was last refreshed [4]
  • [Azure SQL Database] 
    • {restriction} doesn't support Azure SQL Database logical servers behind an AVN or private networking [2]
      • {requirement} update the Azure SQL logical server firewall rules to Allow public network access [2]
      • {requirement} enable the Allow Azure services option to connect to your Azure SQL Database logical server [2]
    • {restriction}
      •  access through the Power BI Gateway or behind a firewall is unsupported [3]
    • authentication
      • SQL authentication with user name and password
      • Microsoft Entra ID
      • Service Principal
    • {troubleshooting} check if the changes properly flow
      • via sys.dm_change_feed_log_scan_sessions DMV
    • {troubleshooting} check if there are any problems reported
      • via sys.dm_change_feed_errors DMV
    • {troubleshooting} check if the mirroring was properly enabled
      • via sp_help_change_feed stored procedure 
    • {troubleshooting} disable mirroring
      • via sp_change_feed_disable_db stored procedure
Acronyms:
AVN - Azure Virtual Network
CDC - Change Data Capture
ETL - Extract, Transfer, Load
DMV - Dynamic Management View

References:
[1] Microsoft Learn - Microsoft Fabric (2024) What is Mirroring in Fabric? (link)
[2] Microsoft Learn - Microsoft Fabric (2024) Mirroring Azure SQL Database [Preview] (link)
[3] Microsoft Learn - Microsoft Fabric (2024) Frequently asked questions for Mirroring Azure SQL Database in Microsoft Fabric [Preview] (link)
[4] Microsoft Fabric Updates Blog (2023) (link)

Resources:
[1] Tutorial: Configure Microsoft Fabric mirrored databases from Azure SQL Database [Preview] (link)
[2] Microsoft Fabric Updates Blog (2024) Announcing the Public Preview of Mirroring in Microsoft Fabric, by Charles Webb (link)

13 April 2024

R Language: Using the lessR Package in Microsoft Fabric's Notebooks (Test Drive)

I've started to use again the R languages for data visualizations. Discovering the lessR package, which simplifies considerably the verbose syntax of the R language by encapsulating the functionality behind simple functions, I wondered whether it can be installed in Microsoft Fabric and used from notebooks. Besides the available documentation, for learning I used also David W Berging's book on R visualizations [3].

Into a new notebook, I used one cell for each installation or package retrieval (see [1], [2]):

#installing packages
#install.packages("tidyverse") #is in Microsoft Fabric preinstalled
install.packages("lessR")

#retrieve packages from library
library("tidyverse")
library("lessR")

I attempted to read the data from a http location via the lessR Read function and it worked 

d <- Read("http://lessRstats.com/data/employee.xlsx")

head(d)

However, attempting to use any of the lessR functions used for visualization displayed only the text output and not the visualizations. No matter what I did - suppressing the text, suppressing the generation of PDF files, the result was the same. It seems to be a problem with the output device, though I'm not sure how to solve this yet. 

# supressing the text
style(quiet=TRUE)

# reenabling the text
style(quiet=FALSE)

# supressing PDF  generation
pdf(NULL)

# retrieving current device used 
options()$device

I was able to run the ggplot2 scripts from [3] though only when the lessR was also installed (each script should be run in its own cell, otherwise only the last plot is shown):

# bard charts 
ggplot(d) + geom_bar(aes(Dept)) 

# histogram
ggplot(d, aes(Salary)) + geom_histogram(binwidth=10000) 

# integrated violin/box/scatterplot
ggplot(d, aes(x="", y=Salary)) +
geom_violin(fill="gray90", bw=9500, alpha=.3) +
geom_boxplot(fill="gray75", outlier.color="black", width=0.25) +
geom_jitter(shape=16, position=position_jitter(0.05)) +
theme(axis.title.y=element_blank()) +
coord_flip()

# enhanced scatterplot 
ggplot(d, aes(Years, Salary)) + geom_point() +
geom_smooth(method=lm, color="black") +
stat_ellipse(type="norm") +
geom_vline(aes(xintercept=mean(Years, na.rm=TRUE)), color="gray70") +
geom_hline(aes(yintercept=mean(Salary), na.rm=TRUE), color="gray70")

Similar results could be obtained by using the following lessR syntax in RStudio:

# bard charts 
BarChart(Dept)

# histogram
Histogram(Salary) 

# integrated violin/box/scatterplot
Plot(Salary)

# enhanced scatterplot 
Plot(Years, Salary, enhance=TRUE)

Trying to see whether I can access the data from a lakehouse via SparkR, I've downloaded the file from the support website [3], loaded the data into an available lakehouse (e.g. UAT), respectiveley loaded the data to a new table:

-- creating the table
CREATE TABLE [dbo].[employee](
	[Name] [varchar](8000) NULL,
	[Years] [int] NULL,
	[Gender] [varchar](8000) NULL,
	[Dept] [varchar](8000) NULL,
	[Salary] [float] NULL,
	[JobSat] [varchar](8000) NULL,
	[Plan] [int] NULL,
	[Pre] [int] NULL,
	[Post] [int] NULL
) ON [PRIMARY]
GO

-- checking the data
SELECT *
FROM [dbo].[employee]

I was able to access the content of the imported file via the following script:

#access the file from lakehouse
#csv_file <- "https://onelake.dfs.fabric.microsoft.com/<file_system>/<account_name>/Files/OpenSource/employee.csv"
#csv_file <- "abfss://<file_system>.dfs.fabric.microsoft.com/<account_name>/Files/OpenSource/employee.csv"

csv_file <- "Files/OpenSource/employee.csv"

df <- read.df(csv_file, source= "csv", header = "true", inferSchema = "true")

display(df)

Initially, I wasn't able to access the table directly, though in the end I was able to retrieve the data (without and with the catalog's name):

# creating a data frame via SparkSQL
dfEmp <- sql("SELECT * FROM Employee")

head(dfEmp)

Comments:
1) Once the sessions timeout, it seems that one needs to rerun the scripts, which proves to be time-consuming as the installation takes about 5 minutes. 
2) Being able to use lessR directly in Microsoft Fabric could be a real win given its simple syntax. I run most of the tests from the book [3] plus some of the recommended scripts and the results are satisfactory. 
3) The connection via the ABFS path to the lakehouse works as well, but not via URL. 

References:
[1] Microsoft Learn - Microsoft Fabric (2023) R library management(link)
[2] lessR (2024) Data (link)
[3] David W Gerbing (2020) R Visualizations: Derive Meaning from Data
[4] CRAN=R (2024) Package lassR (link)

10 April 2024

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Import data from CSV Files into Delta Table

Microsoft provides a set of labs and exercises that can be used to learn working with data in Fabric, however the real learning comes when one considers an example that introduces something new. As I've downloaded some time ago an archive with several datasets on Sales forecast from the Kaggle website, I tried to import the Features dataset in different ways and see how it goes.

Store,Date,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,IsHoliday
1,2010-02-05,42.31,2.572,NA,NA,NA,NA,NA,211.0963582,8.106,FALSE
1,2010-02-12,38.51,2.548,NA,NA,NA,NA,NA,211.2421698,8.106,TRUE
1,2010-02-19,39.93,2.514,NA,NA,NA,NA,NA,211.2891429,8.106,FALSE
1,2010-02-26,46.63,2.561,NA,NA,NA,NA,NA,211.3196429,8.106,FALSE
1,2010-03-05,46.5,2.625,NA,NA,NA,NA,NA,211.3501429,8.106,FALSE

Within an existing lakehouse, one can import the CSV as it is via 'Files/Upload' and once the data is imported, once can navigate to the file and use 'Load to Tables/New Table' to important the data into a managed table. Unfortunately, because some of the numeric fields include also literal values "NA" for the columns with NULLs, their data type is considered as varchar(8000), which is not ideal for calculations:

-- table created via Load to Tables
CREATE TABLE [dbo].[walmart_features](
	[Store] [int] NULL,
	[Date] [date] NULL,
	[Temperature] [float] NULL,
	[Fuel_Price] [float] NULL,
	[MarkDown1] [varchar](8000) NULL,
	[MarkDown2] [varchar](8000) NULL,
	[MarkDown3] [varchar](8000) NULL,
	[MarkDown4] [varchar](8000) NULL,
	[MarkDown5] [varchar](8000) NULL,
	[CPI] [varchar](8000) NULL,
	[Unemployment] [varchar](8000) NULL,
	[IsHoliday] [bit] NULL
) ON [PRIMARY]

This could be fixed by replacing the NA values with an empty value, which I did and used this version for the next steps. 

I tried then using Spark to import the data, though then all the fields are defined as varchar(8000).

-- table created via Spark
CREATE TABLE [dbo].[walmart_features2](
	[Store] [varchar](8000) NULL,
	[Date] [varchar](8000) NULL,
	[Temperature] [varchar](8000) NULL,
	[Fuel_Price] [varchar](8000) NULL,
	[MarkDown1] [varchar](8000) NULL,
	[MarkDown2] [varchar](8000) NULL,
	[MarkDown3] [varchar](8000) NULL,
	[MarkDown4] [varchar](8000) NULL,
	[MarkDown5] [varchar](8000) NULL,
	[CPI] [varchar](8000) NULL,
	[Unemployment] [varchar](8000) NULL,
	[IsHoliday] [varchar](8000) NULL
) ON [PRIMARY]
GO

So, is needed to define the schema explicitly, however I had to import the IsHoliday as string and cast the value explicitly to a Boolean using a second data frame (see alternatives):

from pyspark.sql.types import *
from pyspark.sql.functions import *

#define schema
featuresSchema = StructType([
      StructField("Store", IntegerType())
    , StructField("Date", DateType())
    , StructField("Temperature",  DecimalType(13,2))
    , StructField("Fuel_Price", DecimalType(13,2))
    , StructField("MarkDown1", DecimalType(13,2))
    , StructField("MarkDown2", DecimalType(13,2))
    , StructField("MarkDown3", DecimalType(13,2))
    , StructField("MarkDown4", DecimalType(13,2))
    , StructField("MarkDown5", DecimalType(13,2))
    , StructField("CPI", DecimalType(18,6))
    , StructField("Unemployment", DecimalType(13,2))
    , StructField("IsHoliday", StringType())
])

# Load a file into a dataframe
df = spark.read.load('Files/OpenSource/features2.csv'
    , format='csv'
    , schema = featuresSchema
    , header=True)

# do the conversion for isHoliday
df2 = df.withColumn("IsHoliday", df.IsHoliday.cast(BooleanType())) # Save the dataframe as a delta table df2.write.format("delta").saveAsTable("walmart_features3")

Now, table's definition looks much better:

-- table created via Spark with explicit schema
CREATE TABLE [dbo].[walmart_features3](
	[Store] [int] NULL,
	[Date] [date] NULL,
	[Temperature] [decimal](13, 2) NULL,
	[Fuel_Price] [decimal](13, 2) NULL,
	[MarkDown1] [decimal](13, 2) NULL,
	[MarkDown2] [decimal](13, 2) NULL,
	[MarkDown3] [decimal](13, 2) NULL,
	[MarkDown4] [decimal](13, 2) NULL,
	[MarkDown5] [decimal](13, 2) NULL,
	[CPI] [decimal](18, 6) NULL,
	[Unemployment] [decimal](13, 2) NULL,
	[IsHoliday] [bit] NULL
) ON [PRIMARY]
GO

Comments:
(1) I tried to apply the schema change directly on the initial data frame, though the schema didn't change:

df.withColumn("IsHoliday", df.IsHoliday.cast(BooleanType()))

(2) For the third method one could have left the NA in because by the conversion a NULL will be considered. Conversely, it might be needed to check if there are other values that fail the conversion. 

(3) The following warning in the Notebook when running the above code is a hint that something went wrong during the conversion (e.g. decimals were cut): 

"Your file(s) might include corrupted records"

(4) Especially for the transformed values it makes sense to look at the values (at last when the dataset isn't too big):

-- validating the values for the Boolean data field
SELECT IsHoliday
, count(*) NoRecords
FROM dbo.walmart_features3
GROUP BY IsHoliday
ORDER BY 1

(5) The tables can be deleted directly in the lakehouse or via PySpark (observe the catalog.table_name):

#dropping the table
spark.sql('DROP TABLE IF EXISTS UAT.walmart_features3') 

At the beginning probably it makes sense to remove "IF EXISTS" to make sure that the table is available.

(6) For those who run into a similar issue, the SQL Endpoint for the lakehouse is read only, therefore attempting to delete a table via SSMS will result in such an error:

Drop failed for Table 'dbo.walmart_features2'.  (Microsoft.SqlServer.Smo)
The external policy action 'Microsoft.Sql/Sqlservers/Databases/Schemas/Tables/Drop' was denied on the requested resource.
Cannot drop the table 'walmart_features2', because it does not exist or you do not have permission.

Happy coding!

06 April 2024

Microsoft Fabric: Data Governance (Notes)

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!
Last updated: 23-May-2024

[Microsoft Fabric] Data Governance

  • {definition}set of capabilities that help organizations to manage, protect, monitor, and improve the discoverability of data, so as to meet data governance (and compliance) requirements and regulations [2]
  • several built-in governance features are available to manage and control the data within Fabric (MF)  [1]
  • {feature} endorsement [aka content endorsement
    • {definition} formal process performed by admins to endorse MF items
    • {benefit} allows admins to designate specific MF items as trusted and approved for use across the organization [1]
      • establishes trust in data assets by promoting and certifying specific MF items [1]
        • users know which assets they can trust and rely on for accurate information [1]
      • endorsed assets are identified with a badge that indicates they have been reviewed and approved [1]
    • {scope} applies to all MF items except dashboards [1]
    • {benefit} helps admin manage the overall growth of items across your environment [1]
  • {feature} promoting [aka content promoting
    • {definition} formal process performed by contributors or admins to promote content
    • promoted content appears with a Promoted badge in the MF portal [1]
      • workspace members with the contributor or admin role can promote content within a workspace [1]
      • MF admin can promote content across the organization [1]
  • {feature} certification [aka content certification]
    • {definition} formal process that involves a review of the content by a designated reviewer and managed by the admin [1]
      • can be customized to meet organization’s needs [1]
      • users can request item certification from an admin [1]
        • via Request certification from the More menu [1]
      • the certified content appears with a Certified badge in the Fabric portal [1]
    • {benefit} allows organizations to label items considered to be quality items [1]
      • an organization can certify items to identify them an as authoritative sources for critical information [1]
        • ⇐ all Fabric items except Power BI dashboards can be certified [1]
    • {benefit} allows to specify certifiers who are experts in the domain [1]
    • domain level settings
      • enable or disable certification of items that belong to the domain [1]
    • provides a URL to documentation that is relevant to certification in the domain [1]
  • {feature} tenant (aka Microsoft Fabric tenant, MF tenant)
    • a single instance of Fabric for an organization that is aligned with a Microsoft Entra ID
    • can contain any number of workspaces
  • {feature} workspaces
    • {definition} a collection of items that brings together different functionality in a single environment designed for collaboration
    • can be assigned to teams or departments based on governance requirements and data boundaries [2]
    • are associated with domains [3]
      • ⇐ {benefit} allows to group data into business domains
      • all the items in the workspace are then associated with the domain, and they receive a domain attribute as part of their metadata [3]
        • ⇐ {benefit} enables a better consumption experience [1]
        • {benefit} enables better discoverability and governance [2]
  • {feature} domains [Notes]
    • {definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]
    • allows to group data by business domains
      • ⇒{benefit} allows business domains to manage their data according to their specific regulations, restrictions, and needs [3]
    • {feature} subdomains
      • {definition} a way for fine tuning the logical grouping data under a domain [1]
        • ⇐ subdivisions of a domain
  • {feature} labeling
    • default labeling, label inheritance, and programmatic labeling, 
    • {benefit} help achieve maximal sensitivity label coverage across MF [2]
    • once labeled, data remains protected even when it's exported out of MF via supported export paths [2]
    • [Purview Audit] compliance admins can monitor activities on sensitivity labels
  • {feature|preview} folders
    • {definition} a way of logically grouping MF items
  • {feature|preview} tags
    • {benefit} allow managing Fabric items for enhanced compliance, discoverability, and reuse
  • {feature} scanner API
    • a set of admin REST APIs 
    • {benefit} allows to scan MF items for sensitive data [1]
    • can be used to scan both structured and unstructured data [1]
    • {concept} metadata scanning
      • facilitates governance of data by enabling cataloging and reporting on all the metadata of organization's Fabric items [1]
      • it needs to be set up by Admin before metadata scanning can be run [1]
  • {concept} data lineage
    • {definition} 
    • {benefit} allows to track the flow of data through Fabric [1]
    • {benefit} allows to see where data comes from, how it's transformed, and where it goes [1]
    • {benefit} helps understand the data available in Fabric, and how it's being used [1]
  • {concept} Fabric item (aka MF item)
    • {definition} a set of capabilities within an experience
      • form the building blocks of the Fabric platform
    • {type} data warehouse
    • {type} data pipeline
    • {type} semantic model
    • {type} reports
    • {type} dashboards
    • {type} notebook
    • {type} lakehouse
    • {type} metric set

Acronyms:
API - Application Programming Interface
MF - Microsoft Fabric

Resources:
[1] Microsoft Learn (2023) Administer Microsoft Fabric (link)
[2] Microsoft Learn - Fabric (2024) Governance overview and guidance (link)
[3] Microsoft Learn: Fabric (2023) Fabric domains (link)
[4] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam (link

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.