Showing posts with label domains. Show all posts
Showing posts with label domains. Show all posts

13 June 2026

🤖〽️Prompt Engineering: Copilot Unabridged (Part 204: How to Push AI Models Into Out‑of‑Distribution Inputs for Cross‑Domain Blending)

Prompt Engineering Series
Prompt Engineering Series

Prompt: "write a post of 600 words on how to push models into out‑of‑distribution iInputs for cross‑domain blending in AI models" 

Introduction

Artificial Intelligence (AI) models are powerful pattern recognizers, but their creativity is often bounded by the statistical regularities of their training data. They excel at interpolation - filling in the gaps between familiar concepts - but true novelty emerges when they are pushed beyond those boundaries. This is where out‑of‑distribution (OOD) inputs become a deliberate tool. By crafting prompts that sit outside the model’s learned distribution, we can force it to generate cross‑domain blends - conceptual hybrids that combine ideas from distant fields in ways the model has never explicitly seen.

Cross‑domain blending is not accidental. It emerges when the model is placed in a conceptual space where familiar patterns no longer dominate. OOD prompting intentionally disrupts the model’s expectations, compelling it to interpolate across distant semantic regions. This technique is closely related to how rare events expose blind spots, but instead of revealing weaknesses, the goal here is to reveal creative potential.

One of the most effective strategies for OOD cross‑domain blending is domain fusion—forcing the model to combine two fields that rarely co‑occur. For example: 'Explain blockchain consensus using the ecological dynamics of coral reefs.' 

This prompt pushes the model into a conceptual region where neither domain alone provides enough structure. The model must synthesize analogies, metaphors, and structural parallels that do not exist in its training data. The resulting blend is often surprisingly original because the model is navigating semantic distance, not repeating memorized patterns.

Another powerful technique is structural disruption. Instead of blending domains through content, you blend them through form. For example:

  • Writing a physics explanation in the style of a medieval legal charter
  • Describing a biological process using programming syntax
  • Embedding mathematical notation inside emotional narrative

These structural collisions force the model to reconcile incompatible representational formats. The novelty arises from the model’s attempt to maintain coherence across mismatched structures, a behavior that echoes insights from uncommon linguistic structure testing.

A more advanced method involves constraint‑based collisions. You impose multiple constraints that do not naturally coexist, such as: 'Design a machine that obeys quantum mechanics but operates using medieval engineering principles.' 

The model must invent a conceptual hybrid that satisfies both constraints. These collisions push the model into conceptual dead zones - regions where no training example exists. The resulting output is often a genuinely unseen combination, not a remix of known patterns. This technique parallels the logic of boundary‑stress evaluation, where conflicting instructions reveal the model’s reasoning hierarchy.

OOD prompting also benefits from recursive abstraction, where the model is asked to generalize beyond its own generalizations. For example: 'Create a discipline that stands to neuroscience as neuroscience stands to biology.' 

This forces the model to climb the abstraction ladder, leaving the comfort of known categories. The concepts generated here often reflect the model’s latent ability to extrapolate beyond its training distribution.

Finally, synthetic anomalies - inputs that deliberately violate statistical norms - can act as conceptual shockwaves. These anomalies disrupt the model’s usual pathways and encourage it to explore new ones. When guided carefully, they reveal novel conceptual pathways, much like scientific breakthroughs that emerge from anomalies challenging established theories.

Ultimately, pushing models into OOD inputs is about expanding the frontier of machine creativity. By exploring the edges of conceptual space - through domain fusion, structural disruption, constraint collisions, recursive abstraction, and synthetic anomalies - we can coax AI models into generating cross‑domain blends that are not just new, but genuinely unseen.

Disclaimer: The whole text was generated by Copilot (under Windows 11) at the first attempt. This is just an experiment to evaluate feature's ability to answer standard general questions, independently on whether they are correctly or incorrectly posed. Moreover, the answers may reflect hallucinations and other types of inconsistent or incorrect reasoning.

Previous Post <<||>> Next Post

12 June 2026

🤖〽️Prompt Engineering: Copilot Unabridged (Part 203: How to Push AI Models Into Out‑of‑Distribution Inputs to Generate Unseen Combinations)

Prompt Engineering Series
Prompt Engineering Series

Prompt: "write a post of 600 words on how to push models into out‑of‑distribution iInputs to generate unseen combinations in AI models"

Introduction

Artificial Intelligence (AI) models are exceptional at recombining patterns they’ve already seen. But the frontier of creativity - true novelty - emerges when we push them beyond the familiar. This is where out‑of‑distribution (OOD) inputs come in. By deliberately crafting prompts that sit outside the model’s training distribution, we can force it to generate unseen combinations, conceptual hybrids, and surprising structures that don’t simply remix the past. OOD prompting is not about breaking the model; it’s about expanding the boundaries of its conceptual space.

At the core of OOD prompting is the idea of disrupting statistical expectations. AI models learn from massive datasets, but those datasets are uneven. Some patterns dominate; others barely appear. When you push a model into regions where its learned representations are sparse, it must interpolate across distant conceptual clusters. This is where novelty emerges. This principle connects directly to rare‑event blind‑spot analysis, where unusual inputs reveal hidden weaknesses - and hidden creative potential.

One of the most effective ways to generate unseen combinations is through cross‑domain fusion. This involves taking two domains that rarely co‑occur and forcing the model to integrate them. For example: 'Describe a financial derivative using the grammar of marine biology.' 

The model must bridge conceptual regions that are normally far apart. This produces hybrid structures - new metaphors, new analogies, new conceptual blends - that would never appear in standard prompting. Cross‑domain fusion leverages the model’s internal geometry, where distant concepts can still be interpolated if the prompt forces a connection.

Another powerful technique is structural perturbation. Instead of changing the content of a prompt, you alter its structure in ways the model rarely encounters. For example:

  • Embedding code inside poetry
  • Mixing symbolic logic with emotional narrative
  • Using recursive or self‑referential instructions

These perturbations push the model into unfamiliar syntactic territory. Because the model must reconcile incompatible structures, it often produces novel structural combinations - new forms, new patterns, new conceptual scaffolds. This method aligns with insights from uncommon linguistic structure testing.

A more advanced approach involves constraint collisions. You give the model multiple constraints that do not naturally coexist, forcing it to invent a solution that satisfies all of them. For example: 'Create a creature that obeys thermodynamics but violates evolutionary logic.' 

The model must synthesize a concept that fits neither domain cleanly. These collisions push the model into conceptual dead zones—regions where no training example exists. The resulting output is often a genuinely unseen combination, not a remix of known patterns. This technique parallels the logic of boundary‑stress evaluation, where conflicting instructions reveal the model’s reasoning hierarchy.

OOD prompting also benefits from recursive abstraction, where the model is asked to generalize beyond its own generalizations. For example: 'Invent a field of study that stands to machine learning as machine learning stands to statistics.' 

This forces the model to climb the abstraction ladder, leaving the comfort of known categories. The concepts generated here often reflect the model’s latent ability to extrapolate beyond its training distribution.

Finally, you can use synthetic anomalies - inputs that deliberately violate statistical norms. These anomalies act as conceptual shockwaves, disrupting the model’s usual pathways and encouraging it to explore new ones. When guided carefully, they reveal novel conceptual pathways, much like scientific breakthroughs that emerge from anomalies challenging established theories.

Ultimately, pushing models into OOD inputs is about expanding the frontier of machine creativity. By exploring the edges of conceptual space - through cross‑domain fusion, structural perturbation, constraint collisions, recursive abstraction, and synthetic anomalies - we can coax AI models into generating combinations that are not just new, but genuinely unseen.

Disclaimer: The whole text was generated by Copilot (under Windows 11) at the first attempt. This is just an experiment to evaluate feature's ability to answer standard general questions, independently on whether they are correctly or incorrectly posed. Moreover, the answers may reflect hallucinations and other types of inconsistent or incorrect reasoning.

Previous Post <<||>> Next Post

11 June 2026

🤖〽️Prompt Engineering: Copilot Unabridged (Part 202: How Pushing AI Models Into Out‑of‑Distribution Inputs Generates Novel Concepts)

Prompt Engineering Series

Prompt: "write a post of 600 words on how to push models into out‑of‑distribution iInputs to generate novel concepts in AI models"

Introduction

Artificial Intelligence (AI) models excel at interpolation - filling in the gaps between patterns they’ve already seen. But the frontier of creativity, innovation, and conceptual discovery lies outside those familiar boundaries. To reach that frontier, researchers use out‑of‑distribution (OOD) inputs: prompts, structures, or data patterns that sit beyond the model’s training distribution. When done intentionally and safely, this technique can reveal how models generalize, how they stretch their internal representations, and how they generate novel concepts that do not simply remix the past.

Pushing a model into OOD territory is not about confusing it. It’s about stress‑testing its conceptual elasticity. Models trained on massive datasets develop dense clusters of meaning - regions where concepts are richly represented - and sparse regions where the model has little experience. OOD inputs target those sparse regions. They force the model to navigate conceptual space without the usual statistical anchors, revealing how it constructs meaning when familiar patterns disappear. This connects directly to rare‑event blind‑spot analysis, where unusual inputs expose hidden weaknesses.

One powerful method for generating OOD conditions is structural perturbation. Instead of changing the content of a prompt, researchers alter its structure - using unusual syntax, hybrid formats, or nested instructions. For example, combining mathematical notation with poetic metaphor, or embedding code inside rhetorical questions. These hybrid structures push the model into regions where its learned representations overlap in unexpected ways. The model must reconcile incompatible patterns, often producing emergent conceptual blends that would not appear in standard prompting. This technique aligns with insights from uncommon linguistic structure testing.

Another approach involves semantic displacement - asking the model to apply concepts from one domain to another where they do not naturally belong. For example: 'Describe quantum entanglement using the logic of medieval guild economics.' This forces the model to map distant conceptual regions together, creating novel analogies or frameworks. These mappings are not random; they reveal how the model organizes knowledge internally. When the model is pushed far enough, it begins to generate new conceptual hybrids, not because it has seen them before, but because its internal geometry allows it to interpolate across distant domains.

A more advanced technique uses contradictory task layering, where the model must satisfy overlapping constraints that do not naturally coexist. For example: 'Invent a biological organism that obeys thermodynamic laws but violates known evolutionary principles.' These prompts push the model into conceptual dead zones - regions where no training example exists. The model must synthesize new structures to satisfy the constraints, often producing novel theoretical constructs. This method parallels the logic of boundary‑stress evaluation, where conflicting instructions reveal the model’s reasoning hierarchy.

OOD prompting also benefits from recursive abstraction, where the model is asked to generalize beyond its own generalizations. For instance: 'Generate a concept that is to machine learning what machine learning is to statistics.' This forces the model to climb the abstraction ladder, leaving the comfort of known categories. The resulting concepts often reflect the model’s latent ability to extrapolate beyond its training distribution.

Finally, OOD exploration can involve synthetic anomalies - inputs that deliberately violate the statistical norms of the training data. These anomalies act as conceptual shockwaves, pushing the model to reorganize its internal representations. When guided carefully, they can reveal new conceptual pathways, much like how scientific breakthroughs often emerge from anomalies that challenge existing theories.

Ultimately, pushing models into OOD inputs is not about breaking them. It is about discovering the edges of their conceptual space. By exploring those edges, researchers can uncover how models generalize, how they innovate, and how they generate ideas that go beyond the sum of their training data. OOD prompting is a tool for expanding the frontier of machine creativity - one carefully engineered anomaly at a time.

Disclaimer: The whole text was generated by Copilot (under Windows 11) at the first attempt. This is just an experiment to evaluate feature's ability to answer standard general questions, independently on whether they are correctly or incorrectly posed. Moreover, the answers may reflect hallucinations and other types of inconsistent or incorrect reasoning.

Previous Post <<||>> Next Post

07 June 2026

🤖〽️Prompt Engineering: Copilot Unabridged (Part 198: How Domain‑Specific Anomalies Expose Blind Spots in AI Models)

Prompt Engineering Series
Prompt Engineering Series

Prompt: "write a post of 600 words on how domain‑specific anomalies expose blind spots in AI models"

Introduction

Artificial Intelligence (AI) models are often praised for their versatility, but their real limitations become visible only when they step outside the comfort zone of general‑purpose language. When a model encounters domain‑specific anomalies - the unusual patterns, edge‑case behaviors, or irregular structures that appear only within a particular field - it is forced to operate without the statistical safety net it relies on. These anomalies act like diagnostic probes, revealing blind spots that remain hidden during everyday interactions.

To understand why domain‑specific anomalies are so revealing, you have to consider how AI models learn. They absorb patterns from massive datasets, but those datasets are never evenly distributed across all fields. Some domains - like everyday conversation, news, or common technical topics - are heavily represented. Others - like niche scientific notation, legal edge cases, rare medical conditions, or obscure programming paradigms—appear only sparsely. This imbalance creates statistical shadows, areas where the model’s internal representation is thin or incomplete.

When an anomaly appears inside one of these shadows, the model’s behavior becomes a window into its internal reasoning. For example, a model trained heavily on mainstream medical literature may perform well on common diagnoses but struggle when confronted with a rare syndrome or an atypical symptom cluster. The model may latch onto the wrong cue, misinterpret the structure of the description, or default to generic reasoning. These failures expose the over‑generalization that occurs when a model tries to stretch familiar patterns into unfamiliar territory.

Domain‑specific anomalies also reveal how models handle specialized linguistic structures. Fields like law, mathematics, chemistry, and finance each have their own micro‑languages - dense with symbols, conventions, and implicit assumptions. When an anomaly disrupts these conventions, the model must decide which cues to trust. A misplaced operator in a mathematical expression, an unusual clause ordering in a legal contract, or a non‑standard chemical notation can cause the model to misread the entire structure. These moments show where the model’s understanding is superficial, echoing the challenges seen in uncommon linguistic structures.

Another revealing category involves procedural anomalies - cases where a domain has strict rules, and the anomaly breaks them. In programming, for example, a function that violates typical naming conventions or a code block that mixes paradigms can confuse the model’s internal heuristics. In finance, an unusual transaction pattern may cause the model to misclassify risk. In scientific writing, a non‑standard experimental layout may lead the model to misinterpret the methodology. These anomalies expose the model’s reliance on pattern familiarity rather than true conceptual understanding.

Domain‑specific anomalies also highlight the limits of contextual transfer. A model may perform well when a domain behaves predictably, but when an anomaly forces the model to transfer knowledge across contexts - such as applying physics reasoning to a biological edge case - it may reveal gaps in its internal conceptual map. These gaps often align with the same vulnerabilities uncovered through weak‑point mapping, where the model over‑trusts certain cues simply because they dominate the training distribution.

Perhaps the most important insight is that domain‑specific anomalies expose hidden assumptions baked into the model. Every domain has its own logic, and models often internalize simplified versions of that logic. When an anomaly violates those assumptions, the model’s response shows how rigid or flexible its internal representation truly is. A well‑aligned model adapts; a brittle one collapses into generic or incorrect reasoning.

Ultimately, domain‑specific anomalies are not just edge cases - they are stress tests that reveal the contours of an AI model’s understanding. They show where the model is robust, where it is brittle, and where its blind spots lie. By studying these anomalies, researchers can build models that are not only more capable, but also more transparent, predictable, and aligned with the complexity of real‑world domains.

Disclaimer: The whole text was generated by Copilot (under Windows 11) at the first attempt. This is just an experiment to evaluate feature's ability to answer standard general questions, independently on whether they are correctly or incorrectly posed. Moreover, the answers may reflect hallucinations and other types of inconsistent or incorrect reasoning.

Previous Post <<||>> Next Post

23 May 2024

🏭🗒️Microsoft Fabric: Domains [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 29-May-2024

Domains & Entities

[Microsoft Fabric] Domains

  • {definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]
    • associated with workspaces 
      • {benefit} allows to group data into business domains [1]
      • all the items in the workspace are then associated with the domain, and they receive a domain attribute as part of their metadata [1]
      • {benefit} enables a better consumption experience [1]
        • simplify discovery and consumption 
  • provide a management boundary between tenant and workspace enabling domain admins to have more granular control over multiple workspaces [6]
    • some tenant-level settings for managing and governing data can be delegated to the domain level [2]
  • allow to achieve federated governance [7]
    • by delegating settings to domain admins
      • ⇒ allow provide more granular control over business area [7]
  • [security] domain roles
    • Fabric admins (or higher)
      • can create and edit domains
      • can specify domain admins and contributors
      • can associate workspaces with domains [4]
      • can see, edit, and delete all domains in the admin portal [4]
      • domain admins
    • business owners or experts of a domain
      • can update the domain description
      • can define contributors
      • can associate workspaces with the domain [4]
      • can define and update the domain image
      • can override tenant settings for any specific settings the tenant admin has delegated to the domain level [4]
      • can't delete the domain, change the domain name, or add/delete other domain admins
      • can only see and edit the domains they're admins of.
    • domain contributors
      • ⇐ must be a workspace admin
      • can associate their workspaces with a domain or change the current domain association
      • don’t have access to the Domains page in the admin portal
    • domain users
      • can share lakehouse with other domain users without giving access to workspace and other artifacts [4]
  • {concept} default domain
    • a domain that has been specified as the default domain for specific users and/or security groups [3]
      • ⇒ when these users/security groups create/update a new/unassigned workspace, that workspace will automatically be assigned to that domain [3]
      • ⇒ generally automatically become domain contributors of the workspaces that are assigned in this manner [3]
  • {feature} subdomains
    • a way for fine tuning the logical grouping data under a domain [1]
      • subdivisions of a domain
      • only one level is supported in the hierarchy
    • visible as part of the domains filter and as part of the item location path
    • no setup available
      • [planned] some domain settings will be added to subdomains as well

References:
[1] Microsoft Learn (2023) Administer Microsoft Fabric (link)
[2] Microsoft Learn - Fabric (2024) Governance overview and guidance (link)
[3] Microsoft Learn: Fabric (2023) Fabric domains (link)
[4] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam (link
[5] Microsoft Fabric Updates Blog (2024) Easily implement data mesh architecture with domains in Fabric, by Naama Tsafrir 
(link)
[6] Microsoft (2024) Microsoft Fabric Domains – Data Mesh [with Naama Tsafrir & Assaf Shemesh]
[7] Microsoft Fabric (2024) Fabric Analyst in a Day [course notes]
[8] Microsoft Learn (2025) Tags in Microsoft Fabric [link

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
[R2] Microsoft Learn (2025) Best practices for planning and creating domains in Microsoft Fabric [link]

Acronyms:
MF - Microsoft Fabric

06 May 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part III: The Metrics Layer) 🆕

Introduction

One of the announcements of this year's Microsoft Fabric Community first conference was the introduction of a metrics layer in Fabric which "allows organizations to create standardized business metrics, that are rooted in measures and are discoverable and intended for reuse" [1]. As it seems, the information content provided at the conference was kept to a minimum given that the feature is still in private preview, though several webcasts start to catch up on the topic (see [2], [4]). Moreover, as part of their show, the Explicit Measures (@PowerBITips) hosts had Carly Newsome as invitee, the manager of the project, who unveiled more details about the project and the feature, details which became the main source for the information below. 

The idea of a metric layer or metric store is not new, data professionals occasionally refer to their structure(s) of metrics as such. The terms gained weight in their modern conception relatively recently in 2021-2022 (see [5], [6], [7], [8], [10]). Within the modern data stack, a metrics layer or metric store is an abstraction layer available between the data store(s) and end users. It allows to centrally define, store, and manage business metrics. Thus, it allows us to standardize and enforce a single source of truth (SSoT), respectively solve several issues existing in the data stacks. As Benn Stancil earlier remarked, the metrics layer is one of the missing pieces from the modern data stack (see [10]).

Microsoft's Solution

Microsoft's business case for metrics layer's implementation is based on three main ideas (1) duplicate measures contribute to poor data quality, (2) complex data models hinder self-service, (3) reduce data silos in Power BI. In Microsoft's conception the metric layer provides several benefits: consistent definitions and descriptions, easy management via management views, searchable and discoverable metrics, respectively assure trust through indicators. 

For this feature's implementation Microsoft introduces a new Fabric Item called a metric set that allows to group several (business) metrics together as part of a mini-model that can be tailored to the needs of a subset of end-users and accessed by them via the standard tools already available. The metric set becomes thus a mini-model. Such mini-models allow to break down and reduce the overall complexity of semantic models, while being easy to evolve and consume. The challenge will become then on how to break down existing and future semantic models into nonoverlapping mini-models, creating in extremis a partition (see the Lego metaphor for data products). The idea of mini-models is not new, [12] advocating the idea of using a Master Model, a technique for creating derivative tabular models based on a single tabular solution.

A (business) metric is a way to elevate the measures from the various semantic models existing in the organization within the mini-model defined by the metric set. A metric can be reused in other fabric artifacts - currently in new reports on the Power BI service, respectively in notebooks by copying the code. Reusing metrics in other measures can mean that one can chain metrics and the changes made will be further propagated downstream. 

The Metrics Layer in Microsoft Fabric (adapted diagram)
The Metrics Layer in Microsoft Fabric (adapted diagram)

Every metric is tied to the original semantic model which allows thus to track how a metric is used across the solutions and, looking forward to Purview, to identify data's lineage. A measure is related to a "table", the source from which the measure came from.

Users' Perspective

The Metrics Layer feature is available in Microsoft Fabric service for Power BI within the Metrics menu element next to Scorecards. One starts by creating a metric set in an existing workspace, an operation which creates the actual artifact, to which the individual metrics are added. To create a metric, a user with build permissions can navigate through the semantic models across different workspaces he/she has access to, pick a measure from one of them and elevate it to a metric, copying in the process its measure's definition and description. In this way the metric will always point back to the measure from the semantic model, while the metrics thus created are considered as a related collection and can be shared around accordingly. 

Once a metric is added to the metric set, one can add in edit mode dimensions to it (e.g. Date, Category, Product Id, etc.). One can then further explore a metric's output and add filters (e.g. concentrate on only one product or category) point from which one can slice-and-dice the data as needed.

There is a panel where one can see where the metric has been used (e.g. in reports, scorecards, and other integrations), when was last time refreshed, respectively how many times was used. Thus, one has the most important information in one place, which is great for developers as well as for the users. Probably, other metadata will be added, such as whether an increase in the metric would be favorable or unfavorable (like in Tableau Pulse, see [13]) or maybe levels of criticality, an unit of measure, or maybe its type - simple metric, performance indicator (PI), result indicator (RI), KPI, KRI etc.

Metrics can be persisted to the OneLake by saving their output to a delta table into the lakehouse. As demonstrated in the presentation(s), with just a copy-paste and a small piece of code one can materialize the data into a lakehouse delta table, from where the data can be reused as needed. Hopefully, the process will be further automated. 

One can consume metrics and metrics sets also in Power BI Desktop, where a new menu element called Metric sets was added under the OneLake data hub, which can be used to connect to a metric set from a Semantic model and select the metrics needed for the project. 

Tapping into the available Power BI solutions is done via an integration feature based on Sempy fabric package, a dataframe for storage and propagation of Power BI metadata which is part of the python-based semantic Link in Fabric [11].

Further Thoughts

When dealing with a new feature, a natural idea comes to mind: what challenges does the feature involve, respectively how can it be misused? Given that the metrics layer can be built within a workspace and that it can tap into the existing measures, this means that one can built on the existing infrastructure. However, this can imply restructuring, refactoring, moving, and testing a lot of code in the process, hopefully with minimal implications for the solutions already available. Whether the process is as simple as imagined is another story. As misusage, in extremis, data professionals might start building everything as metrics, though the danger might come when the data is persisted unnecessarily. 

From a data mesh's perspective, a metric set is associated with a domain, though there will be metrics and data common to multiple domains. Moreover, a mini-model has the potential of becoming a data product. Distributing the logic across multiple workspaces and domains can add further challenges, especially in what concerns the synchronization and implemented of requirements in a way that doesn't lead to bottlenecks. But this is a general challenge for the development team(s). 

The feature will probably suffer further changes until is released in public review (probably by September or the end of the year). I subscribe to other data professionals' opinion that the feature was for long needed and that can have an important impact on the solutions built. 

Previous Post <<||>> Next Post

Resources:
[1] Microsoft Fabric Blog (2024) Announcements from the Microsoft Fabric Community Conference (link)
[2] Power BI Tips (2024) Explicit Measures Ep. 236: Metrics Hub, Hot New Feature with Carly Newsome (link)
[3] Power BI Tips (2024) Introducing Fabric Metrics Layer / Power Metrics Hub [with Carly Newsome] (link)
[4] KratosBI (2024) Fabric Fridays: Metrics Layer Conspiracy Theories #40 (link)
[5] Chris Webb's BI Blog (2022) Is Power BI A Semantic Layer? (link)
[6] The Data Stack Show (2022) TDSS 95: How the Metrics Layer Bridges the Gap Between Data & Business with Nick Handel of Transform (link)
[7] Sundeep Teki (2022) The Metric Layer & how it fits into the Modern Data Stack (link)
[8] Nick Handel (2021) A brief history of the metrics store (link)
[9] Aurimas (2022) The Jungle of Metrics Layers and its Invisible Elephant (link)
[10] Benn Stancil (2021) The missing piece of the modern data stack (link)
[11] Microsoft Learn (2024) Sempy fabric Package (link)
[12] Michael Kovalsky (2019) Master Model: Creating Derivative Tabular Models (link)
[13] Christina Obry (2023) The Power of a Metrics Layer - and How Your Organization Can Benefit From It (link
[14] KratosBI (2024) Introducing the Metrics Layer in #MicrosoftFabric with Carly Newsome [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

06 April 2024

🏭🗒️Microsoft Fabric: Data Governance [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 23-May-2024

[Microsoft Fabric] Data Governance

  • {definition}set of capabilities that help organizations to manage, protect, monitor, and improve the discoverability of data, so as to meet data governance (and compliance) requirements and regulations [2]
  • several built-in governance features are available to manage and control the data within Fabric (MF)  [1]
  • {feature} endorsement [aka content endorsement
    • {definition} formal process performed by admins to endorse MF items
    • {benefit} allows admins to designate specific MF items as trusted and approved for use across the organization [1]
      • establishes trust in data assets by promoting and certifying specific MF items [1]
        • users know which assets they can trust and rely on for accurate information [1]
      • endorsed assets are identified with a badge that indicates they have been reviewed and approved [1]
    • {scope} applies to all MF items except dashboards [1]
    • {benefit} helps admin manage the overall growth of items across your environment [1]
  • {feature} promoting [aka content promoting
    • {definition} formal process performed by contributors or admins to promote content
    • promoted content appears with a Promoted badge in the MF portal [1]
      • workspace members with the contributor or admin role can promote content within a workspace [1]
      • MF admin can promote content across the organization [1]
  • {feature} certification [aka content certification]
    • {definition} formal process that involves a review of the content by a designated reviewer and managed by the admin [1]
      • can be customized to meet organization’s needs [1]
      • users can request item certification from an admin [1]
        • via Request certification from the More menu [1]
      • the certified content appears with a Certified badge in the Fabric portal [1]
    • {benefit} allows organizations to label items considered to be quality items [1]
      • an organization can certify items to identify them an as authoritative sources for critical information [1]
        • ⇐ all Fabric items except Power BI dashboards can be certified [1]
    • {benefit} allows to specify certifiers who are experts in the domain [1]
    • domain level settings
      • enable or disable certification of items that belong to the domain [1]
    • provides a URL to documentation that is relevant to certification in the domain [1]
  • {feature} tenant (aka Microsoft Fabric tenant, MF tenant)
    • a single instance of Fabric for an organization that is aligned with a Microsoft Entra ID
    • can contain any number of workspaces
  • {feature} workspaces
    • {definition} a collection of items that brings together different functionality in a single environment designed for collaboration
    • can be assigned to teams or departments based on governance requirements and data boundaries [2]
    • are associated with domains [3]
      • ⇐ {benefit} allows to group data into business domains
      • all the items in the workspace are then associated with the domain, and they receive a domain attribute as part of their metadata [3]
        • ⇐ {benefit} enables a better consumption experience [1]
        • {benefit} enables better discoverability and governance [2]
  • {feature} domains [Notes]
    • {definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]
    • allows to group data by business domains
      • ⇒{benefit} allows business domains to manage their data according to their specific regulations, restrictions, and needs [3]
    • {feature} subdomains
      • {definition} a way for fine tuning the logical grouping data under a domain [1]
        • ⇐ subdivisions of a domain
  • {feature} labeling
    • default labeling, label inheritance, and programmatic labeling, 
    • {benefit} help achieve maximal sensitivity label coverage across MF [2]
    • once labeled, data remains protected even when it's exported out of MF via supported export paths [2]
    • [Purview Audit] compliance admins can monitor activities on sensitivity labels
  • {feature|preview} folders
    • {definition} a way of logically grouping MF items
  • {feature|preview} tags
    • {benefit} allow managing Fabric items for enhanced compliance, discoverability, and reuse
  • {feature} scanner API
    • a set of admin REST APIs 
    • {benefit} allows to scan MF items for sensitive data [1]
    • can be used to scan both structured and unstructured data [1]
    • {concept} metadata scanning
      • facilitates governance of data by enabling cataloging and reporting on all the metadata of organization's Fabric items [1]
      • it needs to be set up by Admin before metadata scanning can be run [1]
  • {concept} data lineage
    • {definition} 
    • {benefit} allows to track the flow of data through Fabric [1]
    • {benefit} allows to see where data comes from, how it's transformed, and where it goes [1]
    • {benefit} helps understand the data available in Fabric, and how it's being used [1]
  • {concept} Fabric item (aka MF item)
    • {definition} a set of capabilities within an experience
      • form the building blocks of the Fabric platform
    • {type} data warehouse
    • {type} data pipeline
    • {type} semantic model
    • {type} reports
    • {type} dashboards
    • {type} notebook
    • {type} lakehouse
    • {type} metric set

Resources:
[1] Microsoft Learn (2023) Administer Microsoft Fabric (link)
[2] Microsoft Learn - Fabric (2024) Governance overview and guidance (link)
[3] Microsoft Learn: Fabric (2023) Fabric domains (link)
[4] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam (link

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
API - Application Programming Interface
MF - Microsoft Fabric

22 March 2024

🧭Business Intelligence: Monolithic vs. Distributed Architecture (Part III: Architectural Applications)

 

Business Intelligence
Business Intelligence Series

Now considering the 500 houses and the skyscraper model introduced in thee previous post, which do you think will be built first? A skyscraper takes 2-10 years to build, depending on the city in which is built and the architecture characteristics. A house may take 6-12 months depending on similar factors. But one needs to build 500 houses. For sure the process can be optimized when the houses look the same, though there are many constraints one needs to consider - the number of workers, tools, and the construction material available at a given time, the volume of planning, etc. 

Within a rough estimate, it can take 2-5 years for each architecture to be built considering that on the average the advantages and disadvantages from the various areas can balance each other out. Historical data are in general needed for estimating the actual development time. One can start with a rough estimate and reevaluate the estimates up and down as more information are gathered. This usually happens in Software Engineering as well. 

Monolith vs. Distributed Architecture
Monolith vs. Distributed Architecture - 500 families

There are multiple ways in which the work can be assigned to the contractors. When the houses are split between domains, each domain can have its own contractor(s) or the contractors can be specialized by knowledge areas, or a combination of the two. Contractors’ performance should be the same, though in practice no two contractors are the same. Conversely, the chances are higher for some contractors to deliver at the expected quality. It would be useful to have worked before with the contractors and have a partnership that spans years back. There are risks on both sides, even if the risks might favor one architecture over the other, and this depends also on the quality of the contractors, designs, and planning. 

The planning must be good if not perfect to assure smooth development and each day can cost money when contractors are involved. The first planning must be done for the whole project and then split individually for each contractor and/or group of buildings. A back-and-forth check between the various plans is needed. Managing by exception can work, though it can also go terribly wrong. 

Lot of communication must occur between domains to make sure that everything fits together. Especially at the beginning, all the parties must plan together, must make sure that the rules of the games (best practices, policies, procedures, processes, methodologies) are agreed upon. Oversight (governance) needs to happen at a small scale as well on aggregate to makes sure that the rules of the game are followed. 

Now, which of the architectures do you think will fit a data warehouse (DWH)? Probably multiple voices will opt for the skyscraper, at least this is how a DWH looks from the outside. However, when one evaluates the architecture behind it, it can resemble a residential complex in which parts are bound together, but there are parts that can be distributed if needed. For example, in a DWH the HR department has its own area that's isolated from the other areas as it has higher security demands. There can be 2-3 other areas that don't share objects, and they can be distributed as well. The reasons why all infrastructure is on one machine are the costs associated with the licenses, respectively the reporting tools point to only one address. 

In data marts based DWHs, there are multiple buildings within the architecture, and thus the data marts can be distributed across a wider infrastructure, with each domain responsible for its own data mart(s). The data marts are by definition domain-dependent, and this is one of the downsides imputed to this architecture. 

Previous Post <<||>> Next Post

17 March 2024

🧭Business Intelligence: Data Products (Part I: A Lego Exercise)

Business Intelligence
Business Intelligence Series

One can define a data product as the smallest unit of data-driven architecture that can be independently deployed and managed (aka product quantum) [1]. In other terms one can think of a data product like a box (or Lego piece) which takes data as inputs, performs several transformations on the data from which result several output data (or even data visualizations or a hybrid between data, visualizations and other content). 

At high-level each Data Analytics solution can be regarded as a set of inputs, a set of outputs and the transformations that must be performed on the inputs to generate the outputs. The inputs are the data from the operational systems, while the outputs are analytics data that can be anything from data to KPIs and other metrics. A data mart, data warehouse, lakehouse and data mesh can be abstracted in this way, though different scales apply. 

For creating data products within a data mesh, given a set of inputs, outputs and transformations, the challenge is to find horizontal and vertical partitions within these areas to create something that looks like a Lego structure, in which each piece of Lego represents a data product, while its color represents the membership to a business domain. Each such piece is self-contained and contains a set of transformations, respectively intermediary inputs and outputs. Multiple such pieces can be combined in a linear or hierarchical fashion to transform the initial inputs into the final outputs. 

Data Products with a Data Mesh
Data Products with a Data Mesh

Finding such a partition is possible though it involves a considerable effort, especially in designing the whole thing - identifying each Lego piece uniquely. When each department is on its own and develops its own Lego pieces, there's no guarantee that the pieces from the various domains will fit together to built something cohesive, performant, secure or well-structured. Is like building a house from modules, the pieces must fit together. That would be the role of governance (federated computational governance) - to align and coordinate the effort. 

Conversely, there are transformations that need to be replicated for obtaining autonomous data products, and the volume of such overlapping can be considerable high. Consider for example the logic available in reports and how often it needs to be replicated. Alternatively, one can create intermediary data products, when that's feasible. 

It's challenging to define the inputs and outputs for a Lego piece. Now imagine in doing the same for a whole set of such pieces depending on each other! This might work for small pieces of data and entities quite stable in their lifetime (e.g. playlists, artists, songs), but with complex information systems the effort can increase by a few factors. Moreover, the complexity of the structure increases as soon the Lego pieces expand beyond their initial design. It's like the real Lego pieces would grow within the available space but still keep the initial structure - strange constructs may result, which even if they work, change the gravity center of the edifice in other directions. There will be thus limits to grow that can easily lead to duplication of functionality to overcome such challenges.

Each new output or change in the initial input for this magic boxes involves a change of all the intermediary Lego pieces from input to output. Just recollect the last experience of defining the inputs and the outputs for an important complex report, how many iterations and how much effort was involved. This might have been an extreme case, though how realistic is the assumption that with data products everything will go smoother? No matter of the effort involved in design, there will be always changes and further iterations involved.

Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review

15 March 2024

🧊🗒️Data Warehousing: Data Mesh [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. 

Last updated: 17-Mar-2024

Data Products with a Data Mesh
Data Products with a Data Mesh

Data Mesh
  • {definition} "a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations" [1]
    • ⇐ there is no default standard or reference implementation of data mesh and its components [2]
  • {definition} a type of decentralized data architecture that organizes data based on different business domains [2]
    • ⇐ no centralized data architecture coexists with data mesh, unless in transition [1]
    • distributes the modeling of analytical data, the data itself and its ownership [1]
  • {characteristic} partitions data around business domains and gives data ownership to the domains [1]
    • each domain can model their data according to their context [1]
    • there can be multiple models of the same concept in different domains gives the data sharing responsibility to those who are most intimately familiar with the data [1]
    • endorses multiple models of the data
      • data can be read from one domain, transformed and stored by another domain [1]
  • {characteristic} evolutionary execution process
  • {characteristic} agnostic of the underlying technology and infrastructure [1]
  • {aim} respond gracefully to change [1]
  • {aim} sustain agility in the face of growth [1]
  • {aim} increase the ratio of value from data to investment [1]
  • {principle} data as a product
    • {goal} business domains become accountable to share their data as a product to data users
    • {goal} introduce a new unit of logical architecture that controls and encapsulates all the structural components needed to share data as a product autonomously [1]
    • {goal} adhere to a set of acceptance criteria that assure the usability, quality, understandability, accessibility and interoperability of data products*
    • usability characteristics
  • {principle} domain-oriented ownership
    • {goal} decentralize the ownership of sharing analytical data to business domains that are closest to the data [1]
    • {goal} decompose logically the data artefacts based on the business domain they represent and manage their life cycle independently [1]
    • {goal} align business, technology and analytical data [1]
  • {principle} self-serve data platform
    • {goal} provide a self-serve data platform to empower domain-oriented teams to manage and govern the end-to-end life cycle of their data products* [1]
    • {goal} streamline the experience of data consumers to discover, access, and use the data products [1]
  • {principle} federated computational governance
    • {goal} implement a federated decision making and accountability structure that balances the autonomy and agility of domains, while respecting the global conformance, interoperability and security of the mesh* [1]
    • {goal} codifying and automated execution of policies at a fine-grained level [1]
    • ⇐ the principles represent a generalization and adaptation of practices that address the scale of organization digitization* [1]
  • {concept} decentralization of data products
    • {requirement} ability to compose data across different modes of access and topologies [1]
      • data needs to be agnostic to the syntax of data, underlying storage type, and mode of access to it [1]
        • many of the existing composability techniques that assume homogeneous data won’t work
          • e.g.  defining primary and foreign key relationships between tables of a single schema [1]
    • {requirement} ability to discover and learn what is relatable and decentral [1]
    • {requirement} ability to seamlessly link relatable data [1]
    • {requirement} ability to relate data temporally [1]
  • {concept} data product 
    • the smallest unit of data-based architecture that can be independently deployed and managed (aka product quantum) [1]
    • provides a set of explicitly defined and data sharing contracts
    • provides a truthful portion of the reality for a particular domain (aka single slice of truth) [1]
    • constructed in alignment with the source domain [3]
    • {characteristic} autonomous
      • its life cycle and model are managed independently of other data products [1]
    • {characteristic} discoverable
      • via a centralized registry or catalog that list the available datasets with some additional information about each dataset, the owners, the location, sample data, etc. [1]
    • {characteristic} addressable
      • via a permanent and unique address to the data user to programmatically or manually access it [1] 
    • {characteristic} understandable
      • involves getting to know the semantics of its underlying data and the syntax in which the data is encoded [1]
      • describes which entities it encapsulates, the relationships between them, and their adjacent data products [1]
    • {characteristic} trustworthy and truthful
      • represents the fact of the business correctly [1]
      • provides data provenance and data lineage [1]
    • {characteristic} natively accessible
      • make it possible for various data users to access and read its data in their native mode of access [1]
      • meant to be broadcast and shared widely [3]
    • {characteristic} interoperable and composable
      • follows a set of standards and harmonization rules that allow linking data across domains easily [1]
    • {characteristic} valuable on its own
      • must have some inherent value for the data users [1]
    • {characteristic} secure
      • the access control is validated by the data product, right in the flow of data, access, read, or write [1] 
        • ⇐ the access control policies can change dynamically
    • {characteristic} multimodal 
      • there is no definitive 'right way' to create a data product, nor is there a single expected form, format, or mode that it is expected to take [3] 
    • shares its logs, traces, and metrics while consuming, transforming, and sharing data [1]
    • {concept} data quantum (aka product data quantum, architectural quantum) 
      • unit of logical architecture that controls and encapsulates all the structural components needed to share a data product [1]
        • {component} data
        • {component} metadata
        • {component} code
        • {component} policies
        • {component} dependencies' listing
    • {concept} data product observability
      • monitor the operational health of the mesh
      • debug and perform postmortem analysis
      • perform audits
      • understand data lineage
    • {concept} logs 
      • immutable, timestamped, and often structured events that are produced as a result of processing and the execution of a particular task [1]
      • used for debugging and root cause analysis
    • {concept} traces
      • records of causally related distributed events [1]
    • {concept} metrics
      • objectively quantifiable parameters that continue to communicate build-time and runtime characteristics of data products [1]
  • artefacts 
    • e.g. data, code, metadata, policies
Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
[2] Zhamak Dehghani (2019) How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (link)
[3] Adam Bellemare (2023) Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures

14 March 2024

🧭Business Intelligence: Architecture (Part I: Monolithic vs. Distributed and Zhamak Dehghani's Data Mesh - Debunked)

Business Intelligence
Business Intelligence Series

In [1] the author categorizes data warehouses (DWHs) and lakes as monolithic architectures, as opposed to data mesh's distributed architecture, which makes me circumspect about term's use. There are two general definitions of what monolithic means: (1) formed of a single large block (2) large, indivisible, and slow to change.

In software architecture one can differentiate between monolithic applications where the whole application is one block of code, multi-tier applications where the logic is split over several components with different functions that may reside on the same machine or are split non-redundantly between multiple machines, respectively distributed, where the application or its components run on multiple machines in parallel.

Distributed multi-tire applications are a natural evolution of the two types of applications, allowing to distribute redundantly components across multiple machines. Much later came the cloud where components are mostly entirely distributed within same or across distinct geo-locations, respectively cloud providers.

Data Warehouse vs. Data Lake vs. Lakehouse [2]

From licensing and maintenance convenience, a DWH resides typically on one powerful machine with many chores, though components can be moved to other machines and even distributed, the ETL functionality being probably the best candidate for this. In what concerns the overall schema there can be two or more data stores with different purposes (operational/transactional data stores, data marts), each of them with their own schema. Each such data store could be moved on its own machine though that's not feasible.

DWHs tend to be large because they need to accommodate a considerable number of tables where data is extracted, transformed, and maybe dumped for the various needs. With the proper design, also DWHs can be partitioned in domains (e.g. define one schema for each domain) and model domain-based perspectives, at least from a data consumer's perspective. The advantage a DWH offers is that one can create general dimensions and fact tables and build on top of them the domain-based perspectives, minimizing thus code's redundancy and reducing the costs.  

With this type of design, the DWH can be changed when needed, however there are several aspects to consider. First, it takes time until the development team can process the request, and this depends on the workload and priorities set. Secondly, implementing the changes should take a fair amount of time no matter of the overall architecture used, given that the transformations that need to be done on the data are largely the same. Therefore, one should not confuse the speed with which a team can start working on a change with the actual implementation of the change. Third, the possibility of reusing existing objects can speed up changes' implementation. 

Data lakes are distributed data repositories in which structured, unstructured and semi-structured data are dumped in raw form in standard file formats from the various sources and further prepared for consumption in other data files via data pipelines, notebooks and similar means. One can use the medallion architecture with a folder structure and adequate permissions for domains and build reports and other data artefacts on top. 

A data lake's value increases when is combined with the capabilities of a DWH (see dedicated SQL server pool) and/or analytics engine (see serverless SQL pool) that allow(s) building an enterprise semantic model on top of the data lake. The result is a data lakehouse that from data consumer's perspective and other aspects mentioned above is not much different than the DWH. The resulting architecture is distributed too. 

Especially in the context of cloud computing, referring to nowadays applications metaphorically (for advocative purposes) as monolithic or distributed is at most a matter of degree and not of distinction. Therefore, the reader should be careful!

Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
[2] Databricks (2022) Data Lakehouse (link)

13 March 2024

🔖Book Review: Zhamak Dehghani's Data Mesh: Delivering Data-Driven Value at Scale (2021)

Zhamak Dehghani's "Data Mesh: Delivering Data-Driven Value at Scale" (2021)

Zhamak Dehghani's "Data Mesh: Delivering Data-Driven Value at Scale" (2021) is a must read book for the data professional. So, here I am, finally managing to read it and give it some thought, even if it will probably take more time and a few more reads for the ideas to grow. Working in the fields of Business Intelligence and Software Engineering for almost a quarter-century, I think I can understand the historical background and the direction of the ideas presented in the book. There are many good ideas but also formulations that make me circumspect about the applicability of some assumptions and requirements considered. 

So, after data marts, warehouses, lakes and lakehouses, the data mesh paradigm seems to be the new shiny thing that will bring organizations beyond the inflection point with tipping potential from where organization's growth will have an exponential effect. At least this seems to be the first impression when reading the first chapters. 

The book follows to some degree the advocative tone of promoting that "our shiny thing is much better than previous thing", or "how bad the previous architectures or paradigms were and how good the new ones are" (see [2]). Architectures and paradigms evolve with the available technologies and our perception of what is important for businesses. Old and new have their place in the order of things, and the old will continue to exist, at least until the new proves its feasibility.  

The definition of the data mash as "a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations" [1] is too abstract even if it reflects at high level what the concept is about. Compared to other material I read on the topic, the book succeeds in explaining the related concepts as well the goals (called definitions) and benefits (called motivations) associated with the principles behind the data mesh, making the book approachable also by non-professionals. 

Built around four principles "data as a product", "domain-oriented ownership", "self-serve data platform" and "federated governance", the data mesh is the paradigm on which data as products are developed; where the products are "the smallest unit of architecture that can be independently deployed and managed", providing by design the information necessary to be discovered, understood, debugged, and audited.

It's possible to create Lego-like data products, data contracts and/or manifests that address product's usability characteristics, though unless the latter are generated automatically, put in the context of ERP and other complex systems, everything becomes quite an endeavor that requires time and adequate testing, increasing the overall timeframe until a data product becomes available. 

The data mesh describes data products in terms of microservices that structure architectures in terms of a collection of services that are independently deployable and loosely coupled. Asking from data products to behave in this way is probably too hard a constraint, given the complexity and interdependency of the data models behind business processes and their needs. Does all the effort make sense? Is this the "agility" the data mesh solutions are looking for?

Many pioneering organizations are still fighting with the concept of data mesh as it proves to be challenging to implement. At a high level everything makes sense, but the way data products are expected to function makes the concept challenging to implement to the full extent. Moreover, as occasionally implied, the data mesh is about scaling data analytics solutions with the size and complexity of organizations. The effort makes sense when the organizations have a certain size and the departments have a certain autonomy, therefore, it might not apply to small to medium businesses.

Previous Post <<||>>  Next Post

References:
[1] Zhamak Dehghani (2021) "Data Mesh: Delivering Data-Driven Value at Scale" (link)
[2] SQL-troubles (2024) Zhamak Dehghani's Data Mesh - Monolithic Warehouses and Lakes (link)

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 25 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.