Showing posts with label architecture. Show all posts
Showing posts with label architecture. Show all posts

11 June 2024

Business Intelligence: Is Microsoft Fabric Ready? (Part I: Some Random Thoughts)

Business Intelligence Series
Business Intelligence Series

When writing a Business Case, besides the problem and solution(s) high-level descriptions, is important to rougly estimate how much it costs, how long it takes, respectively how many resources are needed and for what activities. A proof-of-concept (PoC) might not need an explicit business case, though the same high-level information is needed at least for the planning of resources and a formal approval.

Given that there are several analytical experiences in Microsoft Fabric (MF), it’s clear that can’t be anymore a reference architecture that can be recommended for customers. Frankly, that ship has sailed even since the introduction of Microsoft Synapse, if not earlier, with the movement to the cloud. Also, there’s no one size fits all as certain building blocks make sense only in certain scenarios (e.g. organization scale, data volume or source’s type). Moreover, even if MF has been generally available for quite some time, customers and service providers ask themselves whether the available features are enough for building analytics solutions based on it. 

“Is Fabric Ready?” was the topic of today’s Explicit Measures webcast [1]. Probably the answer is as usual “it depends” and the general recommendation is to do a PoC to check solution's feasibility. Conversely, MF may be the best approach to consider if integration with other systems (e.g. Dynamics 365, Dataverse) is needed. 

What the customers need are some rough realistic estimates they can base any planning upon (at least for a PoC if not for the whole project) in terms of making the data available into OneLake, building a semantic model, respectively processing and making the data available for consumption. Ideally, one needs a translation of the various steps as done earlier. For example, how long it takes to make the data available in OneLake, how long it takes to move the data physically or logically though the various layers, to build semantic models, etc. 

Probably, some things can be achieved in a matter of days, at least if one knows what one’s doing. However, we are talking here about a new architecture that may resemble for some of an unknow territory. Even if old and new techniques can be mixed, there are further implications or improvements that can be considered. There are many webcasts, blog posts and other material on how to do things, on what’s possible, though building a functioning solution from beginning to the end, even as PoC, requires more than putting all this together. 

Just making the data flow from point A to B or C is not enough - data security, data governance and a few other topics like scalability and availability need to be considered as well. Security and governance are also the areas in which probably more features must be considered. For many customers starting now with MF, the hope is that most of these features will be available during the time the solutions are ready for production.

From a cost perspective, there’s the cost of data at rest, in transit, the licensing for MF and the other components involved. Ideally, one should start small and increase capacities as needed, though small can vary from case to case, while it’s important to find out the optimum. Starting in the middle could be an alternative approach even if may involve higher costs. If one starts small, the costs for PoC can be neglectable, though sooner or later a compromise is needed to provide an acceptable performance. 

In terms of human resources, the topic is more complex (see [2]), and it depends largely on the nature of the project. The pool of skillsets is the most important constraint or enabler such projects can have.

Previous Post <<||>> Next Post

References:
[1] Explicit Measures (2024) Power BI tips Ep.327: Is Fabric Ready? (link)
[2] Explicit Measures (2024) Power BI tips Ep.321: Building and BI Team (link)

06 May 2024

Microsoft Fabric: The Metrics Layer [new feature]

Introduction

One of the announcements of this year's Microsoft Fabric Community first conference was the introduction of a metrics layer in Fabric which "allows organizations to create standardized business metrics, that are rooted in measures and are discoverable and intended for reuse" [1]. As it seems, the information content provided at the conference was kept to a minimum given that the feature is still in private preview, though several webcasts start to catch up on the topic (see [2], [4]). Moreover, as part of their show, the Explicit Measures (@PowerBITips) hosts had Carly Newsome as invitee, the manager of the project, who unveiled more details about the project and the feature, details which became the main source for the information below. 

The idea of a metric layer or metric store is not new, data professionals occasionally refer to their structure(s) of metrics as such. The terms gained weight in their modern conception relatively recently in 2021-2022 (see [5], [6], [7], [8], [10]). Within the modern data stack, a metrics layer or metric store is an abstraction layer available between the data store(s) and end users. It allows to centrally define, store, and manage business metrics. Thus, it allows us to standardize and enforce a single source of truth (SSoT), respectively solve several issues existing in the data stacks. As Benn Stancil earlier remarked, the metrics layer is one of the missing pieces from the modern data stack (see [10]).

Microsoft's Solution

Microsoft's business case for metrics layer's implementation is based on three main ideas (1) duplicate measures contribute to poor data quality, (2) complex data models hinder self-service, (3) reduce data silos in Power BI. In Microsoft's conception the metric layer provides several benefits: consistent definitions and descriptions, easy management via management views, searchable and discoverable metrics, respectively assure trust through indicators. 

For this feature's implementation Microsoft introduces a new Fabric Item called a metric set that allows to group several (business) metrics together as part of a mini-model that can be tailored to the needs of a subset of end-users and accessed by them via the standard tools already available. The metric set becomes thus a mini-model. Such mini-models allow to break down and reduce the overall complexity of semantic models, while being easy to evolve and consume. The challenge will become then on how to break down existing and future semantic models into nonoverlapping mini-models, creating in extremis a partition (see the Lego metaphor for data products). The idea of mini-models is not new, [12] advocating the idea of using a Master Model, a technique for creating derivative tabular models based on a single tabular solution.

A (business) metric is a way to elevate the measures from the various semantic models existing in the organization within the mini-model defined by the metric set. A metric can be reused in other fabric artifacts - currently in new reports on the Power BI service, respectively in notebooks by copying the code. Reusing metrics in other measures can mean that one can chain metrics and the changes made will be further propagated downstream. 

The Metrics Layer in Microsoft Fabric (adapted diagram)
The Metrics Layer in Microsoft Fabric (adapted diagram)

Every metric is tied to the original semantic model which allows thus to track how a metric is used across the solutions and, looking forward to Purview, to identify data's lineage. A measure is related to a "table", the source from which the measure came from.

Users' Perspective

The Metrics Layer feature is available in Microsoft Fabric service for Power BI within the Metrics menu element next to Scorecards. One starts by creating a metric set in an existing workspace, an operation which creates the actual artifact, to which the individual metrics are added. To create a metric, a user with build permissions can navigate through the semantic models across different workspaces he/she has access to, pick a measure from one of them and elevate it to a metric, copying in the process its measure's definition and description. In this way the metric will always point back to the measure from the semantic model, while the metrics thus created are considered as a related collection and can be shared around accordingly. 

Once a metric is added to the metric set, one can add in edit mode dimensions to it (e.g. Date, Category, Product Id, etc.). One can then further explore a metric's output and add filters (e.g. concentrate on only one product or category) point from which one can slice-and-dice the data as needed.

There is a panel where one can see where the metric has been used (e.g. in reports, scorecards, and other integrations), when was last time refreshed, respectively how many times was used. Thus, one has the most important information in one place, which is great for developers as well as for the users. Probably, other metadata will be added, such as whether an increase in the metric would be favorable or unfavorable (like in Tableau Pulse, see [13]) or maybe levels of criticality, an unit of measure, or maybe its type - simple metric, performance indicator (PI), result indicator (RI), KPI, KRI etc.

Metrics can be persisted to the OneLake by saving their output to a delta table into the lakehouse. As demonstrated in the presentation(s), with just a copy-paste and a small piece of code one can materialize the data into a lakehouse delta table, from where the data can be reused as needed. Hopefully, the process will be further automated. 

One can consume metrics and metrics sets also in Power BI Desktop, where a new menu element called Metric sets was added under the OneLake data hub, which can be used to connect to a metric set from a Semantic model and select the metrics needed for the project. 

Tapping into the available Power BI solutions is done via an integration feature based on Sempy fabric package, a dataframe for storage and propagation of Power BI metadata which is part of the python-based semantic Link in Fabric [11].

Further Thoughts

When dealing with a new feature, a natural idea comes to mind: what challenges does the feature involve, respectively how can it be misused? Given that the metrics layer can be built within a workspace and that it can tap into the existing measures, this means that one can built on the existing infrastructure. However, this can imply restructuring, refactoring, moving, and testing a lot of code in the process, hopefully with minimal implications for the solutions already available. Whether the process is as simple as imagined is another story. As misusage, in extremis, data professionals might start building everything as metrics, though the danger might come when the data is persisted unnecessarily. 

From a data mesh's perspective, a metric set is associated with a domain, though there will be metrics and data common to multiple domains. Moreover, a mini-model has the potential of becoming a data product. Distributing the logic across multiple workspaces and domains can add further challenges, especially in what concerns the synchronization and implemented of requirements in a way that doesn't lead to bottlenecks. But this is a general challenge for the development team(s). 

The feature will probably suffer further changes until is released in public review (probably by September or the end of the year). I subscribe to other data professionals' opinion that the feature was for long needed and that can have an important impact on the solutions built. 

Previous Post <<||>> Next Post

Resources:
[1] Microsoft Fabric Blog (2024) Announcements from the Microsoft Fabric Community Conference (link)
[2] Power BI Tips (2024) Explicit Measures Ep. 236: Metrics Hub, Hot New Feature with Carly Newsome (link)
[3] Power BI Tips (2024) Introducing Fabric Metrics Layer / Power Metrics Hub [with Carly Newsome] (link)
[4] KratosBI (2024) Fabric Fridays: Metrics Layer Conspiracy Theories #40 (link)
[5] Chris Webb's BI Blog (2022) Is Power BI A Semantic Layer? (link)
[6] The Data Stack Show (2022) TDSS 95: How the Metrics Layer Bridges the Gap Between Data & Business with Nick Handel of Transform (link)
[7] Sundeep Teki (2022) The Metric Layer & how it fits into the Modern Data Stack (link)
[8] Nick Handel (2021) A brief history of the metrics store (link)
[9] Aurimas (2022) The Jungle of Metrics Layers and its Invisible Elephant (link)
[10] Benn Stancil (2021) The missing piece of the modern data stack (link)
[11] Microsoft Learn (2024) Sempy fabric Package (link)
[12] Michael Kovalsky (2019) Master Model: Creating Derivative Tabular Models (link)
[13] Christina Obry (2023) The Power of a Metrics Layer - and How Your Organization Can Benefit From It (link

03 April 2024

Business Intelligence: The Top 5 Pains of a BI/Analytics Manager

Business Intelligence Series
Business Intelligence Series

1) Business Strategy

A business strategy is supposed to define an organization's mission, vision, values, direction, purpose, goals, objectives, respectively the roadmap, alternatives, capabilities considered to achieve them. All this information is needed by the BI manager to sketch the BI strategy needed to support the business strategy. 

Without them, the BI manager must extrapolate, and one thing is to base one's decisions on a clearly stated and communicated business strategy, and another thing to work with vague declarations full of uncertainty. In the latter sense, it's like attempting to build castles into thin air and expecting to have a solid foundation. It may work as many BI requirements are common across organizations, but it can also become a disaster. 

2) BI/Data Strategy

Organizations usually differentiate between the BI and the data Strategy because different driving forces and needs are involved, even if there are common goals, needs and opportunities that must be considered from both perspectives. When there's no data strategy available, the BI manager is either forced to address thus many data-related topics (e.g. data culture, data quality, metadata management, data governance), or ignore them with all consequences deriving from this. 

A BI strategy is an extension of the business, data and IT strategies into the BI knowledge areas. Unfortunately, few organizations give it the required attention. Besides the fact that the BI strategy breaks down the business strategy from its perspective, it also adds its own goals and objectives which are ideally aligned with the ones from the other strategies. 

3) Data Culture

Data culture is "the collective beliefs, values, behaviors, and practices of an organization’s employees in harnessing the value of data for decision-making, operations, or insight". Therefore, data culture is an enabler which, when the many aspects are addressed adequately, can have a multiplier effect for the BI strategy and its execution. Conversely, when basic data culture assumptions and requirements aren't addressed, the interrelated issues resulting from this can prove to be a barrier for the BI projects, operations and strategy. 

As mentioned before, an organization’s (data) culture is created, managed, nourished, and destroyed through leadership. If the other leaders aren't playing along, each challenge related to data culture and BI will become a concern for the BI manager.

4) Managing Expectations 

A business has great expectations from the investment in its BI infrastructure, especially when the vendors promise competitive advantage, real-time access to data and insights, self-service capabilities, etc. Even if these promises are achievable, they represent a potential that needs to be harnessed and there are several premises that need to be addressed continuously. 

Some BI strategies and/or projects address these expectations from the beginning, though there are many organizations that ignore or don't give them the required importance. Unfortunately, these expectations (re)surface when people start using the infrastructure and this can easily become an acceptance issue. It's the BI manager's responsibility to ensure expectations are managed accordingly.

5) Building the Right BI Architecture

For the BI architecture the main driving forces are the shifts in technologies from single servers to distributed environments, from relational tables and data warehouses to delta tables and delta lakes built with the data mesh's principles and product-orientation in mind, which increase the overall complexity considerably. Vendors and data professionals' vision of how the architectures of the future will look like still has major milestones and challenges to surpass. 

Therefore, organizations are forced to explore the new architectures and the opportunities they bring, however this involves a considerable effort, skilled resources, and more iterations. Conversely, ignoring these trends might prove to be an opportunity lost and eventually duplicated effort on the long term.

22 March 2024

Business Intelligence: Monolithic vs. Distributed Architecture (Part III: Architectural Applications)

 

Business Intelligence
Business Intelligence Series

Now considering the 500 houses and the skyscraper model introduced in thee previous post, which do you think will be built first? A skyscraper takes 2-10 years to build, depending on the city in which is built and the architecture characteristics. A house may take 6-12 months depending on similar factors. But one needs to build 500 houses. For sure the process can be optimized when the houses look the same, though there are many constraints one needs to consider - the number of workers, tools, and the construction material available at a given time, the volume of planning, etc. 

Within a rough estimate, it can take 2-5 years for each architecture to be built considering that on the average the advantages and disadvantages from the various areas can balance each other out. Historical data are in general needed for estimating the actual development time. One can start with a rough estimate and reevaluate the estimates up and down as more information are gathered. This usually happens in Software Engineering as well. 

Monolith vs. Distributed Architecture
Monolith vs. Distributed Architecture - 500 families

There are multiple ways in which the work can be assigned to the contractors. When the houses are split between domains, each domain can have its own contractor(s) or the contractors can be specialized by knowledge areas, or a combination of the two. Contractors’ performance should be the same, though in practice no two contractors are the same. Conversely, the chances are higher for some contractors to deliver at the expected quality. It would be useful to have worked before with the contractors and have a partnership that spans years back. There are risks on both sides, even if the risks might favor one architecture over the other, and this depends also on the quality of the contractors, designs, and planning. 

The planning must be good if not perfect to assure smooth development and each day can cost money when contractors are involved. The first planning must be done for the whole project and then split individually for each contractor and/or group of buildings. A back-and-forth check between the various plans is needed. Managing by exception can work, though it can also go terribly wrong. 

Lot of communication must occur between domains to make sure that everything fits together. Especially at the beginning, all the parties must plan together, must make sure that the rules of the games (best practices, policies, procedures, processes, methodologies) are agreed upon. Oversight (governance) needs to happen at a small scale as well on aggregate to makes sure that the rules of the game are followed. 

Now, which of the architectures do you think will fit a data warehouse (DWH)? Probably multiple voices will opt for the skyscraper, at least this is how a DWH looks from the outside. However, when one evaluates the architecture behind it, it can resemble a residential complex in which parts are bound together, but there are parts that can be distributed if needed. For example, in a DWH the HR department has its own area that's isolated from the other areas as it has higher security demands. There can be 2-3 other areas that don't share objects, and they can be distributed as well. The reasons why all infrastructure is on one machine are the costs associated with the licenses, respectively the reporting tools point to only one address. 

In data marts based DWHs, there are multiple buildings within the architecture, and thus the data marts can be distributed across a wider infrastructure, with each domain responsible for its own data mart(s). The data marts are by definition domain-dependent, and this is one of the downsides imputed to this architecture. 

Previous Post <<||>> Next Post

Business Intelligence: Monolithic vs. Distributed Architecture (Part II: Architectural Choices)

Business Intelligence
Business Intelligence Series

One metaphor that can be used to understand the difference between monolith and distributed architectures, respectively between data warehouses and data mesh-based architectures as per Dehghani’s definition [1] - think that you need to accommodate 500 families (the data products to be built). There are several options: (1) build a skyscraper (developing on vertical) (2) build a complex of high buildings and develop by horizontal and vertical but finding a balance between the two; (3) to split (aka distribute) the second option and create several buildings; (4) build for each family a house, creating a village or a neighborhood. 

Monolith vs. Distributed Architecture
Monolith vs. Distributed Architecture - 500 families

(1) and (2) fit the definition of monoliths, whiles (3) and (4) are distributed architectures, though also in (3) one of the buildings can resemble a monolith if one chooses different architectures and heights for the buildings. For houses one can use a single architecture, agree on a set of predefined architectures, or have an architecture for each house, so that houses would look alike only by chance. One can also opt to have the same architecture for the buildings belonging to the same neighborhood (domain or subdomain). Moreover, the development could be split between multiple contractors that adhere to the same standards.

If the land is expensive, for example in big, overpopulated cities, when the infrastructure and the terrain allow it, one can build entirely on vertical, a skyscraper. If the land is cheap one can build a house for each family. The other architectures can be considered for everything in between.

A skyscraper is easier for externals to find (mailmen, couriers, milkmen, and other service providers) though will need a doorman to interact with them and probably a few other resources. Everybody will have the same address except the apartment number. There must be many elevators and the infrastructure must allow the flux of utilities up and down the floors, which can be challenging to achieve.

Within a village every person who needs to deliver or pick up something needs to traverse parts of the village. There are many services that need to be provided for both scenarios though the difference it will be the time that's needed to move in between addresses. In the virtual world this shouldn't matter unless one needs to inspect each house to check and/or retrieve something. The network of streets and the flux of utilities must scale with the population from the area.

A skyscraper will need materials of high quality that resist the various forces that apply on the building even in the most extreme situations. Not the same can be said about a house, which in theory needs more materials though a less solid foundation and the construction specifications are more relaxed. Moreover, a house needs smaller tools and is easier to build, unless each house has own design.

A skyscraper can host the families only when the construction is finished, and the needed certificates were approved. The same can be said about houses but the effort and time is considerably smaller, though the utilities must be also available, and they can have their own timeline.

The model is far from perfect, though it allows us to reason how changing the architecture affects various aspects. It doesn't reflect the reality because there's a big difference between the physical and virtual world. E.g., parts of the monolith can be used productively much earlier (though the core functionality might become available later), one doesn't need construction material but needs tool, the infrastructure must be available first, etc. Conversely, functional prototypes must be available beforehand, the needed skillset and a set of assumptions and other requirements must be met, etc.

Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)

17 March 2024

Business Intelligence: Data Products (Part II: The Complexity Challenge)

Business Intelligence
Business Intelligence Series

Creating data products within a data mesh resumes in "partitioning" a given set of inputs, outputs and transformations to create something that looks like a Lego structure, in which each Lego piece represents a data product. The word partition is improperly used as there can be overlapping in terms of inputs, outputs and transformations, though in an ideal solution the outcome should be close to a partition.

If the complexity of inputs and outputs can be neglected, even if their number could amount to a big number, not the same can be said about the transformations that must be performed in the process. Moreover, the transformations involve reengineering the logic built in the source systems, which is not a trivial task and must involve adequate testing. The transformations are a must and there's no way to avoid them. 

When designing a data warehouse or data mart one of the goals is to keep the redundancy of the transformations and of the intermediary results to a minimum to minimize the unnecessary duplication of code and data. Code duplication becomes usually an issue when the logic needs to be changed, and in business contexts that can happen often enough to create other challenges. Data duplication becomes an issue when they are not in synch, fact derived from code not synchronized or with different refresh rates.

Building the transformations as SQL-based database objects has its advantages. There were many attempts for providing non-SQL operators for the same (in SSIS, Power Query) though the solutions built based on them are difficult to troubleshoot and maintain, the overall complexity increasing with the volume of transformations that must be performed. In data mashes, the complexity increases also with the number of data products involved, especially when there are multiple stakeholders and different goals involved (see the challenges for developing data marts supposed to be domain-specific). 

To growing complexity organizations answer with complexity. On one side the teams of developers, business users and other members of the governance teams who together with the solution create an ecosystem. On the other side, the inherent coordination and organization meetings, managing proposals, the negotiation of scope for data products, their design, testing, etc.  The more complex the whole ecosystem becomes, the higher the chances for systemic errors to occur and multiply, respectively to create unwanted behavior of the parties involved. Ecosystems are challenging to monitor and manage. 

The more complex the architecture, the higher the chances for failure. Even if some organizations might succeed, it doesn't mean that such an endeavor is for everybody - a certain maturity in building data architectures, data-based artefacts and managing projects must exist in the organization. Many organizations fail in addressing basic analytical requirements, why would one think that they are capable of handling an increased complexity? Even if one breaks the complexity of a data warehouse to more manageable units, the complexity is just moved at other levels that are more difficult to manage in ensemble. 

Being able to audit and test each data product individually has its advantages, though when a data product becomes part of an aggregate it can be easily get lost in the bigger picture. Thus, is needed a global observability framework that allows to monitor the performance and health of each data product in aggregate. Besides that, there are needed event brokers and other mechanisms to handle failure, availability, security, etc. 

Data products make sense in certain scenarios, especially when the complexity of architectures is manageable, though attempting to redesign everything from their perspective is like having a hammer in one's hand and treating everything like a nail.

Previous Post <<||>> Next Post

14 March 2024

Business Intelligence: Monolithic vs. Distributed Architecture (Part I: Zhamak Dehghani's Data Mesh - Debunked)

Business Intelligence
Business Intelligence Series

In [1] the author categorizes data warehouses (DWHs) and lakes as monolithic architectures, as opposed to data mesh's distributed architecture, which makes me circumspect about term's use. There are two general definitions of what monolithic means: (1) formed of a single large block (2) large, indivisible, and slow to change.

In software architecture one can differentiate between monolithic applications where the whole application is one block of code, multi-tier applications where the logic is split over several components with different functions that may reside on the same machine or are split non-redundantly between multiple machines, respectively distributed, where the application or its components run on multiple machines in parallel.

Distributed multi-tire applications are a natural evolution of the two types of applications, allowing to distribute redundantly components across multiple machines. Much later came the cloud where components are mostly entirely distributed within same or across distinct geo-locations, respectively cloud providers.

Data Warehouse vs. Data Lake vs. Lakehouse [2]

From licensing and maintenance convenience, a DWH resides typically on one powerful machine with many chores, though components can be moved to other machines and even distributed, the ETL functionality being probably the best candidate for this. In what concerns the overall schema there can be two or more data stores with different purposes (operational/transactional data stores, data marts), each of them with their own schema. Each such data store could be moved on its own machine though that's not feasible.

DWHs tend to be large because they need to accommodate a considerable number of tables where data is extracted, transformed, and maybe dumped for the various needs. With the proper design, also DWHs can be partitioned in domains (e.g. define one schema for each domain) and model domain-based perspectives, at least from a data consumer's perspective. The advantage a DWH offers is that one can create general dimensions and fact tables and build on top of them the domain-based perspectives, minimizing thus code's redundancy and reducing the costs.  

With this type of design, the DWH can be changed when needed, however there are several aspects to consider. First, it takes time until the development team can process the request, and this depends on the workload and priorities set. Secondly, implementing the changes should take a fair amount of time no matter of the overall architecture used, given that the transformations that need to be done on the data are largely the same. Therefore, one should not confuse the speed with which a team can start working on a change with the actual implementation of the change. Third, the possibility of reusing existing objects can speed up changes' implementation. 

Data lakes are distributed data repositories in which structured, unstructured and semi-structured data are dumped in raw form in standard file formats from the various sources and further prepared for consumption in other data files via data pipelines, notebooks and similar means. One can use the medallion architecture with a folder structure and adequate permissions for domains and build reports and other data artefacts on top. 

A data lake's value increases when is combined with the capabilities of a DWH (see dedicated SQL server pool) and/or analytics engine (see serverless SQL pool) that allow(s) building an enterprise semantic model on top of the data lake. The result is a data lakehouse that from data consumer's perspective and other aspects mentioned above is not much different than the DWH. The resulting architecture is distributed too. 

Especially in the context of cloud computing, referring to nowadays applications metaphorically (for advocative purposes) as monolithic or distributed is at most a matter of degree and not of distinction. Therefore, the reader should be careful!

Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
[2] Databricks (2022) Data Lakehouse (link)

21 October 2023

Data Warehousing: Dynamics 365, the Data Lakehouse and the Medallion Architecture

Data Warehousing
Data Warehousing Series

An IT architecture is built and functions under a set of constraints that derive from architecture’s components. Usually, if we want flexibility or to change something in one area, this might have an impact in another area. This rule applies to the usage of the medallion architecture as well! 

In Data Warehousing the medallion architecture considers a multilayered approach in building a single source of truth, each layer denoting the quality of data stored in the lakehouse [1]. For the moment are defined 3 layers - bronze for raw data, silver for validated data, and gold for enriched data. The concept seems sound considering that a Data Lake contains all types of raw data of different quality that needs to be validated and prepared for reporting or other purposes.

On the other side there are systems like Dynamics 365 that synchronize the data in near-real-time to the Data Lake through various mechanisms at table and/or data entity level (think of data entities as views on top of other tables or views). The databases behind are relational and in theory the data should be of proper quality as needed by business.

The greatest benefit of serverless SQL pool is that it can be used to build near-real-time data analytics solutions on top of the files existing in the Data Lake and the mechanism is quite simple. On top of such files are built external tables in serverless SQL pool, tables that reflect the data model from the source systems. The external tables can be called as any other tables from the various database objects (views, stored procedures and table-valued functions). Thus, can be built an enterprise data model with dimensions, fact-like and mart-like entities on top of the synchronized filed from the Data Lake. The Data Lakehouse (= Data Warehouse + Data Lake) thus created can be used for (enterprise) reporting and other purposes.

As long as there are no special requirements for data processing (e.g. flattening hierarchies, complex data processing, high-performance, data cleaning) this approach allows to report the data from the data sources in near-real time (10-30 minutes), which can prove to be useful for operational and tactical reporting. Tapping into this model via standard Power BI and paginated reports is quite easy. 

Now, if it's to use the data medallion approach and rely on pipelines to process the data, unless one is able to process the data in near-real-time or something compared with it, a considerable delay will be introduced, delay that can span from a couple of hours to one day. It's also true that having the data prepared as needed by the reports can increase the performance considerably as compared to processing the logic at runtime. There are advantages and disadvantages to both approaches. 

Probably, the most important scenario that needs to be handled is that of integrating the data from different sources. If unique mappings between values exist, unique references are available in one system to the records from the other system, respectively when a unique logic can be identified, the data integration can be handled in serverless SQL pool.

Unfortunately, when compared to on-premise or Azure SQL functionality, the serverless SQL pool has important constraints - it's not possible to use scalar UDFs, tables, recursive CTEs, etc. So, one needs to work around these limitations and in some cases use the Spark pool or pipelines. So, at least for exceptions and maybe for strategic reporting a medallion architecture can make sense and be used in parallel. However, imposing it on all the data can reduce flexibility!

Bottom line: consider the architecture against your requirements!

Previous Post <<||>>> Next Post

[1] What is the medallion lakehouse architecture?
https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion

03 July 2023

Data Migrations (DM): Comments on "Planning for Successful Data Migration" I (Architecture Aspects in Dynamics 365 Finance & Operations)

 

Data Migrations
Data Migrations Series

Introduction

This weekend I read the chapter 5 on Data Migration (Planning for Successful Data Migration) from Brent Dawson’s recently released book "Becoming a Dynamics 365 Finance and Supply Chain Solution Architect" (published by Packt Publishing, available on Amazon). The section on best practices makes many good points, however some of the practices require further clarifications, while some statements can be questionable as the context associated with them can make an important difference.  Overall however the recommendations hold.

Concerining Data Migrations (DM), besides a few teachnical recommendations, the author makes also several architectural recommendations that can be summarized as follows:

(1) put the data into a backup system or database, if possible, and use that system to the data extraction parts of the DM tasks;
(2) use a Tier 2 system for the majority of the development of the data packages;
(3) once the data packages validated, they can be used against production environments;
(4) don’t use the OData protocol for data transfer, but use the Batch API instead;
(5) don’t use dual-write for DM (technology used for data integrations), first complete the DM and after that enable the dual-write;
(6)  have a backup of the environments involved;
(7) have a good internet connection;
(8) plan an environment for DM (at a 2-tier environment, distinct from the one used for functional testing);
(9) for the gold configurations have an environment with limited access.

General Aspects
 
In a Data Migration there are at least 2 systems involved, though in more complex scenarios there can be one more source systems, respectively one or more target systems. At minimum there is a source and a target system.

Ideally, a target production environment should not be used for testing the data migration! On the other side, as long there’s a backup with a given state of the system (e.g. only configuration data, without master or transactional data) a system can be always restored to a previous state. This applies to D365 or to any system for which a database backup and restore can be applied. Even so, as best practice it isn’t recommended to use a production environment for testing as this can increase the complexity of the data migration.

Moreover, the same constraint applies also to the sandbox used for UAT (User Acceptance Testing), given that is supposed to represent at different points in time the same state as the production environment). Thus, at least a third environment will be needed.

There are no hard constraints on the source systems. Ideally, one should use the production source system(s) or environments that resemble the production environments. A read replica of the respective environment(s) will work as well, given that there are typically only reads involved.

The downside of accessing directly a production environment for DM is that the data changes frequently, which makes it more difficult to validate the DM logic – the time factor needing to be considered – data being added, deleted or changed. That’s why an environment with a recent snapshot from production would facilitate the process and would make sure that the DM workloads don’t affect production environment’s performance.

Often, a better alternative would be to have a database in between (aka DM layer) that contains only the data in scope of DM. ETL (Extract Transfer Load) jobs can extract the data on demand and in a consistent manner, this approach assuring a snapshot. This layer can be used to build, test and troubleshoot the DM logic, before Go-Live and after, as issues will be more likely raised by the business and will need to be mitigated.

There are also scenarios in which the direct access to source systems is not possible, a push, respectively a push & pull scenario being needed. If possible, it would be great if the data needed for migration could be exported directly from the source system(s) as needed by the target system(s). In some scenarios this might be achievable, though the bigger the differences in schemas betweeen the systems and the more complex the data, the more transformations are needed, respectively the more difficult it becomes to achieve this. Therefore, moving such logic to an intermediate DM layer would facilitate the DM architecture allowing to address many of the challenges. 
 
Batch API
 
Using Batch API could be a solution when the source environments allow only API access to the data (thus no direct access over SQL scripting) or when the volume of data makes the alternatives unusable. Indeed, OData seems to be slow or unusable when the volume of data exceeds a given threshold, even if the calls can be partitioned.

Another scenario for Batch API is when the source and target systems need to operate in parallel for a considerable amount of time that would make other approaches unusable. Even if a DM typically involves the replacement of one or more systems, there can be exceptions. Such scenario increases a DM’s complexity by several factors and should be avoided. Even if such scenarios seem to be logical and approachable at first sight, the benefits can be easily outrun by the downsides.

Backup

Hopefully, your organization has a backup and restore strategy for the production and other essential environments! The strategy needs to be extended also for the further environments available during the implementation. It’s also true that until Go-Live the target environments don’t suffer many changes. Ideally, a backup should be taken at least when important changes are made to the systems. This can involve the configuration as well the DM. E.g. a setup would be required after the configuration is completed, when the master data, respectively when the transactional data was migrated. A backup of all the systems involved should be taken before Go-Live.

Gold Configuration

Having a system with the gold configuration (the values used to configure the system) available can indeed facilitate the implementation and there are two main reasons for this. Primarily, the gold configuration allows to build reliable processes around its maintenance and to minimize the risk of having discrepancies between expectations and reality. Secondly, the database with the gold configuration can be used to easily setup a new environment and this might be needed often than thought (e.g. for dry runs).

However, in praxis the technical value is easily overrun by the financial aspect as such an environment is barely used and can involve significant costs. As alternative one can use the DAT legal entity from an available environment for storing the gold configuration common across all the legal entities and easily copy it to the other legal entities. In addition, it’s needed to document the deviations, however it’s recommended to document all configurations and use this as baseline for the post-Go-Live changes.
Indeed, the access to the gold configuration should be restrained as much as possible (e.g. only admin, consultants and/or data owners) and change policies should be enforced. Otherwise, one risks having different configurations between the environments. For Go-Live it is critical that the UAT and Go-Live environments have the same configuration.

Independently of the approach used to maintain the gold configuration, it’s recommended to perform a comparison between UAT and production environments to make sure that there are no differences. The comparison can be handled also via SQL scripts, the effort being well-spent when such comparisons needed to be done several times. Even if the data from production isn’t directly accessible, a snapshot of the production database can be copied in another environment. However, this approach requires a good understanding of the tables and/or entities involved. There will be cases (e.g. module parameters) in which it’s easier to perform a manual comparison.

Wrap Up

Coming back to the recommendations, the only points that require some discussion are (1), (2) and maybe (8), while (9) was discussed above (see 'Gold configuration' section).

The recommendation of putting the data into a backup system or database is too vague. A backup system can mean a backup database that can be accessible typically only over DRBMS or an instance of the system having a copy of the data (which usually implies a RDBMS as well). Besides these, a database can refer to a read replica of source system's database or to a DM layer.

Besides price and performance, the main differences between a Tier-1 and a Tier-2 environment (see also the Microsoft documentation) rely in the number of VM machines (aka boxes) involved, how the various components are distributed between them, respectively the edition of SQL Server used. Otherwise, for the users the system will look the same. The most important constraint is that a Tier-1 isn't suitable for UAT or performance testing. In other words, the environment will be slow for concurrent use.

If the performance is acceptable, if the volume of data and the number of users is small, a Tier-1 environment can be used for building data packages, performing initial DM dry-runs and other tasks. However, a Tier-2 resembles closer the production environment and if the UAT is performed using such a system, the more likely is to identify and address the bottlenecks related to performance. Unless they accept the costs blindly, the customers will need to trade between performance and costs from the perspective of their requirements and their business context. 

29 March 2021

Notes: Team Data Science Process (TDSP)

Team Data Science Process (TDSP)
Acronyms:
Artificial Intelligence (AI)
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Data Mining (DM)
Knowledge Discovery in Databases (KDD)
Team Data Science Process (TDSP) 
Version Control System (VCS)
Visual Studio Team Services (VSTS)

Resources:
[1] Microsoft Azure (2020) What is the Team Data Science Process? [source]
[2] Microsoft Azure (2020) The business understanding stage of the Team Data Science Process lifecycle [source]
[3] Microsoft Azure (2020) Data acquisition and understanding stage of the Team Data Science Process [source]
[4] Microsoft Azure (2020) Modeling stage of the Team Data Science Process lifecycle [source
[5] Microsoft Azure (2020) Deployment stage of the Team Data Science Process lifecycle [source]
[6] Microsoft Azure (2020) Customer acceptance stage of the Team Data Science Process lifecycle [source]

13 January 2021

SQL Reloaded: D365FO Security Queries

The security architecture of Microsoft Dynamics 365 for Finance and Operations (D365FO) is based on a role-based model in which the access is not granted individually to users but through security roles. A set of roles are assigned to a user, each role having access to a set of privileges. In between duties can be assigned to one or more roles, respectively duties can contain several privileges. The model comes with a default set of security roles, which can be further extended to reflect organization's needs. 

Navigating through the model via the D365FO's UI isn't that straightforward as it should be. Probably it's much easier to export to Excel the associations between roles and privileges, respectively between roles and duties, or search punctually for a certain value within the same dataset directly in the database. The following queries can be run into a non-production environment:

-- security role's assigned privileges
SELECT SRO.AOTName [Role AOT Name]
, SRO.Name [Role Name]
, SRO.Description [Role Description]
, SPI.Identifier [Privilege Identifier]
, SPI.Name [Privilege Name]
, SPI.Description [Privilege Description]
FROM dbo.SecurityRolePrivilegeExplodedGraph SRP
     JOIN dbo.SecurityRole SRO
          ON SRP.SecurityRole = SRO.RecId
         JOIN dbo.SecurityPrivilege SPI
           ON SRP.SecurityPrivilege = SPI.RecId
WHERE SPI.NAME IN ('Maintain accounts receivable aging period definitions'
, 'Maintain accounts payable aging period definitions')
ORDER BY SRO.AOTName


-- security role's assigned duties
SELECT  SRO.AOTName [Role AOT Name]
, SRO.Name [Role Name]
, SRO.Description [Role Description]
, SDU.Identifier [Duty Identifier]
, SDU.Name [Duty Name]
, SDU.Description [Duty Description]
FROM dbo.SecurityRoleDutyExplodedGraph SRP
     JOIN dbo.SecurityRole SRO
          ON SRP.SecurityRole = SRO.RecId
        JOIN dbo.SecurityDuty SDU
          ON SRP.SecurityDuty = SDU.RecId
WHERE SRO.Name LIKE 'Accounting Manager%'
ORDER BY SDU.Identifier 

Below you can find a short description of the security tables considered above:

TableDescription
SecurityDutycontains the list of duties defined by the security AOT role node
SecurityPrivilegecontains the list of privileges defined by the security AOT role node
SecurityRolecontains the list of roles defined by the security AOT role node
SecurityRoleDutyExplodedGraphcontains the list of role to duty mappings and role to privilege mappings as defined by the AOT security role
SecurityRoleExplodedGraphcontains all role relationships, direct or indirect, as defined by the AOT sub role nodes of the security role nodes.
SecurityRolePrivilegeExplodedGraphcontains the list of role to privilege mappings and role to privilege mappings as defined by the AOT security role
SecurityUserRolecontains the user to role mappings

Resources:
[1] Microsoft Dynamics 365 (2020) Security architecture [source]
[2] Microsoft Dynamics 365 (2020) Role-based security [source]

27 December 2020

Data Warehousing: Data Vault 2.0 (The Good, the Bad and the Ugly)

Data Warehousing
Data Warehousing Series

One of the interesting concepts that seems to gain adepts in Data Warehousing is the Data Vault – a methodology, architecture and implementation for Data Warehouses (DWH) developed by Dan Linstedt between 1990 and 2000, and evolved into an open standard with the 2.0 version.

According to its creator, the Data Vault is a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more business functional areas [2]. To hold data at the lowest grain of detail from the source system(s) and track the changes occurred in the data, it splits the fact and dimension tables into hubs (business keys), links (the relationships between business keys), satellites (descriptions of the business keys), and reference (dropdown values) tables [3], while adopting a hybrid approach between 3rd normal form and star schemas. In addition, it provides a two- or three-layered data integration architecture, a series of standards, methods and best practices supposed to facilitate its use.

It integrates several other methodologies that allow bridging the gap between the technical, logistic and execution parts of the DWH life-cycle – the PMI methodology is used for the various levels of planning and execution, while the Scrum methodology is used for coordinating the day-to-day project tasks. Six Sigma is used together with Total Quality Management for the design and continuous improvement of DWH and data-related processes. In addition, it follows the CMMI maturity model for providing a clear baseline for benchmarking an organization’s DWH capabilities in development, acquisition and service areas.

The Good: The decomposition of the source data models into hub, link and satellite tables provides traceability and auditability at raw data level, allowing thus to address the compliance requirements of Sarabanes-Oxley, HIPPA and Basel II by design.

The considered standards, methods, principles and best practices are leveraged from Software Engineering [1], establishing common ground and a standardized approach to DWH design, implementation and testing. It also narrows down the learning and implementation paths, while allowing an incremental approach to the various phases.

Data Vault 2.0 offers support for real-time, near-real-time and unstructured data, while new technologies like MapReduce, NoSQL can be integrated within its architecture, though the same can be said about other approaches as long there’s compatibility between the considered technologies. In fact, except business entities’ decomposition, many of the notions used are common to DWH design.

The Bad: Further decomposing the fact and dimension tables can impact the performance of the queries run against the tables as more joins are required to gather the data from the various tables. The further denormalization of tables can lead to higher data storage needs, though this can be neglectable compared with the volume of additional objects that need to be created in DWH. For an ERP system with a few hundred of meaningful tables the complexity can become overwhelming.

Unless one uses a COTS tool which automates some part of the design and creation process, building everything from scratch can be time-consuming, increasing thus the time-to-market for solutions. However, the COTS tools can introduce restrictions of their own, which can negatively impact the overall experience with the methodology.

The incorporation of non-technical methodologies can have positive impact, though unless one has experience with the respective methodologies, the disadvantages can easily overshadow the (theoretical) advantages.

The Ugly: The dangers of using Data Vault can be corroborated as usual with the poor understanding of the methodology, poor level of skillset or the attempt of implementing the methodology without allowing some flexibility when required. Unless one knows what he is doing, bringing more complexity in a field which is already complex, can easily impact negatively projects’ outcomes.

Previous Post <<||>> Next Post

References:
[1] Dan Linstedt & Michael Olschimke (2015) Building a Scalable Data Warehouse with Data Vault 2.0
[2] Dan Linstedt (?) Data Vault Basics [source]
[3] Dan Linstedt (2018) Data Vault: Data Modeling Specification v 2.0.2 [source]

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.