22 March 2024

🧭Business Intelligence: Monolithic vs. Distributed Architecture (Part II: Architectural Choices)

Business Intelligence
Business Intelligence Series

One metaphor that can be used to understand the difference between monolithic and distributed architectures, respectively between data warehouses and data mesh-based architectures as per Dehghani’s definition [1], is the following: imagine you need to accommodate 500 families (the data products to be built). There are several options: (1) build a skyscraper (developing vertically); (2) build a complex of tall buildings, developing both horizontally and vertically while finding a balance between the two; (3) split (aka distribute) the second option and create several buildings; (4) build a house for each family, creating a village or a neighborhood.

Monolith vs. Distributed Architecture - 500 families

(1) and (2) fit the definition of monoliths, while (3) and (4) are distributed architectures, though in (3) one of the buildings can still resemble a monolith if one chooses different architectures and heights for the buildings. For houses one can use a single architecture, agree on a set of predefined architectures, or have a distinct architecture for each house, so that houses look alike only by chance. One can also opt to use the same architecture for the buildings belonging to the same neighborhood (domain or subdomain). Moreover, the development could be split between multiple contractors that adhere to the same standards.

If the land is expensive, for example in big, overpopulated cities, and the infrastructure and the terrain allow it, one can build entirely vertically - a skyscraper. If the land is cheap, one can build a house for each family. The other architectures can be considered for everything in between.

A skyscraper is easier for externals to find (mailmen, couriers, milkmen, and other service providers), though it will need a doorman to interact with them and probably a few other resources. Everybody will have the same address except for the apartment number. There must be many elevators, and the infrastructure must allow the flow of utilities up and down the floors, which can be challenging to achieve.

Within a village, every person who needs to deliver or pick up something must traverse parts of the village. Many services need to be provided in both scenarios, though the difference lies in the time needed to move between addresses. In the virtual world this shouldn't matter, unless one needs to inspect each house to check and/or retrieve something. The network of streets and the flow of utilities must scale with the population of the area.

A skyscraper needs materials of high quality that resist the various forces acting on the building even in the most extreme situations. The same cannot be said about a house, which in theory needs more materials overall, though a less solid foundation, and the construction specifications are more relaxed. Moreover, a house needs smaller tools and is easier to build, unless each house has its own design.

A skyscraper can host the families only when the construction is finished and the needed certificates have been approved. The same can be said about houses, though the effort and time are considerably smaller; however, the utilities must also be available, and they can have their own timeline.

The model is far from perfect, though it allows us to reason about how changing the architecture affects various aspects. It doesn't reflect reality because there's a big difference between the physical and the virtual world. E.g., parts of the monolith can be used productively much earlier (though the core functionality might become available later), one doesn't need construction materials but tools, the infrastructure must be available first, etc. Conversely, functional prototypes must be available beforehand, the needed skillset and a set of assumptions and other requirements must be met, etc.

Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)

20 March 2024

🗄️Data Management: Master Data Management (Part I: Understanding Integration Challenges) [Answer]

Data Management
Data Management Series

An answer to Piethein Strengholt’s post [1] on Master Data Management’s (MDM) integration challenges; Strengholt is the author of "Data Management at Scale".

Master data can be managed within individual domains, though the boundaries must be clearly defined, and some coordination is needed. Attempting to partition the entities based on domains doesn’t always work. The partition needs to be performed at attribute level, though even then there might be exceptions involved (e.g. some Products are only for Finance to use). One can then identify attributes inside the system to create the boundaries.

MDM is simple if you have the right systems, processes, procedures, roles, and data culture in place. Unfortunately, people make it too complicated – oh, we need a nice shiny system for managing the data before it is entered in the ERP or other systems, we need a system for storing and maintaining the metadata, and another system for managing the policies, and the story goes on. The lack of systems is given as the reason why no progress is made. Moreover, people will want to integrate the systems, increasing the overall complexity of the ecosystem.

The data should be cleaned in the source systems and assessed against them. If that's not possible, then you have the wrong system! A set of well-built reports can make data assessment possible.

The metadata and policies can be maintained in Excel (and stored in SharePoint), in SharePoint itself, or in a similar system that supports versioning. Pragmatic solutions can be found for other topics as well.

ERP systems allow us to define workflows and enable a master data record to be published only when the information is complete, though there will always be exceptions (e.g., a Purchase Order must be sent today). Such exceptions make people circumvent the MDM systems with all the issues deriving from this.

Adding an MDM system within an architecture tends to increase the complexity of the overall infrastructure and create more bottlenecks. Occasionally, it just replicates the structures existing in the target system(s).

Integrations are supposed to reduce the effort, though in the past 20 years I never saw an integration work without issues, even where MDM is concerned. One of the main issues is that the solutions just synchronized the data without considering the processual dependencies, and sometimes not even the referential dependencies. The time needed for troubleshooting the integrations can easily exceed the time needed for importing the data manually via an upload mechanism.

To make the integration work, the MDM system ends up duplicating all the validation available in the target system(s). This can make sense when a considerable volume of master data is created daily or weekly. Native connectors simplify the integrations, especially when they can handle errors transparently and allow records to be modified manually, though the issues start as soon as the target system is extended with more attributes or other structures.

If an organization has an MDM system, then all the master data should come from the MDM. As soon as a bidirectional synchronization is used (and other integrations might require this), Pandora’s box is open. One can define hard rules, though again, there are always exceptions in which manual interference is needed.

Attempting an integration of reference data is not recommended. ERP systems can have hundreds of such entities. Some organizations tend to have a golden system (a copy of production) with all the reference data. It works for some time, until people realize that the solution is expensive and time-consuming.

MDM systems do make sense in certain scenarios, though to get the integrations right can involve a considerable effort and certain assumptions and requirements must be met.

Previous Post <<||>> Next Post

References:
[1] Piethein Strengholt (2023) Understanding Master Data Management’s Integration Challenges (link)


19 March 2024

📊R Language: Drawing Function Plots (Part II - Basic Curves & Inflection Points)

For a previous post on inflection points I needed a few examples, so I thought I'd write the code in the R language, which I did. Here's the final output:

Examples of Inflection Points

And, here's the code used to generate the above graphic:

par(mfrow = c(2,2)) #2x2 matrix display

# Example A: Inflection point with bifurcation
curve(x^3+20, -3,3, col = "black", main="(A) Inflection Point with Bifurcation")
curve(-x^2+20, 0, 3, add=TRUE, col="blue")
text (2, 10, "f(x)=-x^2+20, [0,3]", pos=1, offset = 1) #label secondary curve
points(0, 20, col = "red", pch = 19) #inflection point 
text (0, 20, "inflection point", pos=1, offset = 1) #label inflection point


# Example B: Inflection point with Up & Down Concavity
curve(x^3-3*x^2-9*x+1, -3,6, main="(B) Inflection point with Up & Down Concavity")
points(1, -10, col = "red", pch = 19) #inflection point 
text (1, -10, "inflection point", pos=4, offset = 1) #label inflection point
text (-1, -10, "concave down", pos=3, offset = 1) 
text (-1, -10, "f''(x)<0", pos=1, offset = 0) 
text (2, 5, "concave up", pos=3, offset = 1)
text (2, 5, "f''(x)>0", pos=1, offset = 0) 


# Example C: Inflection point for multiple curves
curve(x^3-3*x+2, -3,3, col ="black", ylab="x^n-3*x+2, n = 2..5", main="(C) Inflection Point for Multiple Curves")
text (-3, -10, "n=3", pos=1) #label curve
curve(x^2-3*x+2,-3,3, add=TRUE, col="blue")
text (-2, 10, "n=2", pos=1) #label curve
curve(x^4-3*x+2,-3,3, add=TRUE, col="brown")
text (-1, 10, "n=4", pos=1) #label curve
curve(x^5-3*x+2,-3,3, add=TRUE, col="green")
text (-2, -10, "n=5", pos=1) #label curve
points(0, 2, col = "red", pch = 19) #inflection point 
text (0, 2, "inflection point", pos=4, offset = 1) #label inflection point
title("", line = -3, outer = TRUE)


# Example D: Inflection Point with fast change
curve(x^5-3*x+2,-3,3, col="black", ylab="x^n-3*x+2, n = 5,7,9", main="(D) Inflection Point with Slow vs. Fast Change")
text (-3, -100, "n=5", pos=1) #label curve
curve(x^7-3*x+2, add=TRUE, col="green")
text (-2.25, -100, "n=7", pos=1) #label curve
curve(x^9-3*x+2, add=TRUE, col="brown")
text (-1.5, -100, "n=9", pos=1) #label curve
points(0, 2, col = "red", pch = 19) #inflection point 
text (0, 2, "inflection point", pos=3, offset = 1) #label inflection point

mtext("© sql-troubles@blogspot.com @sql_troubles, 2024", side = 1, line = 4, adj = 1, col = "dodgerblue4", cex = .7)
#title("Examples of Inflection Points", line = -1, outer = TRUE)

Mathematically, an inflection point is a point on a smooth (plane) curve at which the curvature changes sign; for a twice-differentiable function the second derivative is 0 at such a point, though a zero second derivative alone is not sufficient [1]. The curvature intuitively measures the amount by which a curve deviates from being a straight line.

In example A, the main function has an inflection point, while the second function, defined only on the interval [0,3], is used to represent a descending curve (aka bifurcation) for which the same point is a maximum point.

In example B, the function was chosen to have both a concave down section (for which the second derivative is negative) and a concave up section (for which the second derivative is positive). So what comes after an inflection point is not necessarily a monotonically increasing function.
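As a quick check for example B, the second derivative can be worked out by hand: for f(x) = x^3-3*x^2-9*x+1 one gets f'(x) = 3*x^2-6*x-9 and f''(x) = 6*x-6, which is negative for x < 1 and positive for x > 1. The sign change confirms the inflection point at (1, f(1)) = (1, -10), the point marked in red in the chart.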

In example C several functions are depicted, obtained by varying the power of the first term, all having the same inflection point. One could have shown only the behavior of the functions after the inflection point, keeping a single function before it (see example A).

Example D considers the same family of functions as example C, with varying powers of the first term, though for higher powers than in example C. I kept the function for n=5 to offer a basis for comparison. At first sight, the strange thing is that around the inflection point the change seems small and linear, which is not the case. The two graphics are nevertheless correct: in D the scale is driven by the n=5 curve, while in C it is driven by n=3 (higher powers stretch the graphic further away from the inflection point). If one adds n=3 as the first function in example D, the new chart will resemble C. Unfortunately, this behavior can be misused to suggest that a curve is roughly linear around the inflection point, which is not the case.

# Example E: Inflection Point with slow vs. fast change extended
curve(x^3-3*x+2,-3,3, col="black", ylab="x^n-3*x+2, n = 3,5,7,9", main="(E) Inflection Point with Slow vs. Fast Change")
text (-3, -10, "n=3", pos=1) #label curve
curve(x^5-3*x+2,-3,3, add=TRUE, col="brown")
text (-2, -10, "n=5", pos=1) #label curve
curve(x^7-3*x+2, add=TRUE, col="green")
text (-1.5, -10, "n=7", pos=1) #label curve
curve(x^9-3*x+2, add=TRUE, col="orange")
text (-1, -5, "n=9", pos=1) #label curve
points(0, 2, col = "red", pch = 19) #inflection point 
text (0, 2, "inflection point", pos=3, offset = 1) #label inflection point

Comments:
(1) I cheated a bit by calculating the second derivative manually, which is an easy task for polynomials. There are methods for calculating the inflection point programmatically (see the sketch after the loop example below), though the focus was on providing the examples.
(2) The examples C and D could have been implemented as part of a loop, though I needed to add the labels for each curve individually anyway. Here's the modified code using a loop:

# Example F: Inflection Point with slow vs. fast change with loop
n <- list(5,7,9)
color <- list("brown", "green", "orange")

curve(x^3-3*x+2,-3,3, col="black", ylab="x^n-3*x+2, n = 3,5,7,9", main="(F) Inflection Point with Slow vs. Fast Change")
for (i in seq_along(n))
{
ind <- as.numeric(n[i])
curve(x^ind-3*x+2,-3,3, add=TRUE, col=toString(color[i]))
}

text (-3, -10, "n=3", pos=1) #label curve
text (-2, -10, "n=5", pos=1) #label curve
text (-1, -5, "n=9", pos=1) #label curve
text (-1.5, -10, "n=7", pos=1) #label curve
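
Regarding comment (1): here's a minimal sketch of how the inflection point of the function from example B could be located programmatically, by deriving the second derivative symbolically with D() and finding where it changes sign with uniroot():

# sketch: locate the inflection point of f(x)=x^3-3*x^2-9*x+1 programmatically
d2 <- D(D(expression(x^3-3*x^2-9*x+1), "x"), "x") #symbolic second derivative: 6*x-6
f2 <- function(x) eval(d2, list(x = x)) #evaluate f''(x) at a given x
xi <- uniroot(f2, c(-3, 6))$root #f'' changes sign within [-3, 6]
c(x = xi, y = xi^3-3*xi^2-9*xi+1) #~ (1, -10), the inflection point from example B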

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Wikipedia (2023) Inflection point (link)

𖣯Strategic Management: Inflection Points and the Data Mesh (Quote of the Day)

Strategic Management
Strategic Management Series

"Data mesh is what comes after an inflection point, shifting our approach, attitude, and technology toward data. Mathematically, an inflection point is a magic moment at which a curve stops bending one way and starts curving in the other direction. It’s a point that the old picture dissolves, giving way to a new one. [...] The impacts affect business agility, the ability to get value from data, and resilience to change. In the center is the inflection point, where we have a choice to make: to continue with our existing approach and, at best, reach a plateau of impact or take the data mesh approach with the promise of reaching new heights." [1]

I tried to understand the "metaphor" behind the quote. As the author pinpoints through another quote, the metaphor is borrowed from Andrew Grove:

"An inflection point occurs where the old strategic picture dissolves and gives way to the new, allowing the business to ascend to new heights. However, if you don’t navigate your way through an inflection point, you go through a peak and after the peak the business declines. [...] Put another way, a strategic inflection point is when the balance of forces shifts from the old structure, from the old ways of doing business and the old ways of competing, to the new. Before" [2]

The second part of the quote clarifies the role of the inflection point - the shift from one structure, respectively organization or system, to a new one. The inflection point is not when we take a decision, but when the decision we took and its impact shift the balance. If the data mesh comes after the inflection point (see A), then there must be some kind of causality that converges uniquely toward the data mesh, which is questionable, if not illogical. A data mesh eventually makes sense after organizations reach a certain scale, and thus it is unlikely to be adopted by small to medium businesses. Even for large organizations the data mesh may not be a viable solution if it doesn't have a proven record of success.

I could understand if the author had said that the data mesh will lead to an inflection point after its adoption, as is the case with transformative/disruptive technologies. Unfortunately, the track record of BI and Data Analytics projects doesn't give much hope for such a magical moment to happen. Probably, becoming a data-driven organization could have such an effect, though for many organizations the effects are still far from expectations.

There's another point to consider. A curve with inflection points can contain both concave up and concave down sections (see B), or there can be multiple curves passing through an inflection point (see C), and the continuation can be on any of the curves.

Examples of Inflection Points [3]

The change can be fast or slow (see D), and in the latter case it may take a long time for the change to be perceived. Also, [2] notes that the perception that something changed can happen in stages. Moreover, the inflection point can be only local and doesn't describe the future evolution of the curve, which is to say that the curve can change its trajectory shortly after it. It happens in business processes and policy implementations that after a change was made in extremis to alleviate an issue, a slight improvement is recognized, after which the performance decays sharply. It's the case of situations in which the symptoms and not the root causes were addressed.

More appropriate for describing the change would be a tipping point, which can be defined as a critical threshold beyond which a system (the organization) reorganizes/changes, often abruptly and/or irreversibly.

Previous Post <<||>> Next Post

References:
[1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
[2] Andrew S Grove (1996) "Only the Paranoid Survive: How to Exploit the Crisis Points that Challenge Every Company and Career"
[3] SQL Troubles (2024) R Language: Drawing Function Plots (Part II - Basic Curves & Inflection Points) (link)

18 March 2024

♟️Strategic Management: Strategy [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. 
Last updated: 18-Mar-2024

Strategy

  • {definition} "the determination of the long-term goals and objectives of an enterprise, and the adoption of courses of action and the allocation of resources necessary for carrying out these goals" [4]
  • {goal} bring all tools and insights together to create an integrative narrative about what the  organization should do moving forward [1]
  • a good strategy emerges out of the values, opportunities and capabilities of the organization [1]
    • {characteristic} robust
    • {characteristic} flexible
    • {characteristic} needs to embrace the uncertainty and complexity of the world
    • {characteristic} fact-based and informed by research and analytics
    • {characteristic} testable
  • {concept} strategy analysis 
    • {definition} the assessment of an organization's current competitive position and the identification of future valuable competitive positions and how the firm plans to achieve them [1]
      • done from a general perspective
        • in terms of different functional elements within the organization [1]
        • in terms of being integrated across different concepts and tools and frameworks [1]
      • a good strategic analysis integrates various tools and frameworks that are in our strategist toolkit [1]
    • approachable in terms of 
      • dynamics
      • complexity
      • competition
    • {step} identify the mission and values of the organization
      • critical for understanding what the firm values and how that may influence which opportunities they look for and what actions they might be willing to take
    • {step} analyze the competitive environment
      • looking at what opportunities the environment provides and how competitors are likely to react
    • {step} analyze competitive positions
      • think about what the own capabilities are and how they might relate to the opportunities that are available
    • {step} analyze and recommend strategic actions 
      • actions for future improvement
        • {question} how do we create more value?
        • {question} how can we improve our current competitive position?
        • {question} how can we, in essence, create more value in our competitive environment?
      • alternatives
        • scaling the business
        • entering new markets
        • innovating
        • acquiring a competitor/another player within a market segment of interest
      • recommendations
        • {question} what do we recommend doing going forward?
        • {question} what are the underlying assumptions of these recommendations?
        • {question} do they meet our tests that we might have for providing value?
        • move from analysis to action
          • actions come from asking a series of questions about what opportunities, what actions can we take moving forward
    • {step} strategy formulation
    • {step} strategy implementation
  • {tool} competitor analysis
    • {question} what market is the firm in, and who are the players in these markets? 
  • {tool} environmental analysis
    • {benefit} provides a picture on the broader competitive environment
    • {question} what are the major trends impacting this industry?
    • {question} are there changes in the sociopolitical environment that are going to have important implications for this industry?
    • {question} is this an attractive market, and are there barriers to competition?
  • {tool} five forces analysis
    • {benefit} provides an overview of the market structure/industry structure
    • {benefit} helps understand the nature of the competitive game that we are playing as we then devise future strategies [1]
      • provides a dynamic perspective in our understanding of a competitive market
    • {question} how's the competitive structure in a market likely to evolve?
  • {tool} competitive life cycle analysis
  • {tool} SWOT (strengths, weaknesses, opportunities, threats) analysis
  • {tool} stakeholder analysis
    • {benefit} valuable in trying to understand the mission and values, as well as others' expectations of the firm
  • {tool} capabilities analysis
    • {question} what are the firm's unique resources and capabilities?
    • {question} how sustainable is any advantage that these assets provide?
  • {tool} portfolio planning matrix
    • {benefit} helps understand how the firm might leverage these assets across markets, so as to improve its position in any given market
    • {question} how should we position ourselves in the market relative to our rivals?
  • {tool} capability analysis
    • {benefit} understand what the firm does well and see what opportunities they might ultimately want to attack and go after in terms of these valuable competitive positions
      • via Strategy Maps and Portfolio Planning matrices
  • {tool} hypothesis testing
    • {question} how are competitors likely to react to these actions?
    • {question} does it make sense in the future worlds we envision?
    • [game theory] payoff matrices can be useful to understand what actions might be taken by various competitors within an industry
  • {tool} scenario planning
    • {benefit} helps us envision future scenarios and then work back to understand what are the actions we might need to take in those various scenarios if they play out.
    • {question} does it provide strategic flexibility?
  • {tool} real options analysis 
    • highlights the desire to have strategic flexibility or at least the value of strategic flexibility provides
  • {tool} acquisition analysis
    • {benefit} helps understand the value of certain actions versus others
    • {benefit} useful as an understanding of opportunity costs for other strategic investments one might make
    • focused on mergers and acquisitions
  • {tool} If-Then thinking
    • sequential in nature
      • different from causal logic
        • commonly used in network diagrams, flow charts, Gantt charts, and computer programming
  • {tool} Balanced Scorecard
    • {definition} a framework to look at the strategy used for value creation from four different perspectives [5]
      • {perspective} financial 
        • {scope} the strategy for growth, profitability, and risk viewed from the perspective of the shareholder [5]
        • {question} what are the financial objectives for growth and productivity? [5]
        • {question} what are the major sources of growth? [5]
        • {question} If we succeed, how will we look to our shareholders? [5]
      • {perspective} customer
        • {scope} the strategy for creating value and differentiation from the perspective of the customer [5]
        • {question} who are the target customers that will generate revenue growth and a more profitable mix of products and services? [5]
        • {question} what are their objectives, and how do we measure success with them? [5]
      • {perspective} internal business processes
        • {scope} the strategic priorities for various business processes, which create customer and shareholder satisfaction [5] 
      • {perspective} learning and growth 
        • {scope} defines the skills, technologies, and corporate culture needed to support the strategy. 
          • enable a company to align its human resources and IT with its strategy
      • {benefit} enables the strategic hypotheses to be described as a set of cause-and-effect relationships that are explicit and testable [5]
        • require identifying the activities that are the drivers (or lead indicators) of the desired outcomes (lag indicators)  [5]
        • everyone in the organization must clearly understand the underlying hypotheses, to align resources with the hypotheses, to test the hypotheses continually, and to adapt as required in real time [5]
    • {tool} strategy map
      • {definition} a visual representation of a company’s critical objectives and the crucial relationships that drive organizational performance [2]
        • shows the cause-and-effect links by which specific improvements create desired outcomes [2]
      • {benefit} shows how an organization will convert its initiatives and resources - including intangible assets such as corporate culture and employee knowledge - into tangible outcomes [2]
    • {component} mission
      • {question} why we exist?
    • {component} core values
      • {question} what we believe in?
      • ⇐ mission and the core values  remain fairly stable over time [5]
    • {component} vision
      • {question} what we want to be?
      • paints a picture of the future that clarifies the direction of the organization [5]
        • helps individuals to understand why and how they should support the organization [5]
    Previous Post <<||>> Next Post

    References:
    [1] University of Virginia (2022) Strategic Planning and Execution (MOOC, Coursera)
    [2] Robert S Kaplan & David P Norton (2000) Having Trouble with Your Strategy? Then Map It (link)
    [3] Harold Kerzner (2001) Strategic planning for project management using a project management maturity model
    [4] Alfred D Chandler Jr. (1962) "Strategy and Structure"
    [5] Robert S Kaplan & David P Norton (2000) The Strategy-focused Organization: How Balanced Scorecard Companies Thrive in the New Business Environment

    17 March 2024

    🧭Business Intelligence: Data Products (Part II: The Complexity Challenge)

    Business Intelligence
    Business Intelligence Series

    Creating data products within a data mesh comes down to "partitioning" a given set of inputs, outputs and transformations to create something that looks like a Lego structure, in which each Lego piece represents a data product. The word partition is used improperly, as there can be overlaps in terms of inputs, outputs and transformations, though in an ideal solution the outcome should be close to a partition.

    Even if the complexity of the inputs and outputs can be neglected, even when their number is large, the same cannot be said about the transformations that must be performed in the process. Moreover, the transformations involve reengineering the logic built into the source systems, which is not a trivial task and must involve adequate testing. The transformations are a must and there's no way to avoid them.

    When designing a data warehouse or data mart, one of the goals is to keep the redundancy of the transformations and of the intermediary results to a minimum, to avoid the unnecessary duplication of code and data. Code duplication usually becomes an issue when the logic needs to be changed, and in business contexts that can happen often enough to create other challenges. Data duplication becomes an issue when the copies are not in sync, which results from unsynchronized code or different refresh rates.

    Building the transformations as SQL-based database objects has its advantages. There were many attempts at providing non-SQL operators for the same purpose (in SSIS, Power Query), though the solutions built on them are difficult to troubleshoot and maintain, the overall complexity increasing with the volume of transformations that must be performed. In data meshes, the complexity also increases with the number of data products involved, especially when there are multiple stakeholders and different goals involved (see the challenges for developing data marts supposed to be domain-specific).

    Organizations answer growing complexity with more complexity. On one side there are the teams of developers, business users and other members of the governance bodies who, together with the solution, create an ecosystem. On the other side, there are the inherent coordination and organization meetings, the managing of proposals, the negotiation of scope for data products, their design, testing, etc. The more complex the whole ecosystem becomes, the higher the chances for systemic errors to occur and multiply, respectively to create unwanted behavior among the parties involved. Ecosystems are challenging to monitor and manage.

    The more complex the architecture, the higher the chances for failure. Even if some organizations might succeed, it doesn't mean that such an endeavor is for everybody - a certain maturity in building data architectures, data-based artefacts and managing projects must exist in the organization. Many organizations fail at addressing basic analytical requirements, so why would one think they are capable of handling increased complexity? Even if one breaks the complexity of a data warehouse into more manageable units, the complexity is just moved to other levels that are more difficult to manage as an ensemble.

    Being able to audit and test each data product individually has its advantages, though when a data product becomes part of an aggregate it can easily get lost in the bigger picture. Thus, a global observability framework is needed that allows monitoring the performance and health of each data product within the aggregate. Besides that, event brokers and other mechanisms are needed to handle failure, availability, security, etc.

    Data products make sense in certain scenarios, especially when the complexity of architectures is manageable, though attempting to redesign everything from their perspective is like having a hammer in one's hand and treating everything like a nail.

    Previous Post <<||>> Next Post

    🧭Business Intelligence: Data Products (Part I: A Lego Exercise)

    Business Intelligence
    Business Intelligence Series

    One can define a data product as the smallest unit of data-driven architecture that can be independently deployed and managed (aka product quantum) [1]. In other terms, one can think of a data product as a box (or Lego piece) which takes data as input, performs several transformations on it, and produces output data (or even data visualizations, or a hybrid between data, visualizations and other content).

    At a high level, each Data Analytics solution can be regarded as a set of inputs, a set of outputs, and the transformations that must be performed on the inputs to generate the outputs. The inputs are the data from the operational systems, while the outputs are analytical data that can be anything from raw data to KPIs and other metrics. A data mart, data warehouse, lakehouse or data mesh can be abstracted in this way, though at different scales.

    For creating data products within a data mesh, given a set of inputs, outputs and transformations, the challenge is to find horizontal and vertical partitions within these areas to create something that looks like a Lego structure, in which each piece of Lego represents a data product, while its color represents the membership to a business domain. Each such piece is self-contained and contains a set of transformations, respectively intermediary inputs and outputs. Multiple such pieces can be combined in a linear or hierarchical fashion to transform the initial inputs into the final outputs. 

    Data Products with a Data Mesh

    Finding such a partition is possible, though it involves a considerable effort, especially in designing the whole thing - identifying each Lego piece uniquely. When each department is on its own and develops its own Lego pieces, there's no guarantee that the pieces from the various domains will fit together to build something cohesive, performant, secure or well-structured. It's like building a house from modules: the pieces must fit together. That would be the role of governance (federated computational governance) - to align and coordinate the effort.

    Conversely, there are transformations that need to be replicated to obtain autonomous data products, and the volume of such overlapping can be considerably high. Consider for example the logic available in reports and how often it needs to be replicated. Alternatively, one can create intermediary data products, when that's feasible.

    It's challenging to define the inputs and outputs for a Lego piece. Now imagine doing the same for a whole set of such pieces that depend on each other! This might work for small pieces of data and entities quite stable over their lifetime (e.g. playlists, artists, songs), but with complex information systems the effort can increase by a few factors. Moreover, the complexity of the structure increases as soon as the Lego pieces expand beyond their initial design. It's as if the real Lego pieces would grow within the available space while still keeping the initial structure - strange constructs may result, which, even if they work, shift the center of gravity of the edifice in other directions. There will thus be limits to growth, which can easily lead to duplication of functionality to overcome such challenges.

    Each new output or change to the initial inputs of these magic boxes involves a change to all the intermediary Lego pieces from input to output. Just recall the last experience of defining the inputs and the outputs for an important complex report - how many iterations and how much effort were involved. That might have been an extreme case, though how realistic is the assumption that with data products everything will go smoother? No matter the effort invested in design, there will always be changes and further iterations involved.

    Previous Post <<||>> Next Post

    References:
    [1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)

    16 March 2024

    🧭Business Intelligence: A Software Engineer's Perspective (Part VII: Think for Yourself!)

    Business Intelligence
    Business Intelligence Series

    After almost a quarter-century of professional experience the best advice I could give to younger professionals is to "gather information and think for themselves", and with this the reader can close the page and move forward! Anyway, everybody seems to be looking for sudden enlightenment with minimal effort, as if the effort has no meaning in the process!

    In whatever endeavor you are caught up, it makes sense to do a bit of thinking for yourself upfront - what's the task, or more generally the problem; what are the main aspects and interpretations; what are the goals, respectively the objectives; how might a solution look, respectively how can it be solved; how long could it take, etc. This exercise is important for familiarizing yourself with the problem and creating a skeleton on which you can build further. It can be just vague ideas or something more complex, though no matter the overall depth, it's important to do some thinking for yourself!

    Then, you should do some research to identify how others approached and maybe solved the problem, what were the justifications, assumptions, heuristics, strategies, and other tools used in sense-making and problem solving. When doing research, one should not stop with the first answer and go with it. It makes sense to allocate a fair amount of time for information gathering, structuring the findings in a reusable way (e.g. tables, mind maps or other tools used for knowledge mapping), and looking at the problem from the multiple perspectives derived from them. It's important to gather several perspectives, otherwise the decisions have a high chance of being biased. Just because others preferred a certain approach, it doesn't mean one should follow it, at least not blindly!

    The purpose of research is multifold. First, one should try not to reinvent the wheel. I know, it can be fun, and a lot can be learned in the process, though when time is an important commodity, it's important to be pragmatic! Secondly, new information can provide new perspectives - one can learn a lot from other people’s thinking. The pragmatism of problem solvers should be combined, when possible, with the idealism of theories. Thus, one can make connections between ideas that aren't connected at first sight.

    Once a good share of facts is gathered, you can review the new information with respect to the previous findings and devise from there several approaches worth pursuing. Once the facts are reviewed, there are probably strong arguments made by others to follow one approach over the others. However, one shows maturity when one is able to evaluate the information and take a decision based on it, even if the decision is far from perfect.

    One should try to develop a feeling for decision making, even if this seems to be more of a gut feeling and stressful at times. When possible, one should attempt to collect and/or use data, though collecting data is often a luxury that tends to postpone the decision making, respectively be misused by people just to confirm their biases. Conversely, if there's any important benefit associated with it, one can collect data to validate one's decision over time, though that's more of a scientist's approach.

    I know that it's easier to go with the general opinion and do what others advise, especially when some ideas are popular and/or come from experts, though that would also mean following others' mistakes and biases. Occasionally, that can be acceptable, especially when the impact is negligible; however, each decision we are confronted with is an opportunity to learn something, to make a difference!

    Previous Post <<||>> Next Post

    15 March 2024

    🧊🗒️Data Warehousing: Data Mesh [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources. 

    Last updated: 17-Mar-2024

    Data Products with a Data Mesh

    Data Mesh
    • {definition} "a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations" [1]
      • ⇐ there is no default standard or reference implementation of data mesh and its components [2]
    • {definition} a type of decentralized data architecture that organizes data based on different business domains [2]
      • ⇐ no centralized data architecture coexists with data mesh, unless in transition [1]
      • distributes the modeling of analytical data, the data itself and its ownership [1]
    • {characteristic} partitions data around business domains and gives data ownership to the domains [1]
      • each domain can model their data according to their context [1]
      • there can be multiple models of the same concept in different domains [1]
      • gives the data sharing responsibility to those who are most intimately familiar with the data [1]
      • endorses multiple models of the data
        • data can be read from one domain, transformed and stored by another domain [1]
    • {characteristic} evolutionary execution process
    • {characteristic} agnostic of the underlying technology and infrastructure [1]
    • {aim} respond gracefully to change [1]
    • {aim} sustain agility in the face of growth [1]
    • {aim} increase the ratio of value from data to investment [1]
    • {principle} data as a product
      • {goal} business domains become accountable to share their data as a product to data users
      • {goal} introduce a new unit of logical architecture that controls and encapsulates all the structural components needed to share data as a product autonomously [1]
      • {goal} adhere to a set of acceptance criteria that assure the usability, quality, understandability, accessibility and interoperability of data products*
      • usability characteristics
    • {principle} domain-oriented ownership
      • {goal} decentralize the ownership of sharing analytical data to business domains that are closest to the data [1]
      • {goal} decompose logically the data artefacts based on the business domain they represent and manage their life cycle independently [1]
      • {goal} align business, technology and analytical data [1]
    • {principle} self-serve data platform
      • {goal} provide a self-serve data platform to empower domain-oriented teams to manage and govern the end-to-end life cycle of their data products* [1]
      • {goal} streamline the experience of data consumers to discover, access, and use the data products [1]
    • {principle} federated computational governance
      • {goal} implement a federated decision making and accountability structure that balances the autonomy and agility of domains, while respecting the global conformance, interoperability and security of the mesh* [1]
      • {goal} codifying and automated execution of policies at a fine-grained level [1]
      • ⇐ the principles represent a generalization and adaptation of practices that address the scale of organization digitization* [1]
    • {concept} decentralization of data products
      • {requirement} ability to compose data across different modes of access and topologies [1]
        • data needs to be agnostic to the syntax of data, underlying storage type, and mode of access to it [1]
          • many of the existing composability techniques that assume homogeneous data won’t work
            • e.g.  defining primary and foreign key relationships between tables of a single schema [1]
      • {requirement} ability to discover and learn what is relatable and decentral [1]
      • {requirement} ability to seamlessly link relatable data [1]
      • {requirement} ability to relate data temporally [1]
    • {concept} data product 
      • the smallest unit of data-based architecture that can be independently deployed and managed (aka product quantum) [1]
      • provides a set of explicitly defined data sharing contracts
      • provides a truthful portion of the reality for a particular domain (aka single slice of truth) [1]
      • constructed in alignment with the source domain [3]
      • {characteristic} autonomous
        • its life cycle and model are managed independently of other data products [1]
      • {characteristic} discoverable
        • via a centralized registry or catalog that lists the available datasets with some additional information about each dataset, the owners, the location, sample data, etc. [1]
      • {characteristic} addressable
        • via a permanent and unique address to the data user to programmatically or manually access it [1] 
      • {characteristic} understandable
        • involves getting to know the semantics of its underlying data and the syntax in which the data is encoded [1]
        • describes which entities it encapsulates, the relationships between them, and their adjacent data products [1]
      • {characteristic} trustworthy and truthful
        • represents the fact of the business correctly [1]
        • provides data provenance and data lineage [1]
      • {characteristic} natively accessible
        • make it possible for various data users to access and read its data in their native mode of access [1]
        • meant to be broadcast and shared widely [3]
      • {characteristic} interoperable and composable
        • follows a set of standards and harmonization rules that allow linking data across domains easily [1]
      • {characteristic} valuable on its own
        • must have some inherent value for the data users [1]
      • {characteristic} secure
        • the access control is validated by the data product, right in the flow of data, access, read, or write [1] 
          • ⇐ the access control policies can change dynamically
      • {characteristic} multimodal 
        • there is no definitive 'right way' to create a data product, nor is there a single expected form, format, or mode that it is expected to take [3] 
      • shares its logs, traces, and metrics while consuming, transforming, and sharing data [1]
      • {concept} data quantum (aka product data quantum, architectural quantum) 
        • unit of logical architecture that controls and encapsulates all the structural components needed to share a data product [1]
          • {component} data
          • {component} metadata
          • {component} code
          • {component} policies
          • {component} dependencies' listing
      • {concept} data product observability
        • monitor the operational health of the mesh
        • debug and perform postmortem analysis
        • perform audits
        • understand data lineage
      • {concept} logs 
        • immutable, timestamped, and often structured events that are produced as a result of processing and the execution of a particular task [1]
        • used for debugging and root cause analysis
      • {concept} traces
        • records of causally related distributed events [1]
      • {concept} metrics
        • objectively quantifiable parameters that continue to communicate build-time and runtime characteristics of data products [1]
    • artefacts 
      • e.g. data, code, metadata, policies

    References:
    [1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
    [2] Zhamak Dehghani (2019) How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh (link)
    [3] Adam Bellemare (2023) Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures

    14 March 2024

    🧭Business Intelligence: Architecture (Part I: Monolithic vs. Distributed and Zhamak Dehghani's Data Mesh - Debunked)

    Business Intelligence
    Business Intelligence Series

    In [1] the author categorizes data warehouses (DWHs) and lakes as monolithic architectures, as opposed to the data mesh's distributed architecture, which makes me circumspect about the term's use. There are two general definitions of what monolithic means: (1) formed of a single large block; (2) large, indivisible, and slow to change.

    In software architecture one can differentiate between monolithic applications, where the whole application is one block of code; multi-tier applications, where the logic is split over several components with different functions that may reside on the same machine or be split non-redundantly between multiple machines; and distributed applications, where the application or its components run on multiple machines in parallel.

    Distributed multi-tier applications are a natural evolution of the two types of applications, allowing components to be distributed redundantly across multiple machines. Much later came the cloud, where components are mostly entirely distributed within the same or across distinct geo-locations, respectively cloud providers.

    Data Warehouse vs. Data Lake vs. Lakehouse [2]

    For licensing and maintenance convenience, a DWH typically resides on one powerful machine with many cores, though components can be moved to other machines and even distributed, the ETL functionality being probably the best candidate for this. In what concerns the overall schema, there can be two or more data stores with different purposes (operational/transactional data stores, data marts), each of them with its own schema. Each such data store could be moved to its own machine, though that's not always feasible.

    DWHs tend to be large because they need to accommodate a considerable number of tables where data is extracted, transformed, and maybe dumped for the various needs. With the proper design, DWHs can also be partitioned into domains (e.g. by defining one schema for each domain) and model domain-based perspectives, at least from a data consumer's perspective. The advantage a DWH offers is that one can create general dimensions and fact tables and build the domain-based perspectives on top of them, thus minimizing the redundancy of code and reducing the costs.

    With this type of design, the DWH can be changed when needed; however, there are several aspects to consider. First, it takes time until the development team can process the request, and this depends on the workload and the priorities set. Secondly, implementing the changes will take a fair amount of time no matter the overall architecture used, given that the transformations that need to be done on the data are largely the same. Therefore, one should not confuse the speed with which a team can start working on a change with the actual implementation of the change. Third, the possibility of reusing existing objects can speed up the implementation of changes.

    Data lakes are distributed data repositories in which structured, unstructured and semi-structured data are dumped in raw form in standard file formats from the various sources and further prepared for consumption in other data files via data pipelines, notebooks and similar means. One can use the medallion architecture with a folder structure and adequate permissions for domains and build reports and other data artefacts on top. 

    A data lake's value increases when it is combined with the capabilities of a DWH (see dedicated SQL server pool) and/or an analytics engine (see serverless SQL pool) that allow(s) building an enterprise semantic model on top of the data lake. The result is a data lakehouse that, from the data consumer's perspective and in the other aspects mentioned above, is not much different from the DWH. The resulting architecture is distributed too.

    Especially in the context of cloud computing, referring to nowadays applications metaphorically (for advocative purposes) as monolithic or distributed is at most a matter of degree and not of distinction. Therefore, the reader should be careful!

    Previous Post <<||>> Next Post

    References:
    [1] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale (book review)
    [2] Databricks (2022) Data Lakehouse (link)

    13 March 2024

    🔖Book Review: Zhamak Dehghani's Data Mesh: Delivering Data-Driven Value at Scale (2021)

    Zhamak Dehghani's "Data Mesh: Delivering Data-Driven Value at Scale" (2021)

    Zhamak Dehghani's "Data Mesh: Delivering Data-Driven Value at Scale" (2021) is a must read book for the data professional. So, here I am, finally managing to read it and give it some thought, even if it will probably take more time and a few more reads for the ideas to grow. Working in the fields of Business Intelligence and Software Engineering for almost a quarter-century, I think I can understand the historical background and the direction of the ideas presented in the book. There are many good ideas but also formulations that make me circumspect about the applicability of some assumptions and requirements considered. 

    So, after data marts, warehouses, lakes and lakehouses, the data mesh paradigm seems to be the new shiny thing that will bring organizations beyond the inflection point, with tipping potential, from where the organization's growth will have an exponential effect. At least this seems to be the first impression when reading the first chapters.

    The book follows to some degree the advocative tone of promoting that "our shiny thing is much better than previous thing", or "how bad the previous architectures or paradigms were and how good the new ones are" (see [2]). Architectures and paradigms evolve with the available technologies and our perception of what is important for businesses. Old and new have their place in the order of things, and the old will continue to exist, at least until the new proves its feasibility.  

    The definition of the data mesh as "a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments - within or across organizations" [1] is too abstract, even if it reflects at a high level what the concept is about. Compared to other material I read on the topic, the book succeeds in explaining the related concepts as well as the goals (called definitions) and benefits (called motivations) associated with the principles behind the data mesh, making the book approachable also for non-professionals.

    Built around four principles "data as a product", "domain-oriented ownership", "self-serve data platform" and "federated governance", the data mesh is the paradigm on which data as products are developed; where the products are "the smallest unit of architecture that can be independently deployed and managed", providing by design the information necessary to be discovered, understood, debugged, and audited.

    It's possible to create Lego-like data products, data contracts and/or manifests that address product's usability characteristics, though unless the latter are generated automatically, put in the context of ERP and other complex systems, everything becomes quite an endeavor that requires time and adequate testing, increasing the overall timeframe until a data product becomes available. 

    The data mesh describes data products in terms of microservices, which structure architectures as a collection of services that are independently deployable and loosely coupled. Asking data products to behave in this way is probably too hard a constraint, given the complexity and interdependency of the data models behind business processes and their needs. Does all the effort make sense? Is this the "agility" the data mesh solutions are looking for?

    Many pioneering organizations are still fighting with the concept of data mesh, as it proves to be challenging to implement. At a high level everything makes sense, but the way data products are expected to function makes the concept challenging to implement to the full extent. Moreover, as occasionally implied, the data mesh is about scaling data analytics solutions with the size and complexity of organizations. The effort makes sense when organizations have a certain size and the departments a certain autonomy; therefore, it might not apply to small to medium businesses.

    Previous Post <<||>>  Next Post

    References:
    [1] Zhamak Dehghani (2021) "Data Mesh: Delivering Data-Driven Value at Scale" (link)
    [2] SQL-troubles (2024) Zhamak Dehghani's Data Mesh - Monolithic Warehouses and Lakes (link)

    12 March 2024

    🕸Systems Engineering: A Play of Problems (Much Ado about Nothing)

    Disclaimer: This post was created just for fun. No problem was hurt or solved in the process! 
    Updated: 12-Jun-2024

    On Problems

    Everybody has at least a problem. If somebody doesn’t have a problem, he’ll make one. If somebody can't make a problem, he can always find a problem. One doesn't need to search long to find a problem. Looking for a problem, one sees more problems.

    Not having a problem can easily become a problem. It’s better to have a problem than none. The none problem is undefinable, which makes it a problem. 

    Avoiding a problem might lead you to another problem. Some problems are so old that it's easier to ignore them. 

    In every big problem there’s a small problem trying to come out. Most problems can be reduced to smaller problems. A small problem may hide a bigger problem. 

    It’s better to solve a problem while it's still small; however, problems can often be perceived only once they grow big enough. 

    In the neighborhood of a problem there’s another problem getting closer. Problems tend to attract each other. 

    Between two problems there’s enough place for a third to appear. The shortest path between two problems is another problem. 

    Two problems that appear together in successive situations might be the parts of the same problem. 

    A problem is more than the sum of its parts.

    Any problem can be simplified to the degree that it becomes another problem. 

    The complement of a problem is another problem. At the intersection/union of two problems lies another problem.

    The inverse of a problem is another problem more complex than the initial problem.

    Defining a problem correctly is another problem. A known problem doesn’t make one problem less. 

    When a problem seems to be enough, a second appears. A problem never comes alone.  The interplay of the two problems creates a third.

    Sharing the problems with somebody else just multiplies the number of problems. 

    Problems multiply beyond necessity. Problems multiply beyond our expectations. Problems multiply faster than we can solve them. 

    Having more than one problem is for many already too much. Between many big problems and an infinity of problems there seems to be no big difference. 

    Many small problems can converge toward a bigger problem. Many small problems can also diverge toward two bigger problems. 

    When neighboring problems exist, people tend to isolate them. Isolated problems tend to find other ways to surprise.

    Several problems can aggregate into bigger problems that tend to suck in the neighboring ones.

    If one waits long enough, some problems will solve themselves or they will get bigger. Bigger problems exceed one's area of responsibility. 

    One can get credit for a self-created problem. It takes only a good problem to become famous.

    A good problem can provide for a lifetime. A good problem has the tendency to kick back where it hurts the most. One can fall in love with a good problem. 

    One should not theorize before one has a (good) problem. A problem can lead to a new theory, while a theory brings with it many more problems. 

    If the only tool you have is a hammer, every problem will look like a nail. (paraphrasing Abraham H Maslow)

    Any field of knowledge can be covered by a set of problems. A field of knowledge should be learned by the problems it poses.

    A problem thoroughly understood is always fairly simple, but unfairly complex. (paraphrasing Charles F Kettering)

    The problem solver usually created the problem. 

    Problem Solving

    Break a problem in two to solve it easier. Finding how to break a problem is already another problem. Deconstructing a problem to its parts is no guarantee for solving the problem.

    Every problem has at least two solutions from which at least one is wrong. It’s easier to solve the wrong problem. 

    It’s easier to solve a problem if one knows the solution already. Knowing a solution is not a guarantee for solving the problem.

    Sometimes a problem disappears faster than one can find a solution. 

    If a problem has two solutions, more likely a third solution exists. 

    Solutions can be used to generate problems. The design of a problem seldom lies in its solutions. 

    The solution of a problem can create at least one more problem. 

    One can solve only one problem at a time. 

    Unsolvable problems lead to problematic approximations. There's always a better approximation, one just needs to find it. One needs to know when to stop searching for an approximation. 

    There's not only a single way for solving a problem. Finding another way for solving a problem provides more insight into the problem. More insight complicates the problem unnecessarily. 

    Solving a problem is a matter of perspective. Finding the right perspective is another problem.

    Solving a problem is a matter of tools. Searching for the right tool can be a laborious process. 

    Solving a problem requires a higher level of consciousness than the level that created it. (see Einstein) With the increased complexity of problems, one can run out of consciousness.

    Trying to solve an old problem creates resistance against its solution(s). 

    The premature optimization of a problem is the root of all evil. (paraphrasing Donald Knuth)

    A great discovery solves a great problem but creates a few others on its way. (paraphrasing George Polya)

    Solving the symptoms of a problem can prove more difficult than solving the problem itself.

    A master is a person who knows the solutions to his problems. To learn the solutions to others' problems he needs a pupil. 

    "The final test of a theory is its capacity to solve the problems which originated it." (George Dantzig) It's easier to theorize if one has a set of problems.

    A problem is defined as a gap between where you are and where you want to be, though nobody knows exactly where he is or wants to be.

    Complex problems are the problems that persist - so are minor ones.

    "The problems are solved, not by giving new information, but by arranging what we have known since long." (Ludwig Wittgenstein, 1953) Some people are just lost in rearranging. 

    Solving problems is a practical skill, but impractical endeavor. (paraphrasing George Polya) 

    "To ask the right question is harder than to answer it." (Georg Cantor) So most people avoid asking the right question.

    Solve more problems than you create.

    They Said It

    "A great many problems do not have accurate answers, but do have approximate answers, from which sensible decisions can be made." (Berkeley's Law)

    "A problem is an opportunity to grow, creating more problems. [...] most important problems cannot be solved; they must be outgrown." (Wayne Dyer)

    "A system represents someone's solution to a problem. The system doesn't solve the problem." (John Gall, 1975)

    "As long as a branch of science offers an abundance of problems, so long is it alive." (David Hilbert)

    "Complex problems have simple, easy to understand, wrong answers." [Grossman's Misquote]

    "Every solution breeds new problems." [Murphy's laws]

    "Given any problem containing n equations, there will be n+1 unknowns." [Snafu]

    "I have not seen any problem, however complicated, which, when you looked at it in the right way, did not become still more complicated." (Paul Anderson)

    "If a problem causes many meetings, the meetings eventually become more important than the problem." (Hendrickson’s Law)

    "If you think the problem is bad now, just wait until we’ve solved it." (Arthur Kasspe) [Epstein’s Law]

    "Inventing is easy for staff outfits. Stating a problem is much harder. Instead of stating problems, people like to pass out half- accurate statements together with half-available solutions which they can't finish and which they want you to finish." [Katz's Maxims]

    "It is better to do the right problem the wrong way than to do the wrong problem the right way." (Richard Hamming)

    "Most problems have either many answers or no answer. Only a few problems have a single answer." [Berkeley's Law]

    "Problems worthy of attack prove their worth by fighting back." (Piet Hein)

    Rule of Accuracy: "When working toward the solution of a problem, it always helps if you know the answer."
    Corollary: "Provided, of course, that you know there is a problem."

    "Some problems are just too complicated for rational logical solutions. They admit of insights, not answers." (Jerome B Wiesner, 1963)

    "Sometimes, where a complex problem can be illuminated by many tools, one can be forgiven for applying the one he knows best." [Screwdriver Syndrome]

    "The best way to escape from a problem is to solve it." (Brendan Francis)

    "The chief cause of problems is solutions." [Sevareid's Law]

    "The first step of problem solving is to understand the existing conditions." (Kaoru Ishikawa)

    "The human race never solves any of its problems, it only outlives them." (David Gerrold)

    "The most fruitful research grows out of practical problems."  (Ralph B Peck)

    "The problem-solving process will always break down at the point at which it is possible to determine who caused the problem." [Fyffe's Axiom]

    "The worst thing you can do to a problem is solve it completely." (Daniel Kleitman)

    "The easiest way to solve a problem is to deny it exists." (Isaac Asimov)

    "The solution to a problem changes the problem." [Peers's Law]

    "There is a solution to every problem; the only difficulty is finding it." [Evvie Nef's Law]

    "There is no mechanical problem so difficult that it cannot be solved by brute strength and ignorance. [William's Law]

    "Today's problems come from yesterday’s 'solutions'." (Peter M Senge, 1990)

    "While the difficulties and dangers of problems tend to increase at a geometric rate, the knowledge and manpower qualified to deal with these problems tend to increase linearly." [Dror's First Law]

    "You are never sure whether or not a problem is good unless you actually solve it." (Mikhail Gromov)

    More quotes on Problem solving at QuotableMath.blogpost.com.

    Resources:
    Murphy's laws and corollaries (link)

    🏭🗒️Microsoft Fabric: OneLake [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 12-Mar-2024

    Microsoft Fabric & OneLake

    [Microsoft Fabric] OneLake

    • a single, unified, logical data lake for the whole organization [2]
      • designed to be the single place for all an organization's analytics data [2]
      • provides a single, integrated environment for data professionals and the business to collaborate on data projects [1]
      • stores all data in a single open format [1]
      • its data is governed by default
      • combines storage locations across different regions and clouds into a single logical lake, without moving or duplicating data
        • similar to how Office applications are prewired to use OneDrive
        • saves time by eliminating the need to move and copy data 
    • comes automatically with every Microsoft Fabric tenant [2]
      • automatically provisions with no extra resources to set up or manage [2]
      • used as native store without needing any extra configuration [1]
    • accessible by all analytics engines in the platform [1]
      • all the compute workloads in Fabric are preconfigured to work with OneLake
        • compute engines have their own security models (aka compute-specific security) 
          • always enforced when accessing data using that engine [3]
          • the conditions may not apply to users in certain Fabric roles when they access OneLake directly [3]
    • built on top of ADLS  [1]
      • supports the same ADLS Gen2 APIs and SDKs to be compatible with existing ADLS Gen2 applications [2] (see the Python sketch after these notes)
      • inherits its hierarchical structure
      • provides a single-pane-of-glass file-system namespace that spans across users, regions and even clouds
    • data can be stored in any format
      • incl. Delta, Parquet, CSV, JSON
      • data can be addressed in OneLake as if it's one big ADLS storage account for the entire organization [2]
    • uses a layered security model built around the organizational structure of experiences within MF [3]
      • derived from Microsoft Entra authentication [3]
      • compatible with user identities, service principals, and managed identities [3]
      • using Microsoft Entra ID and Fabric components, one can build out robust security mechanisms across OneLake, ensuring that you keep your data safe while also reducing copies and minimizing complexity [3]
    • hierarchical in nature 
      • {benefit} simplifies management across the organization
      • its data is divided into manageable containers for easy handling
      • can have one or more capacities associated with it
        • different items consume different capacity at a certain time
        • offered through Fabric SKU and Trials
    • {component} OneCopy
      • allows to read data from a single copy, without moving or duplicating data [1]
    • {concept} Fabric tenant
      • a dedicated space for organizations to create, store, and manage Fabric items.
        • there's often a single instance of Fabric for an organization, and it's aligned with Microsoft Entra ID [1]
          • ⇒ one OneLake per tenant
        • maps to the root of OneLake and is at the top level of the hierarchy [1]
      • can contain any number of workspaces [2]
    • {concept} capacity
      • a dedicated set of resources that is available at a given time to be used [1]
      • defines the ability of a resource to perform an activity or to produce output [1]
    • {concept} domain
      • a way of logically grouping together workspaces in an organization that is relevant to a particular area or field [1]
      • can have multiple [subdomains]
        • {concept} subdomain
          • a way for fine tuning the logical grouping of the data
    • {concept} workspace 
      • a collection of Fabric items that brings together different functionality in a single tenant [1]
        • different data items appear as folders within those containers [2]
        • always lives directly under the OneLake namespace [4]
        • {concept} data item
          • a subtype of item that allows data to be stored within it using OneLake [4]
          • all Fabric data items store their data automatically in OneLake in Delta Parquet format [2]
        • {concept} Fabric item
          • a set of capabilities bundled together into a single component [4] 
          • can have permissions configured separately from the workspace roles [3]
          • permissions can be set by sharing an item or by managing the permissions of an item [3]
      • acts as a container that leverages capacity for the work that is executed [1]
        • provides controls for who can access the items in it [1]
          • security can be managed through Fabric workspace roles
        • enable different parts of the organization to distribute ownership and access policies [2]
        • part of a capacity that is tied to a specific region and is billed separately [2]
        • the primary security boundary for data within OneLake [3]
      • represents a single domain or project area where teams can collaborate on data [3]
    • [encryption] encrypted at rest by default using Microsoft-managed key [3]
      • the keys are rotated appropriately per compliance requirements [3]
      • data is encrypted and decrypted transparently using 256-bit AES encryption, one of the strongest block ciphers available, and it is FIPS 140-2 compliant [3]
      • {limitation} encryption at rest using customer-managed key is currently not supported [3]
    • {general guidance} write access
      • users must be part of a workspace role that grants write access [4] 
      • rule applies to all data items, so scope workspaces to a single team of data engineers [4] 
    • {general guidance} lake access
      • users must be part of the Admin, Member, or Contributor workspace roles, or share the item with ReadAll access [4] 
    • {general guidance} general data access 
      • any user with Viewer permissions can access data through the warehouses, semantic models, or the SQL analytics endpoint for the Lakehouse [4] 
    • {general guidance} object level security:
      • give users access to a warehouse or lakehouse SQL analytics endpoint through the Viewer role and use SQL DENY statements to restrict access to certain tables [4]
    • {feature|preview} trusted workspace access
      • allows to securely access firewall-enabled Storage accounts by creating OneLake shortcuts to Storage accounts, and then use the shortcuts in the Fabric items [5]
      • based on [workspace identity]
      • {benefit} provides secure seamless access to firewall-enabled Storage accounts from OneLake shortcuts in Fabric workspaces, without the need to open the Storage account to public access [5]
      • {limitation} available for workspaces in Fabric capacities F64 or higher
    • {concept} workspace identity
      • a unique identity that can be associated with workspaces that are in Fabric capacities
      • enables OneLake shortcuts in Fabric to access Storage accounts that have [resource instance rules] configured
      • {operation} creating a workspace identity
        • Fabric creates a service principal in Microsoft Entra ID to represent the identity [5]
    • {concept} resource instance rules
      • a way to grant access to specific resources based on the workspace identity or managed identity [5] 
      • {operation} create resource instance rules 
        • created by deploying an ARM template with the resource instance rule details [5]
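
    The notes above mention that OneLake supports the same APIs and SDKs as ADLS Gen2. As a small illustration of what this compatibility means in practice, the Python sketch below lists the files of a lakehouse item through the standard Azure Data Lake Storage SDK pointed at the OneLake endpoint. The workspace name "MyWorkspace" and item name "MyLakehouse" are placeholders; the azure-identity and azure-storage-file-datalake packages and appropriate workspace permissions are assumed.

    # Minimal sketch: reading OneLake through its ADLS Gen2-compatible endpoint.
    # Assumptions: azure-identity and azure-storage-file-datalake are installed and the
    # signed-in identity has access to the (placeholder) workspace "MyWorkspace".
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    # OneLake exposes an ADLS Gen2-style endpoint; the workspace plays the role of the file system.
    service = DataLakeServiceClient(
        account_url="https://onelake.dfs.fabric.microsoft.com",
        credential=DefaultAzureCredential(),
    )
    file_system = service.get_file_system_client("MyWorkspace")  # Fabric workspace (placeholder)

    # List the contents of the Files section of a lakehouse item (placeholder item name).
    for path in file_system.get_paths(path="MyLakehouse.Lakehouse/Files", recursive=True):
        print(path.name)

    Whether such direct access is appropriate depends on the security notes above: workspace roles and item permissions still apply, while compute-specific security is enforced only when going through the respective engine.
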
    Acronyms:
    ADLS - Azure Data Lake Storage
    AES - Advanced Encryption Standard
    ARM - Azure Resource Manager
    FIPS - Federal Information Processing Standard
    MF - Microsoft Fabric
    SKU - Stock Keeping Unit

    References:
    [1] Microsoft Learn (2023) Administer Microsoft Fabric (link)
    [2] Microsoft Learn (2023) OneLake, the OneDrive for data (link)
    [3] Microsoft Learn (2023) OneLake security (link)
    [4] Microsoft Learn (2023) Get started securing your data in OneLake (link)
    [5] Microsoft Fabric Updates Blog (2024) Introducing Trusted Workspace Access for OneLake Shortcuts, by Meenal Srivastva (link)


