SQL Troubles

18 December 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part VII: Data Stores Comparison)

Business Intelligence Series

Microsoft made available a reference guide for the data stores supported for Microsoft Fabric workloads [1], including the new Fabric SQL database (see previous post). Here's the consolidated table followed by a few aspects to consider:

Area	Lakehouse	Warehouse	Eventhouse	Fabric SQL database	Power BI Datamart
Data volume	Unlimited	Unlimited	Unlimited	4 TB	Up to 100 GB
Type of data	Unstructured, semi-structured, structured	Structured, semi-structured (JSON)	Unstructured, semi-structured, structured	Structured, semi-structured, unstructured	Structured
Primary developer persona	Data engineer, data scientist	Data warehouse developer, data architect, data engineer, database developer	App developer, data scientist, data engineer	AI developer, App developer, database developer, DB admin	Data scientist, data analyst
Primary dev skill	Spark (Scala, PySpark, Spark SQL, R)	SQL	No code, KQL, SQL	SQL	No code, SQL
Data organized by	Folders and files, databases, and tables	Databases, schemas, and tables	Databases, schemas, and tables	Databases, schemas, tables	Database, tables, queries
Read operations	Spark, T-SQL	T-SQL, Spark*	KQL, T-SQL, Spark	T-SQL	Spark, T-SQL
Write operations	Spark (Scala, PySpark, Spark SQL, R)	T-SQL	KQL, Spark, connector ecosystem	T-SQL	Dataflows, T-SQL
Multi-table transactions	No	Yes	Yes, for multi-table ingestion	Yes, full ACID compliance	No
Primary development interface	Spark notebooks, Spark job definitions	SQL scripts	KQL Queryset, KQL Database	SQL scripts	Power BI
Security	RLS, CLS**, table level (T-SQL), none for Spark	Object level, RLS, CLS, DDL/DML, dynamic data masking	RLS	Object level, RLS, CLS, DDL/DML, dynamic data masking	Built-in RLS editor
Access data via shortcuts	Yes	Yes	Yes	Yes	No
Can be a source for shortcuts	Yes (files and tables)	Yes (tables)	Yes	Yes (tables)	No
Query across items	Yes	Yes	Yes	Yes	No
Advanced analytics	Interface for large-scale data processing, built-in data parallelism, and fault tolerance	Interface for large-scale data processing, built-in data parallelism, and fault tolerance	Time Series native elements, full geo-spatial and query capabilities	T-SQL analytical capabilities, data replicated to delta parquet in OneLake for analytics	Interface for data processing with automated performance tuning
Advanced formatting support	Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format	Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format	Full indexing for free text and semi-structured data like JSON	Table support for OLTP, JSON, vector, graph, XML, spatial, key-value	Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format
Ingestion latency	Available instantly for querying	Available instantly for querying	Queued ingestion, streaming ingestion has a couple of seconds latency	Available instantly for querying	Available instantly for querying

It can be used as a map for what is needed to know for using each feature, respectively to identify how one can use the previous experience, and here I'm referring to the many SQL developers. One must consider also the capabilities and limitations of each storage repository.

However, what I'm missing is some references regarding the performance for data access, especially compared with on-premise workloads. Moreover, the devil hides in details, therefore one must test thoroughly before committing to any of the above choices. For the newest overview please check the referenced documentation!

For lakehouses, the hardest limitation is the lack of multi-table transactions, though that's understandable given its scope. However, probably the most important aspect is whether it can scale with the volume of reads/writes as currently the SQL endpoint seems to lag.

The warehouse seems to be more versatile, though careful attention needs to be given to its design.

The Eventhouse opens the door to a wide range of time-based scenarios, though it will be interesting how developers cope with its lack of functionality in some areas.

Fabric SQL databases are a new addition, and hopefully they'll allow considering a wide range of OLTP scenarios. Starting with 28th of March 2025, SQL databases will be ON by default and tenant admins must manually turn them OFF before the respective date [3].

Power BI datamarts have been in preview for a couple of years.

Previous Post <<||>> Next Post

References:
[1] Microsoft Fabric (2024) Microsoft Fabric decision guide: choose a data store [link]

[2] Reitse's blog (2024) Testing Microsoft Fabric Capacity: Data Warehouse vs Lakehouse Performance [link]

[3] Microsoft Fabric Update Blog (2025) Extending flexibility: default checkbox changes on tenant settings for SQL database in Fabric [link]

[4] Microsoft Fabric Update Blog (2025) Enhancing SQL database in Fabric: share your feedback and shape the future [link]

[5] Microsoft Fabric Update Blog (2025) Why SQL database in Fabric is the best choice for low-code/no-code Developers [link]

14 December 2024

🧭💹Business Intelligence: Perspectives (Part 21: Data Visualization Revised)

Data Visualization Series

Creating data visualizations nowadays became so easy that anybody can do it with a minimum of effort and knowledge, which on one side is great for the creators but can be easily become a nightmare for the readers, respectively users. Just dumping data in visuals can be barely called data visualization, even if the result is considered as such. The problems of visualization are multiple – the lack of data culture, the lack of understanding processes, data and their characteristics, the lack of being able to define and model problems, the lack of educating the users, the lack of managing the expectations, etc.

There are many books on data visualization though they seem an expensive commodity for the ones who want rapid enlightenment, and often the illusion of knowing proves maybe to be a barrier. It's also true that many sets of data are so dull, that the lack of information and meaning is compensated by adding elements that give a kitsch look-and-feel (aka chartjunk), shifting the attention from the valuable elements to decorations. So, how do we overcome the various challenges?

Probably, the most important step when visualizing data is to define the primary purpose of the end product. Is it to inform, to summarize or to navigate the data, to provide different perspectives at macro and micro level, to help discovery, to explore, to sharpen the questions, to make people think, respectively understand, to carry a message, to be artistic or represent truthfully the reality, or maybe is just a filler or point of attraction in a textual content?

Clarifying the initial purpose is important because it makes upfront the motives and expectations explicit, allowing to determine the further requirements, characteristics, and set maybe some limits in what concern the time spent and the qualitative and/or qualitative criteria upon which the end result should be eventually evaluated. Narrowing down such aspects helps in planning and the further steps performed.

Many of the steps are repetitive and past experience can help reduce the overall effort. Therefore, professionals in the field, driven by intuition and experience probably don't always need to go through the full extent of the process. Conversely, what is learned and done poorly, has high chances of delivering poor quality.

A visualization can be considered as effective when it serves the intended purpose(s), when it reveals with minimal effort the patterns, issues or facts hidden in the data, when it allows people to explore the data, ask questions and find answers altogether. One can talk also about efficiency, especially when readers can see at a glance the many aspects encoded in the visualization. However, the more the discovery process is dependent on data navigation via filters or other techniques, the more difficult it becomes to talk about efficiency.

Better criteria to judge visualizations is whether they are meaningful and useful for the readers, whether the readers understood the authors' intent, the further intrinsic implication, though multiple characteristics can be associated with these criteria: clarity, specificity, correctedness, truthfulness, appropriateness, simplicity, etc. All these are important in lower or higher degree depending on the broader context of the visualization.

All these must be weighted in the bigger picture when creating visualizations, though there are probably also exceptions, especially on the artistic side, where artists can cut corners for creating an artistic effect, though also in here the authors need to be truthful to the data and make sure that their work don't distort excessively the facts. Failing to do so might not have an important impact on the short term considerably, though in time the effects can ripple with unexpected effects.

Previous Post <<||>> Next Post

13 December 2024

🧭💹Business Intelligence: Perspectives (Part 20: From BI to AI)

Business Intelligence Series

No matter how good data visualizations, reports or other forms of BI artifacts are, they only serve a set of purposes for a limited amount of time, limited audience or any other factors that influence their lifespan. Sooner or later the artifacts become thus obsolete, being eventually disabled, archived and/or removed from the infrastructure.

Many artifacts require a considerable number of resources for their creation and maintenance over time. Sometimes the costs can be considerably higher than the benefits brought, especially when the data or the infrastructure are used for a narrow scope, though there can be other components that need to be considered in the bigger picture. Having a report or visualization one can use when needed can have an important impact on the business in correcting issues, sizing opportunities or filling the knowledge gaps.

Even if it’s challenging to quantify the costs associated with the loss of opportunities rooted in the lack of data, respectively information, the amounts can be considerable high, greater even than building a whole BI infrastructure. Organization’s agility in addressing the important gaps can make a considerable difference, at least in theory. Having the resources that can be pulled on demand can give organizations the needed competitive boost. Internal or external resources can be used altogether, though, pragmatically speaking, there will be always a gap between demand and supply of knowledgeable resources.

The gap in BI artefacts can be addressed nowadays by AI-driven tools, which have the theoretical potential of shortening the gap between needs and the availability of solutions, respectively a set of answers that can be used in the process. Of course, the processes of sense-making and discovery are not that simple as we’d like, though it’s a considerable step forward.

Having the possibility of asking questions in natural language and guiding the exploration process to create visualizations and other artifacts using prompt engineering and other AI-enabled methods offers new possibilities and opportunities that at least some organizations started exploring already. This however presumes the existence of an infrastructure on which the needed foundation can be built upon, the knowledge required to bridge the gap, respectively the resources required in the process.

It must be stressed out that the exploration processes may bring no sensible benefits, at least no immediately, and the whole process depends on organizations’ capabilities of identifying and sizing the respective opportunities. Therefore, even if there are recipes for success, each organization must identify what matters and how to use technologies and the available infrastructure to bridge the gap.

Ideally to make progress organizations need besides the financial resources the required skillset, a set of projects that support learning and value creation, respectively the design and execution of a business strategy that addresses the steps ahead. Each of these aspects implies risks and opportunities altogether. It will be a test of maturity for many organizations. It will be interesting to see how many organizations can handle the challenge, respectively how much past successes or failures will weigh in the balance.

AI offers a set of capabilities and opportunities, however the chance of exploring and failing fast is of great importance. AI is an enabler and not a magic wand, no matter what is preached in technical workshops! Even if progress follows an exponential trajectory, it took us more than half of century from the first steps until now and probably many challenges must be still overcome.

The future looks interesting enough to be pursued, though are organizations capable to size the opportunities, respectively to overcome the challenges ahead? Are organizations capable of supporting the effort without neglecting the other priorities?

Previous Post <<||>> Next Post

12 December 2024

🧭💹Business Intelligence: Perspectives (Part 19: Data Visualization between Art, Pragmatism and Kitsch)

Business Intelligence Series

The data visualizations (aka dataviz) presented in the media, especially the ones coming from graphical artists, have the power to help us develop what is called graphical intelligence, graphical culture, graphical sense, etc., though without a tutor-like experience the process is suboptimal because it depends on our ability of identifying what is important and which are the steps needed for decoding and interpreting such work, respectively for integrating their messages in our overall understanding about the world.

When such skillset is lacking, without explicit annotations or other form of support, the reader might misinterpret or fail to observe important visual cues even for simple visualizations, with all the implications deriving from this – a false understanding, and further aspects deriving from it, this being probably the most important aspect to consider. Unfortunately, even the most elaborate work can fail if the reader doesn’t have a basic understanding of all that’s implied in the process.

The books of Willard Brinton, Ana Rogers, Jacques Bertin, William Cleveland, Leland Wilkinson, Stephen Few, Albert Cairo, Soctt Berinato and many others can help the readers build a general understanding of the dataviz process and how data visualizations or simple graphics can be used/misused effectively, though each reader must follow his/her own journey. It’s also true that the basics can be easily learned, though the deeper one dives, the more interesting and nontrivial the journey becomes. Fortunately, the average reader can stick to the basics and many visualizations are simple enough to be understood.

To grasp the full extent of the implications, one can make comparisons with the domain of poetry where the author uses basic constructs like metaphor, comparisons, rhythm and epithets to create, communicate and imprint in reader’s mind old and new meanings, images and feelings altogether. Artistic data visualizations tend to offer similar charge as poetry does, even if the impact might not appeal so much to our artistic sensibility. Though dataviz from this perspective is or at least resembles an art form.

Many people can write verses, though only a fraction can write good meaningful poetry, from which a smaller fraction get poems, respectively even fewer get books published. Conversely, not everything can be expressed in verses unless one finds good metaphors and other aspects that can be leveraged in the process. Same can be said about good dataviz.

One can argue that in dataviz the author can explore and learn especially by failing fast (seeing what works and what doesn’t). One can also innovate, though the creator has probably a limited set of tools and rules for communication. Enabling readers to see the obvious or the hidden in complex visualizations or contexts requires skill and some kind of mastery of the visual form.

Therefore, dataviz must be more pragmatic and show the facts. In art one has the freedom to distort or move things around to create new meanings, while in dataviz it’s important for the meaning to be rooted in 'truth', at least by definition. The more the creator of a dataviz innovates, the higher the chances of being misunderstood. Moreover, readers need to be educated in interpreting the new meanings and get used to their continuous use.

Kitsch is a term applied to art and design that is perceived as naïve imitation to the degree that it becomes a waste of resources even if somebody pays the tag price. There’s a trend in dataviz to add elements to visualizations that don’t bring any intrinsic value – images, colors and other elements can be misused to the degree that the result resembles kitsch, and the overall value of the visualization is diminished considerably.

Previous Post <<||>> Next Post

09 December 2024

🏭🗒️Microsoft Fabric: Microsoft Fabric [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 8-Dec-2024

Microsoft Fabric

{goal}complete (end-to-end) analytics platform [6]

{characteristic} unified

{objective} provides a single, integrated environment for all the organization

{benefit} data professionals and the business users can collaborate on data projects [5] and solutions

{characteristic}serverless SaaS model (aka SaaS-ified)

{objective} provisioned automatically with the tenant [6]
{objective} highly scalable [5]
{objective} cost-effectiveness [5]
{objective} accessible

⇐ from anywhere with an internet connection [5]

{objective} continuous updates

⇐ provided by Microsoft

{objective} continuous maintenance

⇐ provided by Microsoft

provides a set of integrated services that enable to ingest, store, process, and analyze data in a single environment [5]

{objective} secure
{objective} governed

{goal} lake-centric

{characteristic} OneLake-based

all workloads automatically store their data in the OneLake workspace folders [6]
all the data is organized in an intuitive hierarchical namespace [6]
data is automatically indexed [6]
provides a set of features

discovery
MIP labels
lineage
PII scans
sharing
governance
compliance

{characteristic} one copy

available for all computes
all compute engines store their data automatically in OneLake

⇐ the data is stored in a (single) common format

⇐ delta parquet file format

open standards format
the storage format for all tabular data in Microsoft Fabric

⇐ the data is directly accessible by all the engines [6]

⇐ no import/export needed

all compute engines are fully optimized to work with Delta Parquet as their native format [6]
a shared universal security model is enforced across all the engines [6]

{characteristic} open at every tier

{goal} empowering

{characteristic} intuitive
{characteristic} built into M365
{characteristic} insight to action

{goal} AI-powered

{characteristic} Copilot accelerated
{characteristic} ChatGPT enabled
{characteristic} AI-driven insights

complete analytics platform

addresses the needs of all data professionals and business users who target harnessing the value of data

{feature} scales automatically

the system automatically allocates an appropriate number of compute resources based on the job size
the cost is proportional to total resource consumption, rather than size of cluster or number of resources allocated
jobs in general complete faster (and usually, at less overall cost)

⇒ not need to specify cluster sizes

natively supports

Spark
data science
log-analytics
real-time ingestion and messaging
alerting
data pipelines
Power BI reporting
interoperability with third-party services

from other vendors that support the same open

data virtualization mechanisms

{feature} mirroring [notes]
{feature} shortcuts [notes]

allow users to reference data without copying it
{benefit} make other domain data available locally without the need for copying data

{feature} tenant (aka Microsoft Fabric tenant, MF tenant)

a single instance of Fabric for an organization that is aligned with a Microsoft Entra ID
can contain any number of workspaces

{feature} workspaces

{definition} a collection of items that brings together different functionality in a single environment designed for collaboration
associated with a domain [3]

{feature} domains [notes]

{definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]
subdomains

a way for fine tuning the logical grouping data under a domain [1]

subdivisions of a domain

Previous Post <<||>> Next Post

Resources:
[1] Microsoft Learn (2023) Administer Microsoft Fabric [link]
[2] Microsoft Learn: Fabric (2024) Governance overview and guidance [link]
[3] Microsoft Learn: Fabric (2023) Fabric domains [link]
[4] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam [link]
[5] Microsoft Learn: Fabric (2024) Introduction to end-to-end analytics using Microsoft Fabric [link]
[6] Microsoft Fabric (2024) Fabric Analyst in a Day [course notes]

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
API - Application Programming Interface
M365 - Microsoft 365
MF - Microsoft Fabric
PII - Personal Identification Information
SaaS - software-as-a-service

🏭🗒️Microsoft Fabric: Delta Lake [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!
Last updated: 1-Apr-2024

Delta Lake

{def} an optimized open source storage layer that runs on top of a data lake [1]

the default storage format in a Fabric lakehouse [1]
stores data in Parquet file format
is a variant of log-structured files
initially developed at Databricks
fully compatible with Apache Spark APIs [1]

{characteristic} high reliability
{characteristic} secure
{characteristic} performant

provides low latency

{feature} data indexing

indexes are created and maintained on the ingested data [1]

increases the querying speed significantly [1]

{feature} data skipping

file statistics are maintains so that data subsets relevant to the query are used instead of entire tables - this partition pruning avoids processing data that is not relevant to the query [1]
helps complex queries to read only the relevant subsets to fulfil query [1]

{feature} multidimensional clustering

uses the Z-ordering algorithm
enables data skipping this.

{feature} compaction

compacts or combines multiple small files into more efficient larger ones [1]

speeds up query performance

storing and accessing small files can be processing-intensive, slow and inefficient from a storage utilization perspective [1]

{feature} data caching

highly accessed data is automatically cached to speed access for queries

{feature} ACID transactions

"all or nothing" ACID transaction approach is employed to prevent data corruption

⇐ partial or failed writes risk corrupting the data [1]

{feature} snapshot isolation (aka SI)

ensures that multiple writers can write to a dataset simultaneously without interfering with jobs reading the dataset [1]

{feature} schema enforcement

data can be stored using a schema
{benefit} helps ensure data integrity for ingested data by providing schema enforcement [1]

potential data corruption with incorrect or invalid schemas is avoided [1]

{feature} transaction log checkpoints (aka checkpointing)

optimize checkpoint querying

the table versions are aggregated to Parquet checkpoint files [7]

[Azure Databricks] optimizes checkpointing frequency for data size and workload [7]
employed to provide a robust exactly once delivery semantic [1]
{benefit} ensures that data is neither missed nor repeated erroneously [1]
prevents the need to read all JSON versions of table history [7]
checkpoint frequency is subject to change [7]
manages log file removal automatically after checkpointing table versions [7]

{feature} UPSERTS and DELETES support

provide a more convenient way of dealing with such changes [1]

{feature} unified streaming and batch data processing

both batch and streaming data are handled via a direct integration with Structured Streaming for low latency updates [1]

{benefit} simplifies the system architecture [1]
{benefit} results in shorter time from data ingest to query result [1]

can concurrently write batch and streaming data to the same data table [1]

{feature} schema evolution

schema is inferred from input data
{benefit} reduces the effort for dealing with schema impact of changing business needs at multiple levels of the pipeline/data stack [1]

{feature} scalable metadata handling
{feature} predictive optimization

removes the need to manually manage maintenance operations for delta tables [8]
{enabled} automatically identifies tables that would benefit from maintenance operations, and then optimizes their storage

{feature} historical retention

{default} maintains a history of all changes made [4]
{benefit} enhanced regulatory compliance and audit
{recommendation} keep historical data only for their intended lifecycle [AN]

allows to reduce storage costs [4]

{feature} time travel

supports querying previous table versions based on timestamp or table version [7]

as recorded in the transaction log [7]

{benefit} support for data rollback
{benefit} lets users query point-in-time snapshots [5]
- {scenario} recreating analyses, reports, or outputs for debugging or auditing [7]
- {scenario} writing complex temporal queries [7]
- {scenario} fixing data mistakes [7]
- {scenario} providing snapshot isolation for a set of queries for fast changing tables [7]
table versions accessible with time travel are determined by a combination of

the retention threshold for transaction log files [7]
the frequency and specified retention for VACUUM operations [7]

if VACUUM is run daily with the default values, 7 days of data is available for time travel [7]
{best practice} all writes and reads should go through Delta Lake [1]

{benefit} ensure consistent overall behavior [1]

{operation} restore

restores a table to its earlier state

via RESTORE command

one can restore an already restored table [7]
one can restore a cloned table [7]
{permissions} MODIFY permission needed on the table being restored [7]
{exception} one can’t restore a table to an older version where the data files were deleted manually or by vacuum [7]
considered a data-changing operation

Delta Lake log entries added by the RESTORE command contain dataChange set to true [7]

{operation} recreate table

when deleting and recreating a table in the same location, always use a CREATE OR REPLACE TABLE statement.{best practice}run OPTIMIZE Regularly

{exception} should not be run on base or staging tables [1]

{best practice} run VACUUM Regularly

cleans up expired snapshots that are no longer required [1]

{best practice} use MERGE INTO to batch changes

{benefit} allows to efficiently rewrite queries to implement updates to archived data and compliance workflows [5]

{best practice} use DELETE commands

{benefit} ensures proper progression of the change [1]
{warning} manually deleting files from the underlying storage is likely to break the table [1]

Previous Post <<||>> Next Post

References:
[1] Azure Databricks (2023) Delta Lake on Azure Databricks
[2] Josep Aguilar-Saborit et al, POLARIS: The Distributed SQL Engine in Azure Synapse, PVLDB, 13(12), 2020 (link)
[3] Josep Aguilar-Saborit et al, Extending Polaris to Support Transactions 2024
[4] Implement medallion lakehouse architecture in Microsoft Fabric (link)
[5] Michael Armbrust et al (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment13(12) (link)
[6] Bennie Haelen & Dan Davis (2024) Delta Lake: Up and Running Modern Data Lakehouse Architectures with Delta Lake
[7] Microsoft Learn (2024) Best practices: Delta Lake [link]

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:

ACID - atomicity, consistency, isolation, durability

08 December 2024

🏭🗒️Microsoft Fabric: Shortcuts [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 29-May-2025

[Microsoft Fabric] Shortcut

{def} object that points to other internal or external storage location (aka shortcut) [1] and that can be used for data access

serves as virtual pointer to data stored in other locations [6]
{goal} unifies existing data without copying or moving it [2]

⇒ data can be used multiple times without being duplicated [2]
{benefit} helps to eliminate edge copies of data [1]
{benefit} reduces process latency associated with data copies and staging [1]

is a mechanism that allows to unify data across domains, clouds, and accounts through a namespace [1]

⇒ allows creating a single virtual data lake for the entire enterprise [1]
⇐ available in all Fabric experiences [1]
⇐ behave like symbolic links [1]

independent object from the target [1]
appear as folder [1]
can be used by workloads or services that have access to OneLake [1]
transparent to any service accessing data through the OneLake API [1]

can point to

OneLake locations
ADLS Gen2 storage accounts
Amazon S3 storage accounts
Dataverse
on-premises or network-restricted locations via PDF

{capability} create shortcut to consolidate data across artifacts or workspaces, without changing data's ownership [2]
{capability} data can be compose throughout OneLake without any data movement [2]
{capability} allow instant linking of data already existing in Azure and in other clouds, without any data duplication and movement [2]

⇐ makes OneLake the first multi-cloud data lake [2]

{capability} provides support for industry standard APIs

⇒ OneLake data can be directly accessed via shortcuts by any application or service [2]

{operation} creating a shortcut

can be created in

lakehouses
KQL databases

⇐ shortcuts are recognized as external tables [1]

can be created via

Fabric UI
REST API

can be created across items [1]

the item types don't need to match [1]

e.g. create a shortcut in a lakehouse that points to data in a data warehouse [1]

[lakehouse] tables folder

represents the managed portion of the lakehouse

shortcuts can be created only at the top level [1]

⇒ shortcuts aren't supported in other subdirectories [1]

if shortcut's target contains data in the Delta\Parquet format, the lakehouse automatically synchronizes the metadata and recognizes the folder as a table [1]

[lakehouse] files folder

represents the unmanaged portion of the lakehouse [1]
there are no restrictions on where shortcuts can be created [1]

⇒ can be created at any level of the folder hierarchy [1]
⇐ table discovery doesn't happen in the Files folder [1]

[lakehouse] all shortcuts are accessed in a delegated mode when querying through the SQL analytics endpoint [5]

the delegated identity is the Fabric user that owns the lakehouse [5]

{default} the owner is the user that created the lakehouse and SQL analytics endpoint [5]

⇐ can be changed in select cases
the current owner is displayed in the Owner column in Fabric when viewing the item in the workspace item list

⇒ the querying user is able to read from shortcut tables if the owner has access to the underlying data, not the user executing the query [5]

⇐ the querying user only needs access to select from the shortcut table [5]

{feature} OneLake data access roles

{enabled} access to a shortcut is determined by whether the SQL analytics endpoint owner has access to see the target lakehouse and read the table through a OneLake data access role [5]
{disabled} shortcut access is determined by whether the SQL analytics endpoint owner has the Read and ReadAll permission on the target path [5]

{operation} renaming a shortcut
{operation} moving a shortcut
{operation} deleting a shortcut

doesn't affect the target [1]

⇐ only the shortcut object is deleted [1]

shortcuts don't perform cascading deletes [1]
moving, renaming, or deleting a target path can break the shortcut [1]

{operation} delete file/folder

file or folder within a shortcut can be deleted when the permissions in the shortcut target allows it [1]

{permissions} users must have permissions in the target location to read the data [1]

when a user accesses data through a shortcut to another OneLake location, the identity of the calling user is used to authorize access to the data in the target path of the shortcut [1] (aka passthrough auth model [6])

ensures that any user accessing the shortcut is only able to see whatever they have access to in the target [6]
the security from the target ‘flows across’ the shortcut to restrict access in the source lakehouse [6]
OneLake to OneLake shortcuts support only passthrough mode [6]

ensures that the source system retains full control over its data [6]

⇐ there’s no need to replicate or redefine access controls for the shortcut [6]
{benefit} reduces administrative overhead since security policies only need to be maintained in one place [6]
{constraint} security cannot be modified directly from the downstream item [6]

ensures that the source system retains full control over its data [6]

any changes to access permissions must be made at the source location [6]
the source remains the single point of truth for access control [6]

⇐ ensures consistency
⇐ minimes the risk of misconfiguration [6]

{type} delegated auth mode

shortcuts access data by using some intermediate credential

e.g. another user or an account key
allow for permission management to be separated or ‘delegated’ to another team or downstream user to manage [6]

always break the flow of security from one system to another [6]
all delegated shortcuts in OneLake can have OneLake security roles defined for them [6]

all shortcuts from OneLake to external systems are delegated [6]

e.g. AWS S3 or Google Cloud Storage
allows users to connect to the external system without being given direct access [6]
OneLake security can then be configured on the shortcut to limit what data in the external system can be accessed [6]

when accessing shortcuts through Power BI semantic models or T-SQL, the calling user’s identity is not passed through to the shortcut target [1]

the calling item owner’s identity is passed instead, delegating access to the calling user [1]

OneLake manages all permissions and credentials

{type} OneLake to OneLake shortcuts

ideal for ensuring the hub retains control over sensitive or regulated data [6]

each downstream team

can then only consume the data they are allowed to [6]
has the freedom to create its own reports or combine the hub data with other data that they own [6]

{concept} hub-and-spoke model

allows to manage the data access across multiple teams or departments [6]
{component} hub

the central data repository where core datasets are stored [6]
security policies are meticulously defined to ensure robust control [6]

{component} spokes

individual teams or departments access the hub’s data through shortcuts [6]

{advantage} enables centralized governance while allowing decentralized consumption and use of data [6]
can be leveraged in various ways to create efficient and secure data architectures [6]

{type} delegated shortcuts

allow to share data securely centralize data across clouds, without copying it [6]

the data that already exists in various cloud storage accounts is consolidated in OneLake through the use of delegated shortcuts [6]
a new lakehouse is created as the consolidation point [6]
each external data source is connected via a delegated shortcut [6]

the admin can define OneLake security roles to govern access
granularity: row, column, schemas or shortcuts [6]

⇒ no user will have direct access to the external data ⇐ they will be limited to only what the admin allows through OneLake security [6]
⇐ once the data is consolidated, it can be combined with the hub-and-spoke model to create a composite architecture that keeps both upstream and downstream data safe [6]

{feature} shortcut caching

{def} mechanism used to reduce egress costs associated with cross-cloud data access [1]

when files are read through an external shortcut, the files are stored in a cache for the Fabric workspace [1]

subsequent read requests are served from cache rather than the remote storage provider [1]
cached files have a retention period of 24 hours
each time the file is accessed the retention period is reset [1]
if the file in remote storage provider is more recent than the file in the cache, the request is served from remote storage provider and the updated file will be stored in cache [1]
if a file hasn’t been accessed for more than 24hrs it is purged from the cache [1]

{restriction} individual files greater than 1 GB in size are not cached [1]
{restriction} only GCS, S3 and S3 compatible shortcuts are supported [1]

{feature} query acceleration

caches data as it lands in OneLake, providing performance comparable to ingesting data in Eventhouse [4]

{limitation} maximum number of shortcuts [1]

per Fabric item: 100,000
in a single OneLake path: 10
direct shortcuts to shortcut links: 5

{limitation} ADLS and S3 shortcut target paths can't contain any reserved characters from RFC 3986 section 2.2 [1]
{limitation} shortcut names, parent paths, and target paths can't contain "%" or "+" characters [1]
{limitation} shortcuts don't support non-Latin characters[1]
{limitation} Copy Blob API not supported for ADLS or S3 shortcuts[1]
{limitation} copy function doesn't work on shortcuts that directly point to ADLS containers

{recommended} create ADLS shortcuts to a directory that is at least one level below a container [1]

{limitation} additional shortcuts can't be created inside ADLS or S3 shortcuts [1]
{limitation} lineage for shortcuts to Data Warehouses and Semantic Models is not currently available[1]
{limitation} it may take up to a minute for the Table API to recognize new shortcuts [1]
introduce unique considerations when it comes to security [6]

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Fabric: OneLake shortcuts [link]
[2] Microsoft Learn (2024) Fabric Analyst in a Day [course notes]

[3] Microsoft Learn (2024) Use OneLake shortcuts to access data across capacities: Even when the producing capacity is paused! [link]

[4] Microsoft Learn (2024) Fabric: Query acceleration for OneLake shortcuts - overview (preview) [link]

[5] Microsoft Learn (2024) Microsoft Fabric: How to secure a lakehouse for Data Warehousing teams [link]

[6] Microsoft Fabric Update Blog (2025) Understanding OneLake Security with Shortcuts [link]

Acronyms:

ADLS - Azure Data Lake Storage
API - Application Programming Interface

AWS - Amazon Web Services

GCS - Google Cloud Storage

KQL - Kusto Query Language

OPDG - on-premises data gateway

07 December 2024

🏭 💠Data Warehousing: Microsoft Fabric (Part VI: SQL Databases for OLTP scenarios) [new feature]

Data Warehousing Series

One interesting announcements at Ignite is the availability in public preview of SQL databases in Microsoft Fabric, "a versatile and developer-friendly transactional database built on the foundation of Azure SQL database". With this Fabric can address besides OLAP also OLTP scenarios, evolving thus from analytics to a data platform [1]. According to the announcement, besides the AI-optimized architectural aspects, the feature makes the SQL Azure simple, autonomous and secure by design [1], and these latest aspects are considered in this post.

Simplicity revolves around the deployment and configuration of databases, the creation of a new database requiring giving a name and the database is created in seconds [1]. It’s a considerable improvement compared with the relatively complex setup needed for on-premise configurations, though sometimes more flexibility in configuration is needed upfront or over database’s lifetime. To get a database ready for testing one can import a sample database or get specific data via data flows and/or pipelines [1]. As development tools one can use Visual Studio Code or SSMS [1], and probably more tools will be available in time.

The integration with both GitHub and Azure DevOps allows to configure each database under source control, which is needed for many scenarios especially when multiple resources make changes to the database objects [1]. Frankly, that’s mainly important during the development phase, respectively in scenarios in which multiple people make in parallel changes to the logic. It will be interesting to see how much overhead or challenges the feature adds to development and how smoothly everything works together!

The most important aspect for many solutions is the replication of data in near-real time to the (open-source) delta parquet format in OneLake and thus making the data available for analytics almost immediately [1]. Probably, from this aspect many cloud-based applications can benefit, even if the performance might not be as good as in other well-established architectures. However, there are many other scenarios in which one needs to maintain and use data for OLTP/OLAP purposes. This invites adequate testing and a good weighting of the advantages and disadvantages involved.

A SQL database is a native item in Fabric, and therefore it utilizes Fabric capacity units like other Fabric workloads [1]. One can use the Fabric SKU estimator (still in private preview) to estimate the costs [2], though it will be interesting to see how cost-effective the solutions are. Probably, especially when the infrastructure is already available outside of Fabric, it will be easier and cost-effective to use the mirroring functionality. One should test and have a better estimator before moving blindly from the existing infrastructure to Fabric.

SQL databases in Fabric are autonomous by design, while allowing to get the best performance and availability by default [1]. High availability is reached through zone redundancy, while performance is achieved by scaling automatically the storage and compute to accommodate the workloads [1]. The auto-optimization capability is achieved with the help of the latest Intelligent Query Processing (IQP) enhancements, respectively the creation of missing indexes to improve query performance [1]. It will be interesting to see how the whole process works, given that the maintenance of indexes usually involves some challenges (e.g. identifying covering indexes, indexes needed only for temporary workloads, duplicated indexes).

SQL databases in Fabric are automatically configured for high availability with zone redundancy, while storage and compute scale automatically to accommodate the user workload [1]. The database is auto-optimized through the latest IQP enhancements while the system creates any missing indexes to improve query performance. All data is replicated to OneLake by default [1]. Finally, the database always receives the latest security updates with auto-patching, while automatic backups help in disaster recovery scenarios [1], which can be of real help for database administrators.

References:
[1] Microsoft Fabric Updates Blog (2024) Announcing SQL database in Microsoft Fabric Public Preview [link]
[2] Microsoft Fabric Updates Blog (2024) Announcing New Recruitment for the Private Preview of Microsoft Fabric SKU Estimator [link]

10 November 2024

🏭🗒️Microsoft Fabric: Data Mesh [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 23-May-2024

[Microsoft Fabric] Data Mesh

{definition} a type of decentralized data architecture that organizes data based on different business domains [2]

⇐ a centrally managed network of decentralized data products

{concept} landing zone

typically a subscription that needs to be governed by a common policy [7]

{downside} creating one landing zone for every project can lead to too many landing zones to manage

{alternative} landing zones based on a business domain [7]

resources must be managed efficiently in a way that each team is given access to only their resources [7]

⇐ shared resources might be need with separate management and common access to all [7]

need to be linked together into a mesh

via peer-to-peer networks

{concept} connectivity hub
{feature} resource group

{definition} a container that holds related resources for an Azure solution
can be associated with a data product

when the data product becomes obsolete, the resource group can be deleted [7]

{feature} subscription

{definition} a logical unit of Azure services that are linked to an Azure account
can be associated as a landing zone governed by a policy [7]

{feature} tenant (aka Microsoft Fabric tenant, MF tenant)

a single instance of Fabric for an organization that is aligned with a Microsoft Entra ID
can contain any number of workspaces

{feature} workspaces

{definition} a collection of items that brings together different functionality in a single environment designed for collaboration
associated with a domain [3]

{feature} domains

{definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]
some tenant-level settings for managing and governing data can be delegated to the domain level [2]

{feature} subdomains
- a way for fine tuning the logical grouping data under a domain [1]
- subdivisions of a domain
{concept} deployment template

Previous Post <<||>> Next Post

References

[1] Microsoft Learn: Fabric (2023) Fabric domains (link)

[2] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam (link)

[3] Data mesh: A perspective on using Azure Synapse Analytics to build data products, by Amanjeet Singh (link)

[4] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale

[5] Marthe Mengen (2024) How do you set up your Data Mesh in Microsoft Fabric? (link)

[6] Administering Microsoft Fabric - Considering Data Products vs Domains vs Workspaces, by Paul Andrew (link)

[7] Aniruddha Deswandikar (2024) Engineering Data Mesh in Azure Cloud

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

🏭🗒️Microsoft Fabric: Warehouse [Notes]

Last updated: 11-Mar-2024

Warehouse vs SQL analytics endpoint in Microsoft Fabric [3]

[Microsoft Fabric] Warehouse

{def} highly available relational data warehouse that can be used to store and query data in the Lakehouse

supports the full transactional T-SQL capabilities
modernized version of the traditional data warehouse
unifies capabilities from Synapse Dedicated and Serverless SQL Pools

resources are managed elastically to provide the best possible performance

⇒ no need to think about indexing or distribution
a new parser gives enhanced CSV file ingestion time
metadata is cached in addition to data
improved assignment of compute resources to milliseconds
multi-TB result sets are streamed to the client

leverages a distributed query processing engine

provides with workloads that have a natural isolation boundary [3]

true isolation is achieved by separating workloads with different characteristics, ensuring that ETL jobs never interfere with their ad hoc analytics and reporting workloads [3]

{operation} data ingestion

involves moving data from source systems into the data warehouse [2]

the data becomes available for analysis [1]

via Pipelines, Dataflows, cross-database querying, COPY INTO command
no need to copy data from the lakehouse to the data warehouse [1]

one can query data in the lakehouse directly from the data warehouse using cross-database querying [1]

{operation} data storage

involves storing the data in a format that is optimized for analytics [2]

{operation} data processing

involves transforming the data into a format that is ready for consumption by analytical tools [1]

{operation} data analysis and delivery

involves analyzing the data to gain insights and delivering those insights to the business [1]

{operation} designing a warehouse (aka warehouse design)

standard warehouse design can be used

{operation} sharing a warehouse (aka warehouse sharing)

a way to provide users read access to the warehouse for downstream consumption

via SQL, Spark, or Power BI

the level of permissions can be customized to provide the appropriate level of access

{feature} mirroring

provides a modern way of accessing and ingesting data continuously and seamlessly from any database or data warehouse into the Data Warehousing experience in Fabric

any database can be accessed and managed centrally from within Fabric without having to switch database clients
data is replicated in a reliable way in real-time and lands as Delta tables for consumption in any Fabric workload

{feature} v-order

write time optimization to the parquet file format that enables lightning-fast reads under the MF compute engine [5]

{feature} caching

stores frequently accessed data and metadata in a faster storage layer [6]

{feature} snapshots

{def} read-only representation of a warehouse at a specific point in time [7]

{feature} automatic purging

routinely and systematically eliminating expired data periodically [8]
{benefit} proactively helps maintain an efficient and cost-effective data infrastructure [8]

via garbage collection

background process, that periodically identifies and cleans [8]

all the data and log files of dropped tables
aborted transactions
temporary tables
expired files

executes every 24 hours, when the warehouse is active [8]
ensures the data warehouse remains optimized and efficient [8]

{goal} storage cost optimization
{goal} minimize maintenance overhead
{goal} adhering to data retention regulations

{concept} SQL analytics endpoint

a warehouse that is automatically generated from a Lakehouse in Microsoft Fabric [3]

{concept} virtual warehouse

can containing data from virtually any source by using shortcuts [3]

{concept} cross database querying

enables to quickly and seamlessly leverage multiple data sources for fast insights and with zero data duplication [3]

Previous Post <<||>> Next Post

References:

[1] Microsoft Learn (2023) Fabric: Get started with data warehouses in Microsoft Fabric (link)

[2] Microsoft Learn (2023) Fabric: Microsoft Fabric decision guide: choose a data store (link)
[3] Microsoft Learn (2024) Fabric: What is data warehousing in Microsoft Fabric? (link)

[4] Microsoft Learn (2023) Fabric: Better together: the lakehouse and warehouse (link)

[5] Microsoft Learn (2024) Fabric: Understand V-Order for Microsoft Fabric Warehouse [link]

[6] Microsoft Learn (2024) Fabric: Caching in Fabric data warehousing [link]

[7] Microsoft Learn (2024) Fabric: Warehouse Snapshots in Microsoft Fabric (Preview) [link]

[8] Microsoft Fabric Updates Blog (2025) Intelligent Data Cleanup: Smart Purging for Smarter Data Warehouses [link]

Resources:
[R1] Microsoft Learn (2023) Fabric: Data warehousing documentation in Microsoft Fabric (link)

[R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
[R3] Microsoft Learn (2025) Fabric: Share your data and manage permissions [link]
[R4] Microsoft Learn (2025) Microsoft Fabric decision guide: Choose between Warehouse and Lakehouse [link]

Acronyms:

ETL - Extract, Transfer, Load

MF - Microsoft Fabric

16 October 2024

🧭💹Business Intelligence: Perspectives (Part 18: There’s More to Noise)

Business Intelligence Series

Visualizations should be built with an audience's characteristics in mind! Upon case, it might be sufficient to show only values or labels of importance (minima, maxima, inflexion points, exceptions, trends), while other times it might be needed to show all or most of the values to provide an accurate extended perspective. It even might be useful to allow users switching between the different perspectives to reduce the clutter when navigating the data or look at the patterns revealed by the clutter.

In data-based storytelling are typically shown the points, labels and further elements that support the story, the aspects the readers should focus on, though this approach limits the navigability and users’ overall experience. The audience should be able to compare magnitudes and make inferences based on what is shown, and the accurate decoding shouldn’t be taken as given, especially when the audience can associate different meanings to what’s available and what’s missing.

In decision-making, selecting only some well-chosen values or perspectives to show might increase the chances for a decision to be made, though is this equitable? Cherry-picking may be justified by the purpose, though is in general not a recommended practice! What is not shown can be as important as what is shown, and people should be aware of the implications!

One person’s noise can be another person’s signal. Patterns in the noise can provide more insight compared with the trends revealed in the "unnoisy" data shown! Probably such scenarios are rare, though it’s worth investigating what hides behind the noise. The choice of scale, the use of special types of visualizations or the building of models can reveal more. If it’s not possible to identify automatically such scenarios using the standard software, the users should have the possibility of changing the scale and perspective as seems fit.

Identifying patterns in what seems random can prove to be a challenge no matter the context and the experience in the field. Occasionally, one might need to go beyond the general methods available and statistical packages can help when used intelligently. However, a presenter’s challenge is to find a plausible narrative around the findings and communicate it further adequately. Additional capabilities must be available to confirm the hypotheses framed and other aspects related to this approach.

It's ideal to build data models and a set of visualizations around them. Most probable some noise may be removed in the process, while other noise will be further investigated. However, this should be done through adjustable visual filters because what is removed can be important as well. Rare events do occur, probably more often than we are aware and they may remain hidden until we find the right perspective that takes them into consideration.

Probably, some of the noise can be explained by special events that don’t need to be that rare. The challenge is to identify those parameters, associations, models and perspectives that reveal such insights. One’s gut feeling and experience can help in this direction, though novel scenarios can surprise us as well.

Not in every set of data one can find patterns, respectively a story trying to come out. Whether we can identify something worth revealing depends also on the data available at our disposal, respectively on whether the chosen data allow identifying significant patterns. Occasionally, the focus might be too narrow, too wide or too shallow. It’s important to look behind the obvious, to look at data from different perspectives, even if the data seems dull. It’s ideal to have the tools and knowledge needed to explore such cases and here the exposure to other real-life similar scenarios is probably critical!

Previous Post <<||>> Next Post

𖣯Strategic Management: Strategic Perspectives (Part II: The Elephant in the Room)

Strategic Management Perspectives

There’s an ancient parable about several blind people who touch a shape they had never met before, an elephant, and try to identify what it is. The elephant is big, more than each person can sense through direct experience, and people’s experiences don’t correlate to the degree that they don’t trust each other, the situation escalating upon case. The moral of the parable is that we tend to claim (absolute) truths based on limited, subjective experience [1], and this can easily happen in business scenarios in which each of us has a limited view of the challenges we are facing individually and as a collective.

The situation from the parable can be met in business scenarios, when we try to make sense of the challenges we are faced with, and we get only a limited perspective from the whole picture. Only open dialog and working together can get us closer to the solution! Even then, the accurate depiction might not be in sight, and we need to extrapolate the unknown further.

A third-party consultant with experience might be the right answer, at least in theory, though experience and solutions are relative. The consultant might lead us in a direction, though from this to finding the answer can be a long way that requires experimentation, a mix of tactics and strategies that change over time, more sense-making and more challenges lying ahead.

We would like a clear answer and a set of steps that lead us to the solution, though the answer is as usual, it depends! It depends on the various forces/drivers that have the biggest impact on the organization, on the context, on the organization’s goals, on the resources available directly or indirectly, on people’s capabilities, the occurrences of external factors, etc.

In many situations the smartest thing to do is to gather information, respectively perspectives from all the parties. Tools like brainstorming, SWOT/PESTLE analysis or scenario planning can help in sense-making to identify the overall picture and where the gravity point lies. For some organizations the solution will be probably a new ERP system, or the redesign of some processes, introduction of additional systems to track quality, flow of material, etc.

A new ERP system will not necessarily solve all the issues (even if that’s the expectation), and some organizations just try to design the old processes into a new context. Process redesign in some areas can be upon case a better approach, at least as primary measure. Otherwise, general initiatives focused on quality, data/information management, customer/vendor management, integrations, and the list remains open, can provide the binder/vehicle an organization needs to overcome the current challenges.

Conversely, if the ERP or other strategical systems are 10-20 years old, then there’s indeed an elephant in the room! Moreover, the elephant might be bigger than we can chew, and other challenges might lurk in its shadow(s). Everything is a matter of perspective with no apparent unique answer. Thus, finding an acceptable solution might lurk in the shadow of the broader perspective, in the cumulated knowledge of the people experiencing the issues, respectively in some external guidance. Unfortunately, the guides can be as blind as we are, making limited or no important impact.

Sometimes, all it’s needed is a leap of faith corroborated with a set of tactics or strategies kept continuously in check, redirected as they seem fit based on the knowledge accumulated and the challenges ahead. It helps to be aware of how others approached the same issues. Unfortunately, there’s no answer that works for all! In this lies the challenge, in identifying what works and makes sense for us!

Previous Post <<||>> Next Post

Resources:
[1] Wikipedia (2024) Blind men and an elephant [link]

15 October 2024

🗄️Data Management: Data Governance (Part III: Taming the Complexity)

Data Management Series

The Chief Data Officer (CDO) or the “Head of the Data Team” is one of the most challenging jobs because is more of a "political" than a technical role. It requires the ideal candidate to be able to throw and catch curved balls almost all the time, and one must be able to play ball with all the parties having an interest in data (aka stakeholders). It’s a full-time job that requires the combination of management and technical skillsets, and both are important! The focus will change occasionally in one direction more than in the other, with important fluctuations.

Moreover, even if one masters the technical and managerial aspects, the combination of the two gives birth to situations that require further expertise – applied systems thinking being probably the most important. This, also because there are so many points of failure that it's challenging to address all the important causes. Therefore, it’s critical to be a system thinker, to have an experienced team and make use adequately of its experience!

In a complex word, in which even the smallest constraint or opportunity can have an important impact especially when it’s involved in the early stages of the processes taking place in organizations. It relies on the manager’s and team’s skillset, their inspiration, the way the business reacts to the tasks involved and probably many other aspects that make things work. It takes considerable effort until the whole mechanism works, and even more time to make things work efficiently. The best metaphor is probably the one of a small combat team in which everybody has their place and skillset in the mechanism, independently if one talks about strategy, tactics or operations.

Unfortunately, building such teams takes time, and the more people are involved, the more complex this endeavor becomes. The manager and the team must meet somewhere in the middle in what concerns the philosophy, the execution of the various endeavors, the way of working together to achieve the same goals. There are multiple forces pulling in all directions and it takes time until one can align the goals, respectively the effort.

The most challenging forces are the ones between the business and the data team, respectively the business and data requirements, forces that don’t necessarily converge. Working in small organizations, the two parties have in theory more challenges to overcome the challenges and a team’s experience can weight a lot in the process, though as soon the scale changes, the number of challenges to be overcome changes exponentially (there are however different exponential functions in which the basis and exponent make the growth rapid).

In big organizations can appear other parties that have the same force to pull the weight in one direction or another. Thus, the political aspects become more complex to the degree that the technologies must follow the political decisions, with all the positive and negative implications deriving from this. As comparison, think about the challenges from moving from two to three or more moving bodies orbiting each other, resulting in a chaotic dynamical system for most initial conditions.

Of course, a business’ context doesn’t have to create such complexity, though when things are unchecked, when delays in decision-making as well as other typical events occur, when there’s no structure, strategy, coordinated effort, or any other important components, the chances for chaotic behavior are quite high with the pass of time. This is just a model to explain real life situations that seem similar on the surface but prove to be quite complex when diving deeper. That’s probably why a CDO’s role as tamer of complexity is important and challenging!

Previous Post <<||>> Next Post

SQL Troubles

Pages

18 December 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part VII: Data Stores Comparison)

14 December 2024

🧭💹Business Intelligence: Perspectives (Part 21: Data Visualization Revised)

13 December 2024

🧭💹Business Intelligence: Perspectives (Part 20: From BI to AI)

12 December 2024

🧭💹Business Intelligence: Perspectives (Part 19: Data Visualization between Art, Pragmatism and Kitsch)

09 December 2024

🏭🗒️Microsoft Fabric: Microsoft Fabric [Notes]

🏭🗒️Microsoft Fabric: Delta Lake [Notes]

08 December 2024

🏭🗒️Microsoft Fabric: Shortcuts [Notes]

07 December 2024

🏭 💠Data Warehousing: Microsoft Fabric (Part VI: SQL Databases for OLTP scenarios) [new feature]

10 November 2024

🏭🗒️Microsoft Fabric: Data Mesh [Notes]

🏭🗒️Microsoft Fabric: Warehouse [Notes]

16 October 2024

🧭💹Business Intelligence: Perspectives (Part 18: There’s More to Noise)

𖣯Strategic Management: Strategic Perspectives (Part II: The Elephant in the Room)

15 October 2024

🗄️Data Management: Data Governance (Part III: Taming the Complexity)

About Me