SQL Troubles

22 December 2024

#️⃣Software Engineering: Mea Culpa (Part VI: A Look Back)

Software Engineering Series

Looking back at my university years, I'd say that there were three teachers, respectively courses, that made a considerable impact on students' lives. In the second year I learned Category Algebra, which, although it reflected past knowledge and its topics were too complex for most of us, provided us with an unprecedented layer of abstraction that showed us that Mathematics is not what we thought it to be!

The second course was related to Complex plane theory, a course in which the dean of the university at that time challenged our way of thinking about relatively basic concepts. There was a big gap between what we thought about Mathematics and what the subject proved to be. The course was taught in a post-university year together with a course on Relativity Theory in which, even if we didn't understand much of the concepts and theories, we saw for the first time (except for Graph theory) Mathematics applied to a broader context. Please don't misunderstand me! There were many other valuable teachers and courses, though these were the three courses that had the greatest impact on me!

During those times, we also attended courses on Fortran, Pascal, C++, HTML and even dBase, and, even if each programming language brought something new to the landscape, I can't say they changed how we thought about the world and problem solving (some of us had similar courses during the lyceum years). That's what, for example, SQL or more generally a database-related course brought, even if I had to wait for the first MOOCs to appear. Equally important was Scott E. Page's course on Model Thinking, which introduced the model-thinking approach, a structured way of thinking about models, with applicability to the theoretical and practical aspects of life.

These are the courses that anybody interested in programming and/or IT should attend! Of course, there are also courses on algorithms, optimization, linear and non-linear programming, and they bring an arsenal of concepts and techniques to think about, though, even if they might have a wide impact, I can't compare them with the courses mentioned above. A course should (ideally) change the way we think about the world to make a noticeable difference! The same goes for programming and theoretical concepts too!

Long after I graduated, I found many books and authors that I wished I had met earlier! Quotable Math reflects some of the writings I found useful, though now it seems already too late for those books to make a considerable impact! Conversely, it's never too late to find new ways to look at life, and this is what some books achieve! This is also a way of evaluating critically what we want to read or what is worth reading!

Of course, there are many courses, books or ideas out there, though if they haven't changed the way you think about life, directly or indirectly, are they worth attending, respectively reading? Conversely, if one hasn't found a new perspective brought by a topic, one has probably barely scratched the surface of the subject, regardless of whether we talk here about students or teachers. For some topics, it's probably too much to ask, though pragmatically speaking, that's the intrinsic value of what we learn!

That's a way to think about life and select the books worth reading! I know, many love reading for the sake of reading, though the value of a book, theory, story or other similar artifact should be judged especially by the impact it has on our way of thinking, respectively on our lives. Just a few ideas that are maybe worth reflecting upon...

21 December 2024

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part I: Creating a View)

At this year's Ignite conference it was announced that SQL databases are now available in Fabric in public preview (see SQL Databases for OLTP scenarios, [1]). To test the functionality one can import the SalesLT sample into a newly created empty database, which makes several tables available:

-- tables from SalesLT schema (queries should be run individually)
SELECT TOP 100 * FROM SalesLT.Address
SELECT TOP 100 * FROM SalesLT.Customer
SELECT TOP 100 * FROM SalesLT.CustomerAddress
SELECT TOP 100 * FROM SalesLT.Product
SELECT TOP 100 * FROM SalesLT.ProductCategory
SELECT TOP 100 * FROM SalesLT.ProductDescription 
SELECT TOP 100 * FROM SalesLT.ProductModel  
SELECT TOP 100 * FROM SalesLT.ProductModelProductDescription 
SELECT TOP 100 * FROM SalesLT.SalesOrderDetail
SELECT TOP 100 * FROM SalesLT.SalesOrderHeader

The schema seems to be slightly different than the schemas used in previous tests made in SQL Server, though with a few minor changes - mainly removing the fields not available - one can create the below view:

-- drop the view (cleaning step)

-- DROP VIEW IF EXISTS SalesLT.vProducts

-- create the view
CREATE OR ALTER VIEW SalesLT.vProducts
-- Products (view) 
AS 
SELECT ITM.ProductID 
, ITM.ProductCategoryID 
, PPS.ParentProductCategoryID 
, ITM.ProductModelID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, PPM.Name ProductModel 
, PPS.Name ProductSubcategory 
, PPC.Name ProductCategory  
, ITM.Color 
, ITM.StandardCost 
, ITM.ListPrice 
, ITM.Size 
, ITM.Weight 
, ITM.SellStartDate 
, ITM.SellEndDate 
, ITM.DiscontinuedDate 
, ITM.ModifiedDate 
FROM SalesLT.Product ITM 
     JOIN SalesLT.ProductModel PPM 
       ON ITM.ProductModelID = PPM.ProductModelID 
     JOIN SalesLT.ProductCategory PPS 
       ON ITM.ProductCategoryID = PPS.ProductCategoryID 
     JOIN SalesLT.ProductCategory PPC 
       ON PPS.ParentProductCategoryID = PPC.ProductCategoryID

-- review the data
SELECT top 100 *
FROM SalesLT.vProducts

The view uses INNER JOINs, presuming thus that a value was provided for each record. It's always a good idea to test such presumptions when creating the queries, and to check from time to time whether something changed. In some cases it's a good idea to always use LEFT JOINs, though this might have an impact on performance and probably other consequences as well.

-- check if all models are available
SELECT top 100 ITM.*
FROM SalesLT.Product ITM 
    LEFT JOIN SalesLT.ProductModel PPM 
       ON ITM.ProductModelID = PPM.ProductModelID 
WHERE PPM.ProductModelID IS NULL

-- check if all categories are available
SELECT top 100 ITM.*
FROM SalesLT.Product ITM 
    LEFT JOIN SalesLT.ProductCategory PPS 
        ON ITM.ProductCategoryID = PPS.ProductCategoryID 
WHERE PPS.ProductCategoryID IS NULL

-- check if all parent categories are available (categories without a parent are returned as well)
SELECT PPS.*
FROM SalesLT.ProductCategory PPS 
     LEFT JOIN SalesLT.ProductCategory PPC 
       ON PPS.ParentProductCategoryID = PPC.ProductCategoryID
WHERE PPC.ProductCategoryID IS NULL
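
A complementary check, sketched here under the assumption that the view SalesLT.vProducts was created as above, is to compare the number of records in the base table with the number of records returned by the view. If the two counts differ, at least one of the joins filters out products:

```sql
-- compare the number of products with the number of records returned by the view
-- (the column aliases are chosen only for illustration)
SELECT (SELECT COUNT(*) FROM SalesLT.Product) NoProducts
, (SELECT COUNT(*) FROM SalesLT.vProducts) NoViewRecords
```

The check doesn't tell which join is responsible for a difference, which is what the individual LEFT JOIN queries are for.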

Because the product categories have a hierarchical structure, it's a good idea to check the hierarchy as well:

-- check the hierarchical structure 
SELECT PPS.ProductCategoryId 
, PPS.ParentProductCategoryId 
, PPS.Name ProductCategory
, PPC.Name ParentProductCategory
FROM SalesLT.ProductCategory PPS 
     LEFT JOIN SalesLT.ProductCategory PPC 
       ON PPS.ParentProductCategoryID = PPC.ProductCategoryID
--WHERE PPC.ProductCategoryID IS NULL
ORDER BY IsNull(PPC.Name, PPS.Name)

This last query can be consolidated in its own view and the previous view changed, if needed.
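
As a minimal sketch of such a consolidation, the hierarchy query could be wrapped in a view like the one below. The name SalesLT.vProductCategories is hypothetical and not part of the SalesLT sample. The ORDER BY was left out, as the sorting belongs in the queries that consume the view:

```sql
-- create the view (hypothetical name, used only for illustration)
CREATE OR ALTER VIEW SalesLT.vProductCategories
-- Product categories with their parent (view)
AS
SELECT PPS.ProductCategoryID
, PPS.ParentProductCategoryID
, PPS.Name ProductCategory
, PPC.Name ParentProductCategory
FROM SalesLT.ProductCategory PPS
     LEFT JOIN SalesLT.ProductCategory PPC
       ON PPS.ParentProductCategoryID = PPC.ProductCategoryID
```

The LEFT JOIN keeps the categories without a parent, for which ParentProductCategory is NULL.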

One can then save all the code as a file.

Except for some small glitches in the editor, everything went smoothly.

Note:
During the various tests I got the following error message when trying to create a table:

"The external policy action 'Microsoft.Sql/Sqlservers/Databases/Schemas/Tables/Create' was denied on the requested resource."
At least in my case all I had to do was to select "SQL Database" instead of "SQL analytics endpoint". Check the top right dropdown below your user information.

Happy coding!

References:
[1] Microsoft Learn (2024) SQL database in Microsoft Fabric (Preview) [link]

18 December 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part VI: Data Stores Comparison)

Business Intelligence Series

Microsoft made available a reference guide for the data stores supported for Microsoft Fabric workloads [1], including the new Fabric SQL database (see previous post). Here's the consolidated table followed by a few aspects to consider:

Area	Lakehouse	Warehouse	Eventhouse	Fabric SQL database	Power BI Datamart
Data volume	Unlimited	Unlimited	Unlimited	4 TB	Up to 100 GB
Type of data	Unstructured, semi-structured, structured	Structured, semi-structured (JSON)	Unstructured, semi-structured, structured	Structured, semi-structured, unstructured	Structured
Primary developer persona	Data engineer, data scientist	Data warehouse developer, data architect, data engineer, database developer	App developer, data scientist, data engineer	AI developer, App developer, database developer, DB admin	Data scientist, data analyst
Primary dev skill	Spark (Scala, PySpark, Spark SQL, R)	SQL	No code, KQL, SQL	SQL	No code, SQL
Data organized by	Folders and files, databases, and tables	Databases, schemas, and tables	Databases, schemas, and tables	Databases, schemas, tables	Database, tables, queries
Read operations	Spark, T-SQL	T-SQL, Spark*	KQL, T-SQL, Spark	T-SQL	Spark, T-SQL
Write operations	Spark (Scala, PySpark, Spark SQL, R)	T-SQL	KQL, Spark, connector ecosystem	T-SQL	Dataflows, T-SQL
Multi-table transactions	No	Yes	Yes, for multi-table ingestion	Yes, full ACID compliance	No
Primary development interface	Spark notebooks, Spark job definitions	SQL scripts	KQL Queryset, KQL Database	SQL scripts	Power BI
Security	RLS, CLS**, table level (T-SQL), none for Spark	Object level, RLS, CLS, DDL/DML, dynamic data masking	RLS	Object level, RLS, CLS, DDL/DML, dynamic data masking	Built-in RLS editor
Access data via shortcuts	Yes	Yes	Yes	Yes	No
Can be a source for shortcuts	Yes (files and tables)	Yes (tables)	Yes	Yes (tables)	No
Query across items	Yes	Yes	Yes	Yes	No
Advanced analytics	Interface for large-scale data processing, built-in data parallelism, and fault tolerance	Interface for large-scale data processing, built-in data parallelism, and fault tolerance	Time Series native elements, full geo-spatial and query capabilities	T-SQL analytical capabilities, data replicated to delta parquet in OneLake for analytics	Interface for data processing with automated performance tuning
Advanced formatting support	Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format	Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format	Full indexing for free text and semi-structured data like JSON	Table support for OLTP, JSON, vector, graph, XML, spatial, key-value	Tables defined using PARQUET, CSV, AVRO, JSON, and any Apache Hive compatible file format
Ingestion latency	Available instantly for querying	Available instantly for querying	Queued ingestion, streaming ingestion has a couple of seconds latency	Available instantly for querying	Available instantly for querying

The table can be used as a map of what one needs to know for using each data store, respectively to identify how one can reuse previous experience, and here I'm referring to the many SQL developers. One must also consider the capabilities and limitations of each storage repository.

However, what I'm missing are some references regarding the performance of data access, especially compared with on-premises workloads. Moreover, the devil hides in the details, therefore one must test thoroughly before committing to any of the above choices. For the newest overview please check the referenced documentation!

For lakehouses, the hardest limitation is the lack of multi-table transactions, though that's understandable given their scope. However, probably the most important aspect is whether they can scale with the volume of reads/writes, as currently the SQL endpoint seems to lag.

The warehouse seems to be more versatile, though careful attention needs to be given to its design.

The Eventhouse opens the door to a wide range of time-based scenarios, though it will be interesting to see how developers cope with its lack of functionality in some areas.

Fabric SQL databases are a new addition, and hopefully they'll allow considering a wide range of OLTP scenarios.
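
To make the multi-table transactions row of the table more tangible, here is a minimal sketch of what such a transaction typically looks like in T-SQL. It assumes that a Fabric SQL database accepts the usual SQL Server transaction and error-handling syntax. The tables dbo.Orders and dbo.OrderLines and their columns are hypothetical:

```sql
-- hypothetical tables, used only for illustration
-- either both inserts are persisted or none of them
BEGIN TRY
    BEGIN TRANSACTION;

    INSERT INTO dbo.Orders (OrderId, CustomerId, OrderDate)
    VALUES (1, 42, '2024-12-18');

    INSERT INTO dbo.OrderLines (OrderId, LineNumber, ProductId, Quantity)
    VALUES (1, 1, 680, 2);

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- undo the partial work and re-raise the error
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
    THROW;
END CATCH
```

Whether the same pattern behaves identically in a warehouse should be tested, as the data stores differ in the T-SQL surface they support.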

Power BI datamarts have been in preview for a couple of years.

References:
[1] Microsoft Fabric (2024) Microsoft Fabric decision guide: choose a data store [link]

[2] Reitse's blog (2024) Testing Microsoft Fabric Capacity: Data Warehouse vs Lakehouse Performance [link]

14 December 2024

🧭💹Business Intelligence: Perspectives (Part XXI: Data Visualization Revised)

Data Visualization Series

Creating data visualizations has nowadays become so easy that anybody can do it with a minimum of effort and knowledge, which on one side is great for the creators but can easily become a nightmare for the readers, respectively users. Just dumping data into visuals can barely be called data visualization, even if the result is considered as such. The problems of visualization are multiple – the lack of data culture, the lack of understanding of processes, data and their characteristics, the inability to define and model problems, the failure to educate the users and manage their expectations, etc.

There are many books on data visualization, though they seem an expensive commodity for the ones who want rapid enlightenment, and often the illusion of knowing proves to be a barrier. It's also true that many sets of data are so dull that the lack of information and meaning is compensated for by adding elements that give a kitsch look-and-feel (aka chartjunk), shifting the attention from the valuable elements to decorations. So, how do we overcome the various challenges?

Probably, the most important step when visualizing data is to define the primary purpose of the end product. Is it to inform, to summarize or to navigate the data, to provide different perspectives at macro and micro level, to help discovery, to explore, to sharpen the questions, to make people think, respectively understand, to carry a message, to be artistic or represent reality truthfully, or is it maybe just a filler or a point of attraction in textual content?

Clarifying the initial purpose is important because it makes the motives and expectations explicit upfront, allowing one to determine the further requirements and characteristics, and maybe to set some limits concerning the time spent and the quantitative and/or qualitative criteria upon which the end result should eventually be evaluated. Narrowing down such aspects helps in planning and in performing the further steps.

Many of the steps are repetitive and past experience can help reduce the overall effort. Therefore, professionals in the field, driven by intuition and experience probably don't always need to go through the full extent of the process. Conversely, what is learned and done poorly, has high chances of delivering poor quality.

A visualization can be considered as effective when it serves the intended purpose(s), when it reveals with minimal effort the patterns, issues or facts hidden in the data, when it allows people to explore the data, ask questions and find answers altogether. One can talk also about efficiency, especially when readers can see at a glance the many aspects encoded in the visualization. However, the more the discovery process is dependent on data navigation via filters or other techniques, the more difficult it becomes to talk about efficiency.

Better criteria for judging visualizations are whether they are meaningful and useful for the readers, and whether the readers understand the authors' intent and the further intrinsic implications, though multiple characteristics can be associated with these criteria: clarity, specificity, correctness, truthfulness, appropriateness, simplicity, etc. All these are important to a lower or higher degree depending on the broader context of the visualization.

All these must be weighed in the bigger picture when creating visualizations, though there are probably also exceptions, especially on the artistic side, where artists can cut corners to create an artistic effect, though here too the authors need to be truthful to the data and make sure that their work doesn't distort the facts excessively. Failing to do so might not have an important impact in the short term, though in time the effects can ripple with unexpected consequences.

13 December 2024

🧭💹Business Intelligence: Perspectives (Part XX: From BI to AI)

Business Intelligence Series

No matter how good data visualizations, reports or other forms of BI artifacts are, they only serve a set of purposes for a limited amount of time, limited audience or any other factors that influence their lifespan. Sooner or later the artifacts become thus obsolete, being eventually disabled, archived and/or removed from the infrastructure.

Many artifacts require a considerable amount of resources for their creation and maintenance over time. Sometimes the costs can be considerably higher than the benefits brought, especially when the data or the infrastructure are used for a narrow scope, though there can be other components that need to be considered in the bigger picture. Having a report or visualization one can use when needed can have an important impact on the business in correcting issues, seizing opportunities or filling the knowledge gaps.

Even if it’s challenging to quantify the costs associated with the loss of opportunities rooted in the lack of data, respectively information, the amounts can be considerably high, greater even than the cost of building a whole BI infrastructure. An organization’s agility in addressing the important gaps can make a considerable difference, at least in theory. Having resources that can be pulled on demand can give organizations the needed competitive boost. Internal or external resources can be used altogether, though, pragmatically speaking, there will always be a gap between the demand and supply of knowledgeable resources.

The gap in BI artifacts can nowadays be addressed by AI-driven tools, which have the theoretical potential of shortening the gap between needs and the availability of solutions, respectively a set of answers that can be used in the process. Of course, the processes of sense-making and discovery are not as simple as we’d like, though it’s a considerable step forward.

Having the possibility of asking questions in natural language and guiding the exploration process to create visualizations and other artifacts using prompt engineering and other AI-enabled methods offers new possibilities and opportunities that at least some organizations have already started exploring. This however presumes the existence of an infrastructure on which the needed foundation can be built, the knowledge required to bridge the gap, respectively the resources required in the process.

It must be stressed that the exploration processes may bring no noticeable benefits, at least not immediately, and the whole process depends on organizations’ capabilities of identifying and seizing the respective opportunities. Therefore, even if there are recipes for success, each organization must identify what matters and how to use technologies and the available infrastructure to bridge the gap.

Ideally to make progress organizations need besides the financial resources the required skillset, a set of projects that support learning and value creation, respectively the design and execution of a business strategy that addresses the steps ahead. Each of these aspects implies risks and opportunities altogether. It will be a test of maturity for many organizations. It will be interesting to see how many organizations can handle the challenge, respectively how much past successes or failures will weigh in the balance.

AI offers a set of capabilities and opportunities, however the chance of exploring and failing fast is of great importance. AI is an enabler and not a magic wand, no matter what is preached in technical workshops! Even if progress follows an exponential trajectory, it took us more than half a century from the first steps until now, and probably many challenges must still be overcome.

The future looks interesting enough to be pursued, though are organizations capable of seizing the opportunities, respectively of overcoming the challenges ahead? Are organizations capable of supporting the effort without neglecting the other priorities?

12 December 2024

🧭💹Business Intelligence: Perspectives (Part XIX: Data Visualization between Art, Pragmatism and Kitsch)

Business Intelligence Series

The data visualizations (aka dataviz) presented in the media, especially the ones coming from graphical artists, have the power to help us develop what is called graphical intelligence, graphical culture, graphical sense, etc., though without a tutor-like experience the process is suboptimal because it depends on our ability to identify what is important and which steps are needed for decoding and interpreting such work, respectively for integrating its messages into our overall understanding of the world.

When such a skillset is lacking, without explicit annotations or other forms of support, the reader might misinterpret or fail to observe important visual cues even for simple visualizations, with all the implications deriving from this – a false understanding and the further consequences deriving from it, this being probably the most important aspect to consider. Unfortunately, even the most elaborate work can fail if the reader doesn’t have a basic understanding of all that’s implied in the process.

The books of Willard Brinton, Anna Rogers, Jacques Bertin, William Cleveland, Leland Wilkinson, Stephen Few, Alberto Cairo, Scott Berinato and many others can help the readers build a general understanding of the dataviz process and how data visualizations or simple graphics can be used effectively or misused, though each reader must follow his/her own journey. It’s also true that the basics can be easily learned, though the deeper one dives, the more interesting and nontrivial the journey becomes. Fortunately, the average reader can stick to the basics and many visualizations are simple enough to be understood.

To grasp the full extent of the implications, one can make comparisons with the domain of poetry, where the author uses basic constructs like metaphor, comparisons, rhythm and epithets to create, communicate and imprint in the reader’s mind old and new meanings, images and feelings altogether. Artistic data visualizations tend to offer a similar charge as poetry does, even if the impact might not appeal so much to our artistic sensibility. Thus, from this perspective dataviz is, or at least resembles, an art form.

Many people can write verses, though only a fraction can write good meaningful poetry, of which a smaller fraction get poems published, and even fewer get books published. Conversely, not everything can be expressed in verses unless one finds good metaphors and other aspects that can be leveraged in the process. The same can be said about good dataviz.

One can argue that in dataviz the author can explore and learn especially by failing fast (seeing what works and what doesn’t). One can also innovate, though the creator has probably a limited set of tools and rules for communication. Enabling readers to see the obvious or the hidden in complex visualizations or contexts requires skill and some kind of mastery of the visual form.

Therefore, dataviz must be more pragmatic and show the facts. In art one has the freedom to distort or move things around to create new meanings, while in dataviz it’s important for the meaning to be rooted in 'truth', at least by definition. The more the creator of a dataviz innovates, the higher the chances of being misunderstood. Moreover, readers need to be educated in interpreting the new meanings and get used to their continuous use.

Kitsch is a term applied to art and design that is perceived as naïve imitation to the degree that it becomes a waste of resources even if somebody pays the price tag. There’s a trend in dataviz to add elements to visualizations that don’t bring any intrinsic value – images, colors and other elements can be misused to the degree that the result resembles kitsch, and the overall value of the visualization is diminished considerably.

09 December 2024

🏭🗒️Microsoft Fabric: Microsoft Fabric [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 8-Dec-2024

Microsoft Fabric

{goal} complete (end-to-end) analytics platform [6]

{characteristic} unified

{objective} provides a single, integrated environment for all the organization

{benefit} data professionals and the business users can collaborate on data projects [5] and solutions

{characteristic} serverless SaaS model (aka SaaS-ified)

{objective} provisioned automatically with the tenant [6]
{objective} highly scalable [5]
{objective} cost-effectiveness [5]
{objective} accessible

⇐ from anywhere with an internet connection [5]

{objective} continuous updates

⇐ provided by Microsoft

{objective} continuous maintenance

⇐ provided by Microsoft

provides a set of integrated services that enable to ingest, store, process, and analyze data in a single environment [5]

{objective} secure
{objective} governed

{goal} lake-centric

{characteristic} OneLake-based

all workloads automatically store their data in the OneLake workspace folders [6]
all the data is organized in an intuitive hierarchical namespace [6]
data is automatically indexed [6]
provides a set of features

discovery
MIP labels
lineage
PII scans
sharing
governance
compliance

{characteristic} one copy

available for all computes
all compute engines store their data automatically in OneLake

⇐ the data is stored in a (single) common format

⇐ delta parquet file format

open standards format
the storage format for all tabular data in Microsoft Fabric

⇐ the data is directly accessible by all the engines [6]

⇐ no import/export needed

all compute engines are fully optimized to work with Delta Parquet as their native format [6]
a shared universal security model is enforced across all the engines [6]

{characteristic} open at every tier

{goal} empowering

{characteristic} intuitive
{characteristic} built into M365
{characteristic} insight to action

{goal} AI-powered

{characteristic} Copilot accelerated
{characteristic} ChatGPT enabled
{characteristic} AI-driven insights

complete analytics platform

addresses the needs of all data professionals and business users who target harnessing the value of data

{feature} scales automatically

the system automatically allocates an appropriate number of compute resources based on the job size
the cost is proportional to total resource consumption, rather than size of cluster or number of resources allocated
jobs in general complete faster (and usually, at less overall cost)

⇒ no need to specify cluster sizes

natively supports

Spark
data science
log-analytics
real-time ingestion and messaging
alerting
data pipelines, and
Power BI reporting
interoperability with third-party services

from other vendors that support the same open

data virtualization mechanisms

{feature} mirroring [notes]
{feature} shortcuts [notes]

allow users to reference data without copying it
{benefit} make other domain data available locally without the need for copying data

{feature} tenant (aka Microsoft Fabric tenant, MF tenant)

a single instance of Fabric for an organization that is aligned with a Microsoft Entra ID
can contain any number of workspaces

{feature} workspaces

{definition} a collection of items that brings together different functionality in a single environment designed for collaboration
associated with a domain [3]

{feature} domains [notes]

{definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]
subdomains

a way of fine-tuning the logical grouping of data under a domain [1]

subdivisions of a domain

Acronyms:
API - Application Programming Interface
M365 - Microsoft 365
MF - Microsoft Fabric
PII - Personally Identifiable Information
SaaS - software-as-a-service

Resources:
[1] Microsoft Learn (2023) Administer Microsoft Fabric [link]
[2] Microsoft Learn: Fabric (2024) Governance overview and guidance [link]
[3] Microsoft Learn: Fabric (2023) Fabric domains [link]
[4] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam [link]
[5] Microsoft Learn: Fabric (2024) Introduction to end-to-end analytics using Microsoft Fabric [link]
[6] Microsoft Fabric (2024) Fabric Analyst in a Day [course notes]

🏭🗒️Microsoft Fabric: Delta Lake [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!
Last updated: 1-Apr-2024

Delta Lake

{definition} an optimized open source storage layer that runs on top of a data lake [1]

the default storage format in a Fabric lakehouse [1]
stores data in Parquet file format
is a variant of log-structured files
initially developed at Databricks
fully compatible with Apache Spark APIs [1]

{characteristic} high reliability
{characteristic} secure
{characteristic} performant

provides low latency

{feature} data indexing

indexes are created and maintained on the ingested data [1]

increases the querying speed significantly [1]

{feature} data skipping

file statistics are maintained so that data subsets relevant to the query are used instead of entire tables - this partition pruning avoids processing data that is not relevant to the query [1]
helps complex queries to read only the relevant subsets to fulfil the query [1]

{feature} multidimensional clustering

uses the Z-ordering algorithm
enables data skipping

{feature} compaction

compacts or combines multiple small files into more efficient larger ones [1]

speeds up query performance

storing and accessing small files can be processing-intensive, slow and inefficient from a storage utilization perspective [1]

{feature} data caching

highly accessed data is automatically cached to speed access for queries

{feature} ACID transactions

"all or nothing" ACID transaction approach is employed to prevent data corruption

⇐ partial or failed writes risk corrupting the data [1]

{feature} snapshot isolation (aka SI)

ensures that multiple writers can write to a dataset simultaneously without interfering with jobs reading the dataset [1]

{feature} schema enforcement

data can be stored using a schema
{benefit} helps ensure data integrity for ingested data by providing schema enforcement [1]

potential data corruption with incorrect or invalid schemas is avoided [1]

{feature} checkpointing

employed to provide robust exactly-once delivery semantics [1]
{benefit} ensures that data is neither missed nor repeated erroneously [1]

{feature} UPSERTS and DELETES support

provide a more convenient way of dealing with such changes [1]

{feature} unified streaming and batch data processing

both batch and streaming data are handled via a direct integration with Structured Streaming for low latency updates [1]

{benefit} simplifies the system architecture [1]
{benefit} results in shorter time from data ingest to query result [1]

can concurrently write batch and streaming data to the same data table [1]

{feature} schema evolution

schema is inferred from input data
{benefit} reduces the effort for dealing with schema impact of changing business needs at multiple levels of the pipeline/data stack [1]

{feature} scalable metadata handling
{feature} predictive optimization

removes the need to manually manage maintenance operations for delta tables [8]
{enabled} automatically identifies tables that would benefit from maintenance operations, and then optimizes their storage

{feature} historical retention

{default} maintains a history of all changes made [4]
{benefit} enhanced regulatory compliance and audit
{recommendation} keep historical data only for a certain period of time to reduce storage costs [4]

{feature} time travel

{benefit} support for data rollback
{benefit} lets users query point-in-time snapshots [5]
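
Conceptually, point-in-time snapshots can be reconstructed by replaying an ordered log of changes up to the requested version. The Python sketch below models that idea with a list of add/remove actions; the log layout, the file names and the `snapshot_at` function are assumptions for illustration and much simpler than the real transaction log.

```python
def snapshot_at(log, version):
    """Replay add/remove actions up to and including `version`; return the live files."""
    live = set()
    for entry_version, action, path in log:   # the log is assumed to be ordered by version
        if entry_version > version:
            break
        if action == "add":
            live.add(path)
        elif action == "remove":
            live.discard(path)
    return live


# hypothetical log: version 2 rewrites a.parquet into c.parquet
log = [
    (0, "add", "a.parquet"),
    (1, "add", "b.parquet"),
    (2, "remove", "a.parquet"),
    (2, "add", "c.parquet"),
]

print(sorted(snapshot_at(log, 1)))   # table as of version 1
print(sorted(snapshot_at(log, 2)))   # latest table state
```

A rollback then amounts to making an older snapshot the current one again, which works only as long as the files it references haven't been cleaned up.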

{best practice} all writes and reads should go through Delta Lake [1]

{benefit} ensure consistent overall behavior [1]

{best practice} run OPTIMIZE regularly

{exception} should not be run on base or staging tables [1]

{best practice} run VACUUM regularly

cleans up expired snapshots that are no longer required [1]

{best practice} use MERGE INTO to batch changes

{benefit} allows efficiently rewriting queries to implement updates to archived data and compliance workflows [5]
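
The semantics of batching changes with a merge can be sketched in plain Python: each change either updates a matching row, inserts a new one, or deletes the row when it is flagged accordingly. The `merge_into` function, the `id` key and the `_deleted` flag are hypothetical names used only for this example; in practice the operation is expressed with the MERGE INTO statement.

```python
def merge_into(target, changes, key="id"):
    """Apply a batch of changes to the target rows, matched on `key`.

    A change flagged with `_deleted` removes the matching row;
    any other change updates the matching row or inserts a new one.
    """
    merged = {row[key]: row for row in target}
    for change in changes:
        if change.get("_deleted"):
            merged.pop(change[key], None)
        else:
            merged[change[key]] = {k: v for k, v in change.items() if k != "_deleted"}
    return sorted(merged.values(), key=lambda row: row[key])


target = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
changes = [
    {"id": 2, "name": "B2"},          # matched: update
    {"id": 3, "name": "C"},           # not matched: insert
    {"id": 1, "_deleted": True},      # matched and flagged: delete
]

result = merge_into(target, changes)
print(result)
```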

{best practice} use DELETE commands

{benefit} ensures proper progression of the change [1]
{warning} manually deleting files from the underlying storage is likely to break the table [1]


Acronyms:

ACID - atomicity, consistency, isolation, durability

References:
[1] Azure Databricks (2023) Delta Lake on Azure Databricks
[2] Josep Aguilar-Saborit et al, POLARIS: The Distributed SQL Engine in Azure Synapse, PVLDB, 13(12), 2020 (link)
[3] Josep Aguilar-Saborit et al, Extending Polaris to Support Transactions 2024
[4] Implement medallion lakehouse architecture in Microsoft Fabric (link)
[5] Michael Armbrust et al (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment, 13(12) (link)
[6] Bennie Haelen & Dan Davis (2024) Delta Lake: Up and Running Modern Data Lakehouse Architectures with Delta Lake

08 December 2024

🏭🗒️Microsoft Fabric: Shortcuts [Notes]

Last updated: 8-Dec-2024

[Microsoft Fabric] Shortcut

{def} object that points to other internal or external storage location (aka shortcut) [1] and that can be used for data access

{goal} unifies existing data without copying or moving it [2]

⇒ data can be used multiple times without being duplicated [2]
{benefit} helps to eliminate edge copies of data [1]
{benefit} reduces process latency associated with data copies and staging [1]

is a mechanism that allows to unify data across domains, clouds, and accounts through a namespace [1]

⇒ allows creating a single virtual data lake for the entire enterprise [1]
⇐ available in all Fabric experiences [1]
⇐ behave like symbolic links [1]

independent object from the target [1]
appear as folder [1]
can be used by workloads or services that have access to OneLake [1]
transparent to any service accessing data through the OneLake API [1]

can point to

OneLake locations
ADLS Gen2 storage accounts
Amazon S3 storage accounts
Dataverse
on-premises or network-restricted locations via OPDG

{capability} create shortcut to consolidate data across artifacts or workspaces, without changing data's ownership [2]
{capability} data can be composed throughout OneLake without any data movement [2]
{capability} allow instant linking of data already existing in Azure and in other clouds, without any data duplication and movement [2]

⇐ makes OneLake the first multi-cloud data lake [2]

{capability} provides support for industry standard APIs

⇒ OneLake data can be directly accessed via shortcuts by any application or service [2]

{operation} creating a shortcut

can be created in

lakehouses
KQL databases

⇐ shortcuts are recognized as external tables [1]

can be created via

Fabric UI
REST API

can be created across items [1]

the item types don't need to match [1]

e.g. create a shortcut in a lakehouse that points to data in a data warehouse [1]

[lakehouse] tables folder

represents the managed portion of the lakehouse

shortcuts can be created only at the top level [1]

⇒ shortcuts aren't supported in other subdirectories [1]

if shortcut's target contains data in the Delta\Parquet format, the lakehouse automatically synchronizes the metadata and recognizes the folder as a table [1]

[lakehouse] files folder

represents the unmanaged portion of the lakehouse [1]
there are no restrictions on where shortcuts can be created [1]

⇒ can be created at any level of the folder hierarchy [1]
⇐ table discovery doesn't happen in the Files folder [1]

{operation} renaming a shortcut
{operation} moving a shortcut
{operation} deleting a shortcut

doesn't affect the target [1]

⇐ only the shortcut object is deleted [1]

⇐ the shortcut target remains unchanged [1]

shortcuts don't perform cascading deletes [1]
moving, renaming, or deleting a target path can break the shortcut [1]

{operation} delete file/folder

file or folder within a shortcut can be deleted when the permissions in the shortcut target allows it [1]

{permissions} users must have permissions in the target location to read the data [1]

when a user accesses data through a shortcut to another OneLake location, the identity of the calling user is used to authorize access to the data in the target path of the shortcut [1]
when accessing shortcuts through Power BI semantic models or T-SQL, the calling user’s identity is not passed through to the shortcut target [1]

the calling item owner’s identity is passed instead, delegating access to the calling user [1]

OneLake manages all permissions and credentials

{feature} shortcut caching

{def} mechanism used to reduce egress costs associated with cross-cloud data access [1]

when files are read through an external shortcut, the files are stored in a cache for the Fabric workspace [1]

subsequent read requests are served from cache rather than the remote storage provider [1]
cached files have a retention period of 24 hours
each time the file is accessed the retention period is reset [1]
if the file in remote storage provider is more recent than the file in the cache, the request is served from remote storage provider and the updated file will be stored in cache [1]
if a file hasn’t been accessed for more than 24hrs it is purged from the cache [1]

{restriction} individual files greater than 1GB in size are not cached [1]
{restriction} only GCS, S3 and S3 compatible shortcuts are supported [1]
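
The caching rules above can be simulated with a small Python class. This is only a sketch of the described behaviour, not of how the cache is implemented: the class and method names are hypothetical, timestamps are plain seconds, and 1 GB is assumed to mean 1024³ bytes.

```python
from dataclasses import dataclass

RETENTION_SECONDS = 24 * 60 * 60    # 24-hour retention period from the notes
MAX_CACHED_BYTES = 1024 ** 3        # assumption: "1 GB" taken as 1024^3 bytes


@dataclass
class CacheEntry:
    modified_at: float    # last-modified time of the cached copy
    expires_at: float     # purge time unless the file is accessed again


class ShortcutCache:
    """Hypothetical model of the caching rules for external shortcuts."""

    def __init__(self):
        self.entries = {}

    def read(self, path, size_bytes, remote_modified_at, now):
        """Return 'cache' or 'remote', depending on where the read is served from."""
        entry = self.entries.get(path)
        if entry and now > entry.expires_at:
            del self.entries[path]            # not accessed for more than 24 hours
            entry = None
        if entry and entry.modified_at >= remote_modified_at:
            entry.expires_at = now + RETENTION_SECONDS    # each access resets the retention
            return "cache"
        # cache miss or the remote file is more recent: serve remotely and refresh the cache
        if size_bytes <= MAX_CACHED_BYTES:
            self.entries[path] = CacheEntry(remote_modified_at, now + RETENTION_SECONDS)
        else:
            self.entries.pop(path, None)      # large files are never cached
        return "remote"


cache = ShortcutCache()
first = cache.read("sales.parquet", 1_000, remote_modified_at=0, now=0)
second = cache.read("sales.parquet", 1_000, remote_modified_at=0, now=3_600)
print(first, second)
```

The first read goes to the remote storage provider and fills the cache, while the second one, an hour later, is served from the cache and resets the retention period.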

{limitation} maximum number of shortcuts [1]

per Fabric item: 100,000
in a single OneLake path: 10
direct shortcuts to shortcut links: 5

{limitation} ADLS and S3 shortcut target paths can't contain any reserved characters from RFC 3986 section 2.2 [1]
{limitation} shortcut names, parent paths, and target paths can't contain "%" or "+" characters [1]
{limitation} shortcuts don't support non-Latin characters [1]
{limitation} Copy Blob API not supported for ADLS or S3 shortcuts [1]
{limitation} copy function doesn't work on shortcuts that directly point to ADLS containers

{recommended} create ADLS shortcuts to a directory that is at least one level below a container [1]

{limitation} additional shortcuts can't be created inside ADLS or S3 shortcuts [1]
{limitation} lineage for shortcuts to Data Warehouses and Semantic Models is not currently available [1]
{limitation} it may take up to a minute for the Table API to recognize new shortcuts [1]
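
Some of the limitations above can be checked before attempting to create a shortcut. The following Python sketch covers only the rule about the "%" and "+" characters; the `check_shortcut` helper and the sample values are hypothetical, and the other limitations (reserved characters, non-Latin characters, maximum counts) are deliberately left out.

```python
FORBIDDEN_CHARS = {"%", "+"}    # not allowed in shortcut names, parent paths and target paths


def check_shortcut(name, parent_path, target_path):
    """Return a list of problems found; an empty list means the character rule is met."""
    problems = []
    parts = (("name", name), ("parent path", parent_path), ("target path", target_path))
    for label, value in parts:
        found = sorted(FORBIDDEN_CHARS & set(value))
        if found:
            problems.append(f"{label} contains {''.join(found)}")
    return problems


print(check_shortcut("sales_2024", "Files/raw", "landing/sales"))
print(check_shortcut("sales+2024", "Files/raw", "landing/100%"))
```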


References:
[1] Microsoft Fabric (2024) OneLake shortcuts [link]
[2] Microsoft Fabric (2024) Fabric Analyst in a Day [course notes]

Acronyms:

ADLS - Azure Data Lake Storage

AWS - Amazon Web Services

GCS - Google Cloud Storage

KQL - Kusto Query Language

OPDG - on-premises data gateway

07 December 2024

🏭 💠Data Warehousing: Microsoft Fabric (Part IV: SQL Databases for OLTP scenarios) [new feature]

Data Warehousing Series

One of the interesting announcements at Ignite is the availability in public preview of SQL databases in Microsoft Fabric, "a versatile and developer-friendly transactional database built on the foundation of Azure SQL database". With this, Fabric can address OLTP scenarios besides OLAP ones, evolving thus from an analytics platform to a data platform [1]. According to the announcement, besides the AI-optimized architectural aspects, the feature makes the SQL Azure simple, autonomous and secure by design [1], and these latter aspects are considered in this post.

Simplicity revolves around the deployment and configuration of databases: creating a new database requires only a name, and the database is created in seconds [1]. It’s a considerable improvement compared with the relatively complex setup needed for on-premises configurations, though sometimes more flexibility in configuration is needed upfront or over the database’s lifetime. To get a database ready for testing one can import a sample database or get specific data via data flows and/or pipelines [1]. As development tools one can use Visual Studio Code or SSMS [1], and probably more tools will be available in time.

The integration with both GitHub and Azure DevOps allows to configure each database under source control, which is needed for many scenarios especially when multiple resources make changes to the database objects [1]. Frankly, that’s mainly important during the development phase, respectively in scenarios in which multiple people make in parallel changes to the logic. It will be interesting to see how much overhead or challenges the feature adds to development and how smoothly everything works together!

The most important aspect for many solutions is the replication of data in near-real time to the (open-source) delta parquet format in OneLake and thus making the data available for analytics almost immediately [1]. Probably, from this aspect many cloud-based applications can benefit, even if the performance might not be as good as in other well-established architectures. However, there are many other scenarios in which one needs to maintain and use data for OLTP/OLAP purposes. This invites adequate testing and a good weighting of the advantages and disadvantages involved.

A SQL database is a native item in Fabric, and therefore it utilizes Fabric capacity units like other Fabric workloads [1]. One can use the Fabric SKU estimator (still in private preview) to estimate the costs [2], though it will be interesting to see how cost-effective the solutions are. Probably, especially when the infrastructure is already available outside of Fabric, it will be easier and cost-effective to use the mirroring functionality. One should test and have a better estimator before moving blindly from the existing infrastructure to Fabric.

SQL databases in Fabric are autonomous by design, while allowing to get the best performance and availability by default [1]. High availability is reached through zone redundancy, while performance is achieved by scaling automatically the storage and compute to accommodate the workloads [1]. The auto-optimization capability is achieved with the help of the latest Intelligent Query Processing (IQP) enhancements, respectively the creation of missing indexes to improve query performance [1]. It will be interesting to see how the whole process works, given that the maintenance of indexes usually involves some challenges (e.g. identifying covering indexes, indexes needed only for temporary workloads, duplicated indexes).

SQL databases in Fabric are automatically configured for high availability with zone redundancy, while storage and compute scale automatically to accommodate the user workload [1]. The database is auto-optimized through the latest IQP enhancements while the system creates any missing indexes to improve query performance. All data is replicated to OneLake by default [1]. Finally, the database always receives the latest security updates with auto-patching, while automatic backups help in disaster recovery scenarios [1], which can be of real help for database administrators.

References:

[1] Microsoft Fabric Updates Blog (2024) Announcing SQL database in Microsoft Fabric Public Preview [link]

[2] Microsoft Fabric Updates Blog (2024) Announcing New Recruitment for the Private Preview of Microsoft Fabric SKU Estimator [link]

10 November 2024

🏭🗒️Microsoft Fabric: Data Mesh [Notes]

Last updated: 23-May-2024

[Microsoft Fabric] Data Mesh

{definition} a type of decentralized data architecture that organizes data based on different business domains [2]

⇐ a centrally managed network of decentralized data products

{concept} landing zone

typically a subscription that needs to be governed by a common policy [7]

{downside} creating one landing zone for every project can lead to too many landing zones to manage

{alternative} landing zones based on a business domain [7]

resources must be managed efficiently in a way that each team is given access to only their resources [7]

⇐ shared resources might be needed with separate management and common access to all [7]

need to be linked together into a mesh

via peer-to-peer networks

{concept} connectivity hub
{feature} resource group

{definition} a container that holds related resources for an Azure solution
can be associated with a data product

when the data product becomes obsolete, the resource group can be deleted [7]

{feature} subscription

{definition} a logical unit of Azure services that are linked to an Azure account
can be associated as a landing zone governed by a policy [7]

{feature} tenant (aka Microsoft Fabric tenant, MF tenant)

a single instance of Fabric for an organization that is aligned with a Microsoft Entra ID
can contain any number of workspaces

{feature} workspaces

{definition} a collection of items that brings together different functionality in a single environment designed for collaboration
associated with a domain [3]

{feature} domains

{definition} a way of logically grouping together data in an organization that is relevant to a particular area or field [1]
some tenant-level settings for managing and governing data can be delegated to the domain level [2]

{feature} subdomains

a way of fine-tuning the logical grouping of data under a domain [1]
subdivisions of a domain

{concept} deployment template


References

[1] Microsoft Learn: Fabric (2023) Fabric domains (link)

[2] Establishing Data Mesh architectural pattern with Domains and OneLake on Microsoft Fabric, by Maheswaran Arunachalam (link)

[3] Data mesh: A perspective on using Azure Synapse Analytics to build data products, by Amanjeet Singh (link)

[4] Zhamak Dehghani (2021) Data Mesh: Delivering Data-Driven Value at Scale

[5] Marthe Mengen (2024) How do you set up your Data Mesh in Microsoft Fabric? (link)

[6] Administering Microsoft Fabric - Considering Data Products vs Domains vs Workspaces, by Paul Andrew (link)

[7] Aniruddha Deswandikar (2024) Engineering Data Mesh in Azure Cloud

🏭🗒️Microsoft Fabric: Data Warehouse [Notes]

Last updated: 11-Mar-2024

Warehouse vs SQL analytics endpoint in Microsoft Fabric [3]

[Microsoft Fabric] Data Warehouse

highly available relational data warehouse that can be used to store and query data in the Lakehouse

supports the full transactional T-SQL capabilities
modernized version of the traditional data warehouse

unifies capabilities from Synapse Dedicated and Serverless SQL Pools
modernized with key improvements
resources are managed elastically to provide the best possible performance

⇒ no need to think about indexing or distribution
a new parser gives enhanced CSV file ingestion time
metadata is now cached in addition to data
improved assignment of compute resources to milliseconds
multi-TB result sets are streamed to the client

leverages a distributed query processing engine

provides workloads with a natural isolation boundary [3]

true isolation is achieved by separating workloads with different characteristics, ensuring that ETL jobs never interfere with their ad hoc analytics and reporting workloads [3]

{operation} data ingestion

involves moving data from source systems into the data warehouse [2]

the data becomes available for analysis [1]

via Pipelines, Dataflows, cross-database querying, COPY INTO command
no need to copy data from the lakehouse to the data warehouse [1]

one can query data in the lakehouse directly from the data warehouse using cross-database querying [1]

{operation} data storage

involves storing the data in a format that is optimized for analytics [2]

{operation} data processing

involves transforming the data into a format that is ready for consumption by analytical tools [1]

{operation} data analysis and delivery

involves analyzing the data to gain insights and delivering those insights to the business [1]

{operation} designing a warehouse (aka warehouse design)

standard warehouse design can be used

{operation} sharing a warehouse (aka warehouse sharing)

a way to provide users read access to the warehouse for downstream consumption

via SQL, Spark, or Power BI

the level of permissions can be customized to provide the appropriate level of access

{feature} mirroring

provides a modern way of accessing and ingesting data continuously and seamlessly from any database or data warehouse into the Data Warehousing experience in Fabric

any database can be accessed and managed centrally from within Fabric without having to switch database clients
data is replicated in a reliable way in real-time and lands as Delta tables for consumption in any Fabric workload

{concept} SQL analytics endpoint

a warehouse that is automatically generated from a Lakehouse in Microsoft Fabric [3]

{concept} virtual warehouse

can contain data from virtually any source by using shortcuts [3]

{concept} cross database querying

makes it possible to quickly and seamlessly leverage multiple data sources for fast insights and with zero data duplication [3]


References:

[1] Microsoft Learn: Fabric (2023) Get started with data warehouses in Microsoft Fabric (link)
[2] Microsoft Learn: Fabric (2023) Microsoft Fabric decision guide: choose a data store (link)
[3] Microsoft Learn: Fabric (2024) What is data warehousing in Microsoft Fabric? (link)
[4] Microsoft Learn: Fabric (2023) Better together: the lakehouse and warehouse (link)

Resources:
[1] Microsoft Learn: Fabric (2023) Data warehousing documentation in Microsoft Fabric (link)

16 October 2024

🧭💹Business Intelligence: Perspectives (Part XVIII: There’s More to Noise)

Business Intelligence Series

Visualizations should be built with an audience's characteristics in mind! Depending on the case, it might be sufficient to show only values or labels of importance (minima, maxima, inflexion points, exceptions, trends), while other times it might be necessary to show all or most of the values to provide an accurate extended perspective. It might even be useful to allow users to switch between the different perspectives to reduce the clutter when navigating the data or to look at the patterns revealed by the clutter.

In data-based storytelling, typically only the points, labels and further elements that support the story are shown, the aspects the readers should focus on, though this approach limits the navigability and users’ overall experience. The audience should be able to compare magnitudes and make inferences based on what is shown, and accurate decoding shouldn’t be taken for granted, especially when the audience can associate different meanings with what’s available and what’s missing.

In decision-making, selecting only some well-chosen values or perspectives to show might increase the chances for a decision to be made, though is this equitable? Cherry-picking may be justified by the purpose, though is in general not a recommended practice! What is not shown can be as important as what is shown, and people should be aware of the implications!

One person’s noise can be another person’s signal. Patterns in the noise can provide more insight compared with the trends revealed in the "unnoisy" data shown! Probably such scenarios are rare, though it’s worth investigating what hides behind the noise. The choice of scale, the use of special types of visualizations or the building of models can reveal more. If it’s not possible to identify automatically such scenarios using the standard software, the users should have the possibility of changing the scale and perspective as seems fit.

Identifying patterns in what seems random can prove to be a challenge no matter the context and the experience in the field. Occasionally, one might need to go beyond the general methods available and statistical packages can help when used intelligently. However, a presenter’s challenge is to find a plausible narrative around the findings and communicate it further adequately. Additional capabilities must be available to confirm the hypotheses framed and other aspects related to this approach.

It's ideal to build data models and a set of visualizations around them. Most probably some noise will be removed in the process, while other noise will be further investigated. However, this should be done through adjustable visual filters because what is removed can be important as well. Rare events do occur, probably more often than we are aware, and they may remain hidden until we find the right perspective that takes them into consideration.

Probably, some of the noise can be explained by special events that don’t need to be that rare. The challenge is to identify those parameters, associations, models and perspectives that reveal such insights. One’s gut feeling and experience can help in this direction, though novel scenarios can surprise us as well.

Not in every set of data one can find patterns, respectively a story trying to come out. Whether we can identify something worth revealing depends also on the data available at our disposal, respectively on whether the chosen data allow identifying significant patterns. Occasionally, the focus might be too narrow, too wide or too shallow. It’s important to look behind the obvious, to look at data from different perspectives, even if the data seems dull. It’s ideal to have the tools and knowledge needed to explore such cases and here the exposure to other real-life similar scenarios is probably critical!


𖣯Strategic Management: Strategic Perspectives (Part II: The Elephant in the Room)

Strategic Management Perspectives

There’s an ancient parable about several blind people who touch a shape they had never met before, an elephant, and try to identify what it is. The elephant is big, more than each person can sense through direct experience, and people’s experiences don’t correlate to the degree that they don’t trust each other, the situation escalating upon case. The moral of the parable is that we tend to claim (absolute) truths based on limited, subjective experience [1], and this can easily happen in business scenarios in which each of us has a limited view of the challenges we are facing individually and as a collective.

The situation from the parable can be met in business scenarios, when we try to make sense of the challenges we are faced with, and we get only a limited perspective from the whole picture. Only open dialog and working together can get us closer to the solution! Even then, the accurate depiction might not be in sight, and we need to extrapolate the unknown further.

A third-party consultant with experience might be the right answer, at least in theory, though experience and solutions are relative. The consultant might lead us in a direction, though from this to finding the answer can be a long way that requires experimentation, a mix of tactics and strategies that change over time, more sense-making and more challenges lying ahead.

We would like a clear answer and a set of steps that lead us to the solution, though the answer is as usual, it depends! It depends on the various forces/drivers that have the biggest impact on the organization, on the context, on the organization’s goals, on the resources available directly or indirectly, on people’s capabilities, the occurrences of external factors, etc.

In many situations the smartest thing to do is to gather information, respectively perspectives from all the parties. Tools like brainstorming, SWOT/PESTLE analysis or scenario planning can help in sense-making to identify the overall picture and where the gravity point lies. For some organizations the solution will be probably a new ERP system, or the redesign of some processes, introduction of additional systems to track quality, flow of material, etc.

A new ERP system will not necessarily solve all the issues (even if that’s the expectation), and some organizations just try to design the old processes into a new context. Process redesign in some areas can be upon case a better approach, at least as primary measure. Otherwise, general initiatives focused on quality, data/information management, customer/vendor management, integrations, and the list remains open, can provide the binder/vehicle an organization needs to overcome the current challenges.

Conversely, if the ERP or other strategic systems are 10-20 years old, then there’s indeed an elephant in the room! Moreover, the elephant might be bigger than we can chew, and other challenges might lurk in its shadow(s). Everything is a matter of perspective with no apparent unique answer. Thus, finding an acceptable solution might lurk in the shadow of the broader perspective, in the cumulated knowledge of the people experiencing the issues, respectively in some external guidance. Unfortunately, the guides can be as blind as we are, making limited or no important impact.

Sometimes, all that’s needed is a leap of faith corroborated with a set of tactics or strategies kept continuously in check, redirected as seems fit based on the knowledge accumulated and the challenges ahead. It helps to be aware of how others approached the same issues. Unfortunately, there’s no answer that works for all! In this lies the challenge, in identifying what works and makes sense for us!


Resources:
[1] Wikipedia (2024) Blind men and an elephant [link]
