SQL Troubles: delta table

Showing posts with label delta table. Show all posts

14 February 2025

🏭🧊🗒️Microsoft Fabric: Partitions in Lakehouses [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 14-Feb-2024

[Microsoft Fabric] Partitions

{def} a data organization technique used to split a large dataset into smaller, more manageable nonoverlapping subsets (aka partitions, shards)

a pattition is defined based on one or more fields
each partition contains a subset of the data
each partitions can be stored and processed independently

{goal} improve performance, scalability, and manageability of large data tables
{benefit} allows to split large tables into smaller, manageable partitions based on specific criteria [2]

e.g., date ranges, regions, categories, entities

{benefit} allows to improve queries' performance as they can target specific partitions [2]

reduces the amount of data scanned [2]
improves queries' performance [2]

{benefit} allows for more efficient data loading [2]
{benefit} facilitates the management of big tables [2]

maintenance tasks can be performed on individual partitions [2]
obsolete data partitions can be removed with no overhead, adding new partitions on a need basis [2]

applies to

backups
indexing
allows optimizing query performance for specific subsets of data
statistics

performance can be affected by

the choice of partition columns for a delta table [1]
the number and size of partitions of the partition column [1]
a column with high cardinality (mostly or entirely made of unique values) results in a large number of partitions [1]

⇐ negatively impacts performance of the metadata discovery scan for changes [1]
{recommendation} if the cardinality of a column is high, choose another column for partitioning [1]

the size of each partition can also affect performance

{recommendation} use a column that would result in a partition of at least (or close to) 1 GB [1]
{recommendation} follow the best practices for delta tables maintenance [1]
a large volume of small-sized parquet files increases the time it takes to sync the changes [1]

⇒ leads to large number of parquet files in a delta table due to [1]

over-partitioning

partitions with high number of unique values [1]
{recommendation} choose a partition column that

doesn't have a high cardinality [1]
results in individual partition size of at least 1 GB [1]

many small files

batch and streaming data ingestion rates might result in small files

depends on the frequency and size of changes being written to a lakehouse [1]

{recommendation} implement regular lakehouse table maintenance [1]

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]

[2] Kenneth A Omorodion (2024) Partitioning Data in Microsoft Fabric to Improve Performance

written [link]

[3] Microsoft Learn (2024) Microsoft Fabric: Loading Fabric Lakehouse Tables with partitions [link]

[4]

Resources

[R1] Microsoft Learn (2024) Microsoft Fabric: Load data to Lakehouse using partition in a Data pipeline [link]

[R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

13 February 2025

🏭💠🗒️Microsoft Fabric: SQL Analytics Endpoint [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 12-Feb-2024

[Microsoft Fabric] SQL Analytics Endpoint

{def} a service that listens actively for requests providing a SQL-based experience for lakehouse Delta tables [1]
- enables to query data in the lakehouse using T-SQL language and TDS protocol [1]
- created by default for each lakehouses and mirrored databases provisioned in a workspace
- manages the automatically generated tables so the workspace users can't modify them [1]
{feature} a background process is responsible for

scanning lakehouse for changes [1]

automatically generates a table in the appropriate schema for every Delta table in the Lakehouse

tables are created with a minor delay [1]
data is refreshed automatically

the amount of time it takes to refresh the table is related to how optimized the Delta tables are [1]

{alternative} one can manually force a refresh of the automatic metadata scanning in the Fabric portal [1]

keeping SQL analytics endpoint up-to-date for all the changes committed to lakehouses in a workspace [1]

{characteristic} transparently managed

⇐ the sync process is transparently managed by Microsoft Fabric platform [1]
when a change is detected in the lakehouse

a background process updates metadata [1]
the SQL analytics endpoint reflects the changes committed to lakehouse tables [1]

{characteristic} low latency

under normal operating conditions, the lag between a lakehouse and SQL analytics endpoint is less than one minute [1]
the actual length of time can vary from a few seconds to minutes [1]

the metadata synchronization is automatically triggered when the SQL Endpoint is opened [3]

ensures the SQL Analytics Endpoint remains up to date without the need for manual syncs [3]

⇐ provides a seamless experience [3]

{feature} can be enriched by adding database objects

schemas
views
procedures
other database objects

{feature} automatic metadata discovery

tracks changes committed to lakehouses [1]

is a single instance per Fabric workspace [1]
{issue} increased latency for changes to sync between lakehouses and SQL analytics endpoint, it could be due to large number of lakehouses in one workspace [1]

{resolution} migrate each lakehouse to a separate workspace [1]

allows automatic metadata discovery to scale [1]

{issue} changes committed to a lakehouse are not visible in the associated SQL analytics endpoint

create a new table in the [1]
many committed rows may not be visible in the endpoint [1]
⇐ may be impacted by ETL processing that generate large volumes of changes
{recommendation} initiate an on-demand metadata sync, triggered from the SQL query editor Refresh ribbon option [1]

forces an on-demand metadata sync, rather than waiting on the background metadata sync to finish [1]

{issue} if there's no maintenance scheduled for the parquet files, this can result in read overhead and this impacts time it takes to sync changes to SQL analytics endpoint [1]

{recommendation}schedule regular lakehouse table maintenance operations [1]

{limitation} not all Delta features are understood by the automatic sync process [1]

{feature} endpoint reprovisioning

if the initial provisioning attempt fails, users have the option to try again [2]

⇐ without the need to create an entirely new lakehouse [2]
{benefit} empowers users to self-mitigate provisioning issues in convenient way in the UI avoiding the need for complete lakehouse re-creation [2]

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]
[2] Microsoft Learn (2024) Microsoft Fabric: What is the SQL analytics endpoint for a lakehouse? [link]
[3] Microsoft Learn (2024) Microsoft Fabric: What’s new in the Fabric SQL Analytics Endpoint? [link]

Resources:

[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

18 April 2024

🏭Data Warehousing: Microsoft Fabric (Part II: Data(base) Mirroring) [New feature]

Data Warehousing Series

Microsoft recently announced [4] the preview of a new Fabric feature called Mirroring, a low-cost, low-latency fully managed service that allows to replicate data from various systems together into OneLake [1]. Currently only Azure SQL Database, Azure Cosmos DB, and Snowflake are supported, though probably more database vendors will be targeted soon.

For Microsoft Fabric's data engineers, data scientists and data warehouse professionals this feature is huge as importance because they don't need to care anymore about making the data available in Microsoft Fabric, which involves a considerable amount of work.

Usually, at least for flexibility, transparence, performance and standardization, data professionals prefer to extract the data 1:1 from the source systems into a landing zone in the data warehouse or data/delta lake from where the data are further processed as needed. One data pipeline is thus built for every table in scope, which sometimes is a 10–15-minute effort per table, when the process is standardized, though upon case the effort is much higher if troubleshooting (e.g. data type incompatibility or support) or further logic changes are involved. Maintaining such data pipelines can prove to be costly over time, especially when periodic changes are needed.

Microsoft lists other downsides of the ETL approach - restricted access to data changes, friction between people, processes, and technology, respectively the effort needed to create the pipelines, and the time needed for importing the data [1]. There's some truth is each of these points, though everything is relative. For big tables, however, refreshing all the data overnight can prove to be time-consuming and costly, especially when the data don't lie within the same region, respectively data center. Unless the data can be refreshed incrementally, the night runs can extend into the day, will all the implications that derive from this - not having actual data, which decreases the trust in reports, etc. There are tricks to speed up the process, though there are limits to what can be done.

With mirroring, the replication of data between data sources and the analytics platform is handled in the background, after an initial replication, the changes in the source systems being reflected with a near real-time latency into OneLake, which is amazing! This allows building near real-time reporting solutions which can help the business in many ways - reviewing (and correcting in the data source) records en masse, faster overview of what's happening in the organizations, faster basis for decision-making, etc. Moreover, the mechanism is fully managed by Microsoft, which is thus responsible for making sure that the data are correctly synchronized. Only from this perspective 10-20% from the effort of building an analytics solution is probably reduced.

Mirroring in Microsoft Fabric (adapted after [2])

According to the documentation, one can replicate a whole database or choose individual regular tables (currently views aren't supported [3]), stop, restart, or remove a table from a mirroring. Moreover, through sharing, users can grant to other users or groups of users access to a mirrored database without giving access to the workspace and the rest of its items [1].

The data professionals and citizens can write then cross-database queries against the mirrored databases, warehouses, and the SQL analytics endpoints of lakehouses, combining data from all these sources into a single T-SQL query, which opens lot of opportunities especially in what concerns the creation of an enterprise semantic model, which should be differentiated from the semantic model created by default by the mirroring together with the SQL analytics endpoint.

Considering that the data is replicated into delta tables, one can take advantage of all the capabilities available with such tables - data versioning, time travel, interoperability and/or performance, respectively direct consumption in Power BI.

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn - Microsoft Fabric (2024) What is Mirroring in Fabric? (link)
[2] Microsoft Learn - Microsoft Fabric (2024) Mirroring Azure SQL Database [Preview] (link)
[3] Microsoft Learn - Microsoft Fabric (2024) Frequently asked questions for Mirroring Azure SQL Database in Microsoft Fabric [Preview] (link)
[4] Microsoft Fabric Updates Blog (2024) Announcing the Public Preview of Mirroring in Microsoft Fabric, by Charles Webb (link)

10 April 2024

💎🏭SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Import data from CSV Files into Delta Table

Microsoft provides a set of labs and exercises that can be used to learn working with data in Fabric, however the real learning comes when one considers an example that introduces something new. As I've downloaded some time ago an archive with several datasets on Sales forecast from the Kaggle website, I tried to import the Features dataset in different ways and see how it goes.

Store,Date,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,IsHoliday
1,2010-02-05,42.31,2.572,NA,NA,NA,NA,NA,211.0963582,8.106,FALSE
1,2010-02-12,38.51,2.548,NA,NA,NA,NA,NA,211.2421698,8.106,TRUE
1,2010-02-19,39.93,2.514,NA,NA,NA,NA,NA,211.2891429,8.106,FALSE
1,2010-02-26,46.63,2.561,NA,NA,NA,NA,NA,211.3196429,8.106,FALSE
1,2010-03-05,46.5,2.625,NA,NA,NA,NA,NA,211.3501429,8.106,FALSE

Within an existing lakehouse, one can import the CSV as it is via 'Files/Upload' and once the data is imported, once can navigate to the file and use 'Load to Tables/New Table' to important the data into a managed table. Unfortunately, because some of the numeric fields include also literal values "NA" for the columns with NULLs, their data type is considered as varchar(8000), which is not ideal for calculations:

-- table created via Load to Tables
CREATE TABLE [dbo].[walmart_features](
	[Store] [int] NULL,
	[Date] [date] NULL,
	[Temperature] [float] NULL,
	[Fuel_Price] [float] NULL,
	[MarkDown1] [varchar](8000) NULL,
	[MarkDown2] [varchar](8000) NULL,
	[MarkDown3] [varchar](8000) NULL,
	[MarkDown4] [varchar](8000) NULL,
	[MarkDown5] [varchar](8000) NULL,
	[CPI] [varchar](8000) NULL,
	[Unemployment] [varchar](8000) NULL,
	[IsHoliday] [bit] NULL
) ON [PRIMARY]

This could be fixed by replacing the NA values with an empty value, which I did and used this version for the next steps.

I tried then using Spark to import the data, though then all the fields are defined as varchar(8000).

-- table created via Spark
CREATE TABLE [dbo].[walmart_features2](
	[Store] [varchar](8000) NULL,
	[Date] [varchar](8000) NULL,
	[Temperature] [varchar](8000) NULL,
	[Fuel_Price] [varchar](8000) NULL,
	[MarkDown1] [varchar](8000) NULL,
	[MarkDown2] [varchar](8000) NULL,
	[MarkDown3] [varchar](8000) NULL,
	[MarkDown4] [varchar](8000) NULL,
	[MarkDown5] [varchar](8000) NULL,
	[CPI] [varchar](8000) NULL,
	[Unemployment] [varchar](8000) NULL,
	[IsHoliday] [varchar](8000) NULL
) ON [PRIMARY]
GO

So, is needed to define the schema explicitly, however I had to import the IsHoliday as string and cast the value explicitly to a Boolean using a second data frame (see alternatives):

from pyspark.sql.types import *
from pyspark.sql.functions import *

#define schema
featuresSchema = StructType([
      StructField("Store", IntegerType())
    , StructField("Date", DateType())
    , StructField("Temperature",  DecimalType(13,2))
    , StructField("Fuel_Price", DecimalType(13,2))
    , StructField("MarkDown1", DecimalType(13,2))
    , StructField("MarkDown2", DecimalType(13,2))
    , StructField("MarkDown3", DecimalType(13,2))
    , StructField("MarkDown4", DecimalType(13,2))
    , StructField("MarkDown5", DecimalType(13,2))
    , StructField("CPI", DecimalType(18,6))
    , StructField("Unemployment", DecimalType(13,2))
    , StructField("IsHoliday", StringType())
])

# Load a file into a dataframe
df = spark.read.load('Files/OpenSource/features2.csv'
    , format='csv'
    , schema = featuresSchema
    , header=True)

# do the conversion for isHoliday
df2 = df.withColumn("IsHoliday", df.IsHoliday.cast(BooleanType()))

# Save the dataframe as a delta table
df2.write.format("delta").saveAsTable("walmart_features3")

Now, table's definition looks much better:

-- table created via Spark with explicit schema
CREATE TABLE [dbo].[walmart_features3](
	[Store] [int] NULL,
	[Date] [date] NULL,
	[Temperature] [decimal](13, 2) NULL,
	[Fuel_Price] [decimal](13, 2) NULL,
	[MarkDown1] [decimal](13, 2) NULL,
	[MarkDown2] [decimal](13, 2) NULL,
	[MarkDown3] [decimal](13, 2) NULL,
	[MarkDown4] [decimal](13, 2) NULL,
	[MarkDown5] [decimal](13, 2) NULL,
	[CPI] [decimal](18, 6) NULL,
	[Unemployment] [decimal](13, 2) NULL,
	[IsHoliday] [bit] NULL
) ON [PRIMARY]
GO

Comments:
(1) I tried to apply the schema change directly on the initial data frame, though the schema didn't change:

df.withColumn("IsHoliday", df.IsHoliday.cast(BooleanType()))

(2) For the third method one could have left the NA in because by the conversion a NULL will be considered. Conversely, it might be needed to check if there are other values that fail the conversion.

(3) The following warning in the Notebook when running the above code is a hint that something went wrong during the conversion (e.g. decimals were cut):

"Your file(s) might include corrupted records"

(4) Especially for the transformed values it makes sense to look at the values (at last when the dataset isn't too big):

-- validating the values for the Boolean data field
SELECT IsHoliday
, count(*) NoRecords
FROM dbo.walmart_features3
GROUP BY IsHoliday
ORDER BY 1

(5) The tables can be deleted directly in the lakehouse or via PySpark (observe the catalog.table_name):

#dropping the table
spark.sql('DROP TABLE IF EXISTS UAT.walmart_features3')

At the beginning probably it makes sense to remove "IF EXISTS" to make sure that the table is available.

(6) For those who run into a similar issue, the SQL Endpoint for the lakehouse is read only, therefore attempting to delete a table via SSMS will result in such an error:

"Drop failed for Table 'dbo.walmart_features2'. (Microsoft.SqlServer.Smo)
The external policy action 'Microsoft.Sql/Sqlservers/Databases/Schemas/Tables/Drop' was denied on the requested resource.
Cannot drop the table 'walmart_features2', because it does not exist or you do not have permission."

Happy coding!

10 March 2024

🏭📑Microsoft Fabric: Medallion Architecture [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)!

Last updated: 10-Mar-2024

Medallion Architecture in Microsoft Fabric [1]

Medallion architecture

a recommended data design pattern used to organize data in a lakehouse logically [2]

compatible with the concept of data mesh

{goal} incrementally and progressively improve the structure and quality of data as it progresses through each stage [1]

brings structure and efficiency to a lakehouse environment [2]
ensures that data is reliable and consistent as it goes through various checks and changes [2]
complements other data organization methods, rather than replacing them [2]

consists of three distinct layers (or zones)

{layer} bronze (aka raw zone)

stores source data in its original format [1]
the data in this layer is typically append-only and immutable [1]
{recommendation} store the data in its original format, or use Parquet or Delta Lake [1]
{recommendation} create a shortcut in the bronze zone instead of copying the data across [1]

works with OneLake, ADLS Gen2, Amazon S3, Google

{operation} ingest data

{characteristic} maintains the raw state of the data source [3]
{characteristic} is appended incrementally and grows over time [3]
{characteristic} can be any combination of streaming and batch transactions [3]
⇒ retaining the full, unprocessed history

⇒ provides the ability to recreate any state of a given data system [3]

additional metadata may be added to data on ingest

e.g. source file names, recording the time data was processed

{goal} enhanced discoverability [3]
{goal} description of the state of the source dataset [3]
{goal} optimized performance in downstream applications [3]

{layer} silver (aka enriched zone)

stores data sourced from the bronze layer
the raw data has been

cleansed
standardized
structured as tables (rows and columns)
integrated with other data to provide an enterprise view of all business entities

{recommendation} use Delta tables

provide extra capabilities and performance enhancements [1]

{default} every engine in Fabric writes data in the delta format and use V-Order write-time optimization to the Parquet file format [1]

{operation} validate and deduplicate data
for any data pipeline, the silver layer may contain more than one table [3]

{layer} gold (aka curated zone)

stores data sourced from the silver layer [1]
the data is refined to meet specific downstream business and analytics requirements [1]
tables typically conform to star schema design

supports the development of data models that are optimized for performance and usability [1]

use lakehouses (one for each zone), a data warehouse, or combination of both

the decision should be based on team's preference and expertise of your team.
different analytic engines can be used [1]

⇐ schemas and tables within each layer can take on a variety of forms and degrees of normalization [3]

depends on the frequency and nature of data updates and the downstream use cases for the data [3]

{pattern} create each zone as a lakehouse

business users access data by using the SQL analytics endpoint [1]

{pattern} create the bronze and silver zones as lakehouses, and the gold zone as data warehouse

business users access data by using the data warehouse endpoint [1]

{pattern} create all lakehouses in a single Fabric workspace

{recommendation} create each lakehouse in its own workspace [1]
provides more control and better governance at the zone level [1]

{concept} data transformation

involves altering the structure or content of data to meet specific requirements [2]

via Dataflows (Gen2), notebooks

{concept} data orchestration

refers to the coordination and management of multiple data-related processes, ensuring they work together to achieve a desired outcome [2]

via data pipelines

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn: Fabric (2023) Implement medallion lakehouse architecture in Microsoft Fabric (link)
[2] Microsoft Learn: Fabric (2023) Organize a Fabric lakehouse using medallion architecture design (link)
[3] Microsoft Learn: Azure (2023) What is the medallion lakehouse architecture? (link)

Resources:
[R1] Serverless.SQL (2023) Data Loading Options With Fabric Workspaces, by Andy Cutler (link)
[R2] Microsoft Learn: Fabric (2023) Lakehouse end-to-end scenario: overview and architecture (link)
[R3] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
ADLS - Azure Data Lake Store Gen2

17 February 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part I: Notebooks)

Business Intelligence Series

When several technologies make their entrance in a data-related field like Data Warehousing, Data Analitics or Data Science, one is forced to understand how the respective technologies can be used or misused, respectively what's their place in the bigger picture. Microsoft Fabric introduces several important technologies that will change the way data are stored, processed and consumed.

The first important technology is the notebook - a web document-like cell-based container for writing and executing code in a collaborative manner. The concept is not new, Jupyter notebooks have been around for almost a decade. In Microsof Fabric, notebooks support multiple languages, from which a default one applies to the whole notebook, while on cell level any of the supported languages can be used.

One can execute a single cell, multiple cells or the entire notebook in a sequential manner, mix languages for the various operations - load, transform, save, and visualize data when needed. Notebooks can be parametrized and run via the homonymous activity in Data Factory pipelines, automating thus data processing. Probably more functionality is to come.

Data engineers seems to have great flexibility, though usually flexibility implies constraints and/or mischiefs in other areas. I see for example in presentations the overuse of temporary data objects (mainly views) in Spark SQL as part of complex logic. That's acceptable during prototyping, though such code becomes a danger as soon the logic is deployed into production. Data objects should be created outside of the logic that uses them and should be treated as artifacts, with version control and proper documentation. It's maybe true that temporary objects reduce the volume of objects in the metastore, though is this the way to go?

Temporary objects tend to lead to wheel's reinvention or they get duplicated across multiple notebooks, which can easily create a maintenance nightmare. One needs to consider that the business logic changes a lot, the requirements and the data sources change, and on the long term, the cost of maintaining the code can easily overweight the benefits.

Notebooks remind me of the beginnings of web programming when HTML was mixed up with client scripting languages like VB Script or Javascript, CSS, respectively server-side scripting languages. It was kind of a spaghetti code, modified repeatedly by multiple programmers, unendingly duplicated, and through a miracle it worked, until it stopped working unexpectedly in strangest situations. The strangest part was when after removing commented code from a section made the code run again.

The debugging of another person's code was a nightmare. Code developed by two people for similar purposes was looking unrecognizable different in terms of structure, programming techniques and layout. The technical debt was high, increasing in exponential manner. One was aware that the code needed refactoring, though there were more important things to do or no time allocated for it.

In the meantime the maturity of programming languages, frameworks, methodologies, best practices, and hopefully of programmers improved the overall quality of software (at least on average). Thinking of software from an Engineer's perspective improved the efficiency and effectiveness of a programmer's endeavor. The average programmer is able to write quality code, though there's a considerable minimum of "engineering" knowledge involved beside the mere knowledge of languages and tools.

Notebooks are good up to a point, beyond which one needs to take a step back, restructure, move the code where it belongs, take a few more steps back and review the good practices and their application, disseminate the knowledge inside the team and use it in the next iterations, respectively refractor the code when needed! Hopefully, people learned from the mistakes of the past.

Resources:
[1] Microsoft Learn (2023) How to use Microsoft Fabric notebooks (link)

07 February 2024

💎🏭SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Views and other Data Objects

One reads in the training material that the SQL Endpoint provides a read-only experience [1], meaning that no data can be written back to the delta lake tables. Playing with the metadata available in Spark SQL via Notebooks and the SQL Endpoint (see post), I realized that there is more to the statement! Even if one can query via the SQL Endpoint only delta tables, this doesn't mean that one can't build a semantic model on top of it, much like one was able to do via the Serverless SQL pool in Azure Synapse.

In Spark one can create via SQL, PySpark and the other supported languages views and functions, though they will not be available to the SQL Endpoint! To use the data generated in the process, the respective data needs to be saved to delta tables. Conversely, one can still create views, functions and stored procedures via the SQL Endpoint though the objects won't be available in Spark SQL!

This has important implications, though in this post let's focus on the syntax and create several objects for testing purposes in the two environments. I'll use the Assets delta table created in a previous post. The Spark SQL code should be run in a notebook (e.g. one cell per group of statements), while the code for the SQL Endpoint should be run in SQL Server Management Studio.

Views

/* test view's creation in the SQL Endpoint */

-- drop the test view 
DROP VIEW IF EXISTS dbo.vAssets_Microsoft2;
GO

-- create the test view
CREATE VIEW dbo.vAssets_Microsoft2
AS
--Microsoft assets
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets
WHERE Vendor = 'Microsoft';
GO

-- test the viwe
SELECT *
FROM dbo.vAssets_Microsoft2;

/* test view's creation in Spark SQL */

-- drop test view 
DROP VIEW IF EXISTS vAssets_Microsoft;

-- create test view
CREATE VIEW vAssets_Microsoft COMMENT 'Microsoft assets in scope (view)'
AS
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM assets
WHERE Vendor = 'Microsoft';

-- review data
SELECT *
FROM vAssets_Microsoft;

Table-Valued Functions

/* test function's creation in the SQL Endpoint */

-- drop the test function 
DROP FUNCTION IF EXISTS dbo.fAssets_Microsoft2;
GO

-- create the test function
CREATE FUNCTION dbo.fAssets_Microsoft2(
    @Vendor nvarchar(max))
RETURNS TABLE 
AS 
RETURN (
    SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
    FROM dbo.Assets
    WHERE Vendor = @Vendor
);
GO

-- test the function
SELECT *
FROM dbo.fAssets_Microsoft2('Microsoft');

SELECT *
FROM dbo.fAssets_Microsoft2('Dell');

Unfortunately, the Spark SQL code doesn't seem to work, its execution returning a PARSE_SYNTAX_ERROR error no matter how simple the code was (see also [2]).

/* test function's creation in Spark SQL */

-- drop test function 
DROP FUNCTION IF EXISTS fAssets_Microsoft;

-- create test function
CREATE FUNCTION fAssets_Microsoft(
    pVendor string)
RETURNS TABLE
AS 
RETURN 
    SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
    FROM assets
    WHERE Vendor = pVendor;

-- review data
SELECT *
FROM fAssets_Microsoft('Microsoft');

Stored Procedure

Stored procedures aren't available in Spark SQL, though this doesn't mean that we can't test the code in the SQL Endpoint:

/* test procedure's creation in the SQL Endpoint */

-- drop the test procedure 
DROP PROCEDURE IF EXISTS dbo.spAssets_Microsoft2;
GO

-- create the test procedure
CREATE PROCEDURE dbo.spAssets_Microsoft2(
@Vendor nvarchar(max) = NULL)
AS
--Microsoft assets
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets
WHERE Vendor = IsNull(@Vendor, Vendor);
GO

-- test the procedure
EXEC dbo.spAssets_Microsoft2 'Microsoft';
EXEC dbo.spAssets_Microsoft2 'Dell';
EXEC dbo.spAssets_Microsoft2;

Notes:

1) I observed in documentation and some presentations that the common practice of prefixing data objects based on their type is seldom considered. I still find it useful when building solutions, even if object's type can be derived from the context and/or metadata.

2) The examples were chosen to test the minimal functionality so that the differences between the two platforms are minimal - using the dbo schema and the GO command in the SQL Endpoint, COMMENT in Spark SQL. However, as soon specific functionality is used, extra code is needed to mitigate the differences.
3) The names between environments were kept different, just in case one needs to test objects' availability between platforms.

Happy coding!

Previous Post <<||>> Next Post

Resources:
[1] Microsoft Learn (2023) Work with Delta Lake tables in Microsoft Fabric (link)

[2] Databricks (2023) CREATE FUNCTION (SQL and Python) (link)

06 February 2024

💎🏭SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Data Objects Metadata

There seem to be four main sources for learning about the functionality available in the Delta Lake especially in what concerns the SQL dialect used by the Spark SQL: Databricks [1], Delta Lake [2], Azure Databricks [3], respectively the Data Engineering documentation in Microsoft Fabric [4] and the afferent certification material. Unfortunately, the latter focuses more on PySpark. So, until Microsoft addresses the gap, one can consult the other sources, check what's working and built thus the required knowledge for handling the various tasks.

First of all, it's important to understand which the data objects available in Microsoft Fabric are. Based on [5] I could identify the following hierarchy:

According to the same source [5] the metastore contains all of the metadata that defines data objects in the lakehouse, while the catalog is the highest abstraction in the lakehouse. The database, called also a schema, keeps its standard definition - a collection of data objects, such as tables or views (aka relations) and functions. The tables, views and functions keep their standard definitions. Except the metastore, these are also the (securable) objects on which permissions can be set.

One can explore the structure in Spark SQL by using the SHOW command:

-- explore the data objects from the lakehouse
SHOW CATALOGS;

SHOW DATABASES;

SHOW SCHEMAS;

SHOW CATALOGS;

SHOW VIEWS;

SHOW TABLES;

SHOW FUNCTIONS;

Moreover, one can list only the objects from an object from the parent (e.g. the tables existing in a database):

-- all tables from a database
SHOW TABLES FROM Testing;

-- all tables from a database matching a pattern
SHOW TABLES FROM Testing LIKE 'Asset*';

-- all tables from a database matching multiple patterns
SHOW TABLES FROM Testing LIKE 'Asset*|cit*';

Notes:
1) Same syntax applies for views and functions, respectively for the other objects in respect to their parents (from the hierarchy).
2) Instead of FROM one can use the IN keyword, though I'm not sure what's the difference given that they seem to return same results.
3) For databases and tables one can use also the SQL Server to export the related metadata. Unfortunately, it's not the case for views and functions because when the respective objects are created in Spark SQL, they aren't visible over the SQL Endpoint, and vice versa.

4) Given that there are multiple environments that deal with delta tables, to minimize the confusion that might result from their use, it makes sense to use database as term instead of schema.

References:
[1] Databricks (2024) SQL language reference (link)
[2] Delta Lake (2023) Delta Lake documentation (link)
[3] Microsoft Learn (2023) Azure Databricks documentation (link)
[4] Microsoft Learn (2023) Data Engineering documentation in Microsoft Fabric (link)
[5] Databricks (2023) Data objects in the Databricks lakehouse (link)

💎🏭SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Table Metadata II (Updating a table's COMMENT attributes)

While using the DESCRIBE TABLE metadata I observed that its output shown also a Comment attribute which was NULL for all the columns. Browsing through the Databricks documentation (see [1]), I found that the Comment can be provided in a delta table's definition as follows (code to be run in a notebook):

-- drop the table (if already exists)
DROP TABLE IF EXISTS Assets2;

--create the table
CREATE TABLE Assets2 (
 Id int NOT NULL COMMENT 'Asset UID',
 CreationDate timestamp NOT NULL COMMENT 'Creation Date',
 Vendor string NOT NULL COMMENT 'Vendors name',
 Asset string NOT NULL COMMENT 'Asset type',
 Model string NOT NULL COMMENT 'Asset model',
 Owner string NOT NULL COMMENT 'Current owner',
 Tag string NOT NULL COMMENT 'Assets tag',
 Quantity decimal(13, 2) NOT NULL COMMENT 'Quantity'
) 
USING DELTA
COMMENT 'Vendor assets';

-- show table's definition
DESCRIBE TABLE Assets2;

-- show table's details
DESCRIBE DETAIL Assets;

So, I thought there must be a way to update Assets table's definition as well. And here's the solution split into two steps, as different syntax is required for modifying the columns, respectively the table:

-- modify columns' COMMENT for an existing table
ALTER TABLE Assets ALTER COLUMN ID COMMENT 'Asset UID';
ALTER TABLE Assets ALTER COLUMN CreationDate COMMENT 'Creation Date';
ALTER TABLE Assets ALTER COLUMN Vendor COMMENT 'Vendors name';
ALTER TABLE Assets ALTER COLUMN Asset COMMENT 'Asset type';
ALTER TABLE Assets ALTER COLUMN Model COMMENT 'Asset model';
ALTER TABLE Assets ALTER COLUMN Owner COMMENT 'Current owner';
ALTER TABLE Assets ALTER COLUMN Tag COMMENT 'Assets tag';
ALTER TABLE Assets ALTER COLUMN Quantity COMMENT 'Quantity';

-- show table's definition
DESCRIBE TABLE Assets;

The operation generated a log file with the following content:

{"commitInfo":{"timestamp":1707180621522,"operation":"CHANGE COLUMN","operationParameters":{"column":"{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}"},"readVersion":13,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"42590d18-e04a-4fb2-bbfa-94414b23fb07"}}
{"metaData":{"id":"83e87b3c-28f4-417f-b4f5-842f6ba6f26d","description":"Vendor assets","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[
{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset UID\"}},
{\"name\":\"CreationDate\",\"type\":\"timestamp\",\"nullable\":false,\"metadata\":{\"comment\":\"Creation Date\"}},
{\"name\":\"Vendor\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Vendors name\"}},
{\"name\":\"Asset\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset type\"}},
{\"name\":\"Model\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset model\"}},
{\"name\":\"Owner\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Current owner\"}},
{\"name\":\"Tag\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Assets tag\"}},
{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}]}","partitionColumns":[],"configuration":{},"createdTime":1707149934697}}

And the second step (see [2]):

-- modify a table's COMMENT
COMMENT ON TABLE Assets IS 'Vendor assets';

-- describe table's details
DESCRIBE DETAIL Assets;

The operation generated a log file with the following content:

{"commitInfo":{"timestamp":1707180808138,"operation":"SET TBLPROPERTIES","operationParameters":{"properties":"{\"comment\":\"Vendor assets\"}"},"readVersion":14,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"21c315b5-a81e-4107-8a19-6256baee7bd5"}}
{"metaData":{"id":"83e87b3c-28f4-417f-b4f5-842f6ba6f26d","description":"Vendor assets","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[
{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset UID\"}}
,{\"name\":\"CreationDate\",\"type\":\"timestamp\",\"nullable\":false,\"metadata\":{\"comment\":\"Creation Date\"}}
,{\"name\":\"Vendor\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Vendors name\"}}
,{\"name\":\"Asset\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset type\"}}
,{\"name\":\"Model\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset model\"}}
,{\"name\":\"Owner\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Current owner\"}}
,{\"name\":\"Tag\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Assets tag\"}}
,{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}]}","partitionColumns":[],"configuration":{},"createdTime":1707149934697}}

Notes:
1) Don't forget to remove the Assets2 table!
2) Updating the COMMENT for a single delta table is simple, though for updating the same for a list of tables might be task for a PySpark job. It makes sense to provide the respective metadata from the start, when that's possible. Anyway, the information should be available in lakehouse's data dictionary.
3) One can reset the COMMENT to NULL or to an empty string.

Happy coding!

Resources:

[1] Databaricks (2023) ALTER TABLE (link)
[2] Databaricks (2023) COMMENT ON (link)

SQL Troubles

Pages

14 February 2025

🏭🧊🗒️Microsoft Fabric: Partitions in Lakehouses [Notes]

13 February 2025

🏭💠🗒️Microsoft Fabric: SQL Analytics Endpoint [Notes]

18 April 2024

🏭Data Warehousing: Microsoft Fabric (Part II: Data(base) Mirroring) [New feature]

10 April 2024

💎🏭SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Import data from CSV Files into Delta Table

10 March 2024

🏭📑Microsoft Fabric: Medallion Architecture [Notes]

17 February 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part I: Notebooks)

07 February 2024

💎🏭SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Views and other Data Objects

06 February 2024

💎🏭SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Data Objects Metadata

💎🏭SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Table Metadata II (Updating a table's COMMENT attributes)

About Me