
08 March 2025

🏭🎗️🗒️Microsoft Fabric: Eventstreams [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 8-Mar-2025

Real-Time Intelligence architecture
Real-Time Intelligence architecture [4]

[Microsoft Fabric] Eventstream(s)

  • {def} feature in Microsoft Fabric's Real-Time Intelligence experience that allows users to bring real-time events into Fabric
    • brings real-time events into Fabric, transforms them, and then routes them to various destinations without writing any code 
      • ⇐ aka no-code solution
      • {feature} drag and drop experience 
        • gives users an intuitive and easy way to create event data processing, transforming, and routing logic without writing any code 
    • eventstreams work by creating a pipeline of events from multiple internal and external sources to different destinations
      • like a conveyor belt that moves data from one place to another [1]
      • transformations to the data can be added along the way [1]
        • filtering, aggregating, or enriching
  • {def} eventstream
    • an instance of the Eventstream item in Fabric [2]
    • {feature} end-to-end data flow diagram 
      • provide a comprehensive understanding of the data flow and organization [2].
  • {feature} eventstream visual editor
    • used to design pipelines by dragging and dropping different nodes [1]
    • sources
      • where event data comes from 
      • one can choose
        • the source type
        • the data format
        • the consumer group
      • Azure Event Hubs
        • allows getting event data from an Azure event hub [1]
        • allows creating a cloud connection with the appropriate authentication and privacy level [1]
      • Azure IoT Hub
        • SaaS service used to connect, monitor, and manage IoT assets with a no-code experience [1]
      • CDC-enabled databases
        • software process that identifies and tracks changes to data in a database, enabling real-time or near-real-time data movement [1]
        • Azure SQL Database 
        • PostgreSQL Database
        • MySQL Database
        • Azure Cosmos DB
      • Google Cloud Pub/Sub
        • messaging service for exchanging event data among applications and services [1]
      • Amazon Kinesis Data Streams
        • collect, process, and analyze real-time, streaming data [1]
      • Confluent Cloud Kafka
        • fully managed service based on Apache Kafka for stream processing [1]
      • Fabric workspace events
        • events triggered by changes in Fabric Workspace
          • e.g. creating, updating, or deleting items. 
        • allows to capture, transform, and route events for in-depth analysis and monitoring within Fabric [1]
        • the integration offers enhanced flexibility in tracking and understanding workspace activities [1]
      • Azure blob storage events
        • system triggers for actions like creating, replacing, or deleting a blob [1]
          • these actions are linked to Fabric events
            • allowing to process Blob Storage events as continuous data streams for routing and analysis within Fabric  [1]
        • support streamed or unstreamed events [1]
      • custom endpoint
        • REST APIs or SDKs can be used to send event data from a custom app to the eventstream [1]
        • allows to specify the data format and the consumer group of the custom app [1]
      • sample data
        • out-of-box sample data
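For the custom endpoint source above, the application composes and pushes its own events. As a minimal illustration, here is a sketch in Python of building such an event payload; the schema and field names are made up for the example, and in practice the payload would be sent using the endpoint's SDK or REST connection details shown in the Eventstream item:

```python
import json
from datetime import datetime, timezone

def make_event(device_id: str, temperature: float) -> str:
    # compose a JSON event for a custom-endpoint source (hypothetical schema)
    event = {
        "deviceId": device_id,
        "temperature": temperature,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)

payload = make_event("sensor-01", 21.5)
print(payload)
```

The consumer group and data format selected on the source then determine how the eventstream interprets such payloads.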
    • destinations
      • where transformed event data is stored. 
        • in a table in an eventhouse or a lakehouse [1]
        • redirect data to 
          • another eventstream for further processing [1]
          • an activator to trigger an action [1]
      • Eventhouse
        • offers the capability to funnel your real-time event data into a KQL database [1]
      • Lakehouse
        • allows preprocessing real-time events before their ingestion into the lakehouse
          • the events are transformed into Delta Lake format and later stored in specific lakehouse tables [1]
            • facilitating the data warehousing needs [1]
      • custom endpoint
        • directs real-time event traffic to a bespoke application [1]
        • enables the integration of proprietary applications with the event stream, allowing for the immediate consumption of event data [1]
        • {scenario} transfer real-time data to an independent system not hosted in Microsoft Fabric [1]
      • Derived Stream
        • specialized destination created post-application of stream operations like Filter or Manage Fields to an eventstream
        • represents the altered default stream after processing, which can be routed to various destinations within Fabric and monitored in the Real-Time hub [1]
      • Fabric Activator
        • enables triggering automated actions based on values in streaming data [1]
    • transformations
      • filter or aggregate the data as it is processed from the stream [1]
      • include common data operations
        • filtering
          • filter events based on the value of a field in the input
          • depending on the data type (number or text), the transformation keeps the values that match the selected condition, such as is null or is not null [1]
        • joining
          • transformation that combines data from two streams based on a matching condition between them [1]
        • aggregating
          • calculates an aggregation every time a new event occurs over a period of time [1]
            • Sum, Minimum, Maximum, or Average
          • allows renaming calculated columns, and filtering or slicing the aggregation based on other dimensions in your data [1]
          • one can have one or more aggregations in the same transformation [1]
        • grouping
          • allows to calculate aggregations across all events within a certain time window [1]
            • one can group by the values in one or more fields [1]
          • allows for the renaming of columns
            • similar to the Aggregate transformation 
            • ⇐ provides more options for aggregation and includes more complex options for time windows [1]
          • allows to add more than one aggregation per transformation [1]
            • allows to define the logic needed for processing, transforming, and routing event data [1]
        • union
          • allows to connect two or more nodes and add events with shared fields (with the same name and data type) into one table [1]
            • fields that don't match are dropped and not included in the output [1]
        • expand
          • array transformation that allows to create a new row for each value within an array [1]
        • manage fields
          • allows to add, remove, change data type, or rename fields coming in from an input or another transformation [1]
        • temporal windowing functions 
          • enable to analyze data events within discrete time periods [1]
          • way to perform operations on the data contained in temporal windows [1]
            • e.g. aggregating, filtering, or transforming streaming events that occur within a specified time period [1]
            • allow analyzing streaming data that changes over time [1]
              • e.g. sensor readings, web-clicks, on-line transactions, etc.
              • provide great flexibility to keep an accurate record of events as they occur [1]
          • {type} tumbling windows
            • divides incoming events into fixed and nonoverlapping intervals based on arrival time [1]
          • {type} sliding windows 
            • divide incoming events into fixed and overlapping intervals based on time [1]
          • {type} session windows 
            • divides events into variable and nonoverlapping intervals based on gaps of inactivity [1]
          • {type} hopping windows
            • differ from tumbling windows in that they model scheduled, overlapping windows [1]
          • {type} snapshot windows 
            • group events that have the same timestamp [1]
            • unlike the other windowing functions, which must be named explicitly, they are applied by adding System.Timestamp() to the GROUP BY clause [1]
          • {parameter} window duration
            • the length of each window interval [1]
            • can be in seconds, minutes, hours, and even days [1]
          • {parameter} window offset
            • optional parameter that shifts the start and end of each window interval by a specified amount of time [1]
          • {concept} grouping key
            • one or more columns in your event data that you wish to group by [1]
          • {concept} aggregation function
            • one or more of the functions applied to each group of events in each window [1]
              • where the counts, sums, averages, min/max, and even custom functions become useful [1]
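To make the windowing semantics above more tangible, here is a minimal pure-Python sketch of a tumbling-window aggregation - fixed, nonoverlapping intervals keyed by arrival time, with a Sum per window; the event list and 10-second duration are made up for illustration:

```python
from collections import defaultdict

def tumbling_window_sum(events, window_seconds):
    # group (timestamp_seconds, value) events into fixed, nonoverlapping
    # intervals and sum the values per window
    windows = defaultdict(float)
    for ts, value in events:
        window_start = (ts // window_seconds) * window_seconds  # floor to window
        windows[window_start] += value
    return dict(windows)

events = [(0, 1.0), (5, 2.0), (12, 3.0), (19, 4.0), (25, 5.0)]
print(tumbling_window_sum(events, 10))
# windows of 10s: [0,10) -> 3.0, [10,20) -> 7.0, [20,30) -> 5.0
```

A sliding or hopping window would differ only in that an event may fall into several overlapping intervals, while a session window would close an interval after a gap of inactivity.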
    • see the event data flowing through the pipeline in real-time [1]
    • handles the scaling, reliability, and security of the event stream automatically [1]
      • no need to write any code or manage any infrastructure [1]
    • {feature} eventstream editing canvas
      • used to 
        • add and manage sources and destinations [1]
        • see the event data [1]
        • check the data insights [1]
        • view logs for each source or destination [1]
  • {feature} Apache Kafka endpoint on the Eventstream item
    • {benefit} enables users to connect and consume streaming events through the Kafka protocol [2]
      • applications using the protocol can send or receive streaming events with specific topics [2]
      • requires updating the connection settings to use the Kafka endpoint provided in the Eventstream [2]
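Updating a Kafka client to use the endpoint typically means pointing it at the bootstrap server and SASL credentials shown in the Eventstream item's endpoint details. A hedged sketch of the settings involved - all values below are placeholders, and the SASL PLAIN/$ConnectionString pattern is the one used by Event Hubs-compatible Kafka endpoints:

```python
# Placeholder connection settings for an Eventstream Kafka endpoint;
# the actual bootstrap server, topic name, and connection string are
# taken from the Eventstream item's endpoint details.
kafka_settings = {
    "bootstrap.servers": "<namespace>.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "PLAIN",
    "sasl.username": "$ConnectionString",
    "sasl.password": "<connection-string>",
    "topic": "<entity-name>",
}
print(kafka_settings["bootstrap.servers"])
```

Any Kafka client library can then be configured with this dictionary (adapted to the library's own option names).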
  • {feature} support runtime logs and data insights for the connector sources in Live View mode [3]
    • allows to examine detailed logs generated by the connector engines for the specific connector [3]
      • help with identifying failure causes or warnings [3]
      • ⇐ accessible in the bottom pane of an eventstream by selecting the relevant connector source node on the canvas in Live View mode [3]
  • {feature} support data insights for the connector sources in Live View mode [3]
  • {feature} integration of eventstreams with CI/CD tools
      • {benefit} developers can efficiently build and maintain eventstreams from end-to-end in a web-based environment, while ensuring source control and smooth versioning across projects [3]
  • {feature} REST APIs
    • allow automating and managing eventstreams programmatically
      • {benefit} simplify CI/CD workflows and make it easier to integrate eventstreams with external applications [3]
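As a sketch of what programmatic management looks like, the snippet below composes the URL for the eventstream items collection of a workspace; the endpoint shape follows the general Fabric items API pattern and should be verified against the official API reference, and the workspace id is a placeholder:

```python
def eventstreams_url(workspace_id: str) -> str:
    # compose the Fabric REST API URL for a workspace's eventstream items
    # (assumed endpoint shape; check the official API reference)
    base = "https://api.fabric.microsoft.com/v1"
    return f"{base}/workspaces/{workspace_id}/eventstreams"

url = eventstreams_url("00000000-0000-0000-0000-000000000000")
print(url)
# an actual call would add an "Authorization: Bearer <token>" header
```

A GET against that URL would list the eventstreams, while a POST with an item payload would create one.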
  • {recommendation} use the eventstreams feature with at least SKU F4 [2]
  • {limitation} maximum message size: 1 MB [2]
  • {limitation} maximum retention period of event data: 90 days [2]

References:
[1] Microsoft Learn (2024) Microsoft Fabric: Use real-time eventstreams in Microsoft Fabric [link]
[2] Microsoft Learn (2025) Microsoft Fabric: Fabric Eventstream - overview [link]
[3] Microsoft Learn (2024) Microsoft Fabric: What's new in Fabric event streams? [link]
[4] Microsoft Learn (2025) Real Time Intelligence L200 Pitch Deck [link]
[5] Microsoft Learn (2025) Use real-time eventstreams in Microsoft Fabric [link]

Resources:
[R1] Microsoft Learn (2024) Microsoft Fabric exercises [link]
[R2] Microsoft Fabric Updates Blog (2024) CI/CD – Git Integration and Deployment Pipeline [link]
[R3] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
API - Application Programming Interface
CDC - Change Data Capture
CI/CD  - Continuous Integration/Continuous Delivery
DB - database
IoT - Internet of Things
KQL - Kusto Query Language
RTI - Real-Time Intelligence
SaaS - Software-as-a-Service
SDK - Software Development Kit
SKU - Stock Keeping Unit

20 January 2025

🏭🗒️Microsoft Fabric: [Azure] Service Principals (SPN) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 20-Jan-2025

[Azure] Service Principal (SPN)  

  • {def} a non-human, application-based security identity used by applications or automation tools to access specific Azure resources [1]
    • can be assigned precise permissions, making them perfect for automated processes or background services
      • allows to minimize the risks of human error and identity-based vulnerabilities
      • supported in datasets, Gen1/Gen2 dataflows, datamarts [2]
      • authentication type 
        • supported only by [2]
          • Azure Data Lake Storage
          • Azure Data Lake Storage Gen2
          • Azure Blob Storage
          • Azure Synapse Analytics
          • Azure SQL Database
          • Dataverse
          • SharePoint online
        • doesn’t support
          • SQL data source with Direct Query in datasets [2]
  • when registering a new application in Microsoft Entra ID, a SPN is automatically created for the app registration [4]
    • the access to resources is restricted by the roles assigned to the SPN
      • ⇒ gives control over which resources can be accessed and at which level [4]
    • {recommendation} use SPN with automated tools [4]
      • rather than allowing them to sign in with a user identity  [4]
    • {prerequisite} an active Microsoft Entra user account with sufficient permissions to 
      • register an application with the tenant [4]
      • assign to the application a role in the Azure subscription [4]
      • requires the Application.ReadWrite.All permission [4]
  • extended to support Fabric Data Warehouses [1]
    • {benefit} automation-friendly API Access
      • allows to create, update, read, and delete Warehouse items via Fabric REST APIs using service principals [1]
      • enables to automate repetitive tasks without relying on user credentials [1]
        • e.g. provisioning or managing warehouses
        • increases security by limiting human error
      • the warehouses thus created are displayed in the workspace list view in the Fabric UI, with the SPN as owner [1]
      • applicable to users with administrator, member, or contributor workspace role [3]
      • minimizes risk
        • the warehouses created with delegated account or fixed identity (owner’s identity) will stop working when the owner leaves the organization [1]
          • Fabric requires the user to log in every 30 days to ensure a valid token is provided for security reasons [1]
    • {benefit} seamless integration with Client Tools: 
      • tools like SSMS can connect to the Fabric DWH using SPN [1]
      • SPN provides secure access for developers to 
        • run COPY INTO
          • with and without firewall enabled storage [1]
        • run any T-SQL query programmatically on a schedule with ADF pipelines [1]
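For the client-tool scenario, the connection typically uses the Active Directory service principal authentication mode of Microsoft's SQL drivers. A sketch of composing such an ODBC connection string - the server name, ids, and secret are placeholders, and the driver name may differ by installation:

```python
def spn_connection_string(server: str, database: str,
                          client_id: str, client_secret: str) -> str:
    # build an ODBC connection string for service principal authentication
    # (placeholder values; verify driver name and options locally)
    return (
        "Driver={ODBC Driver 18 for SQL Server};"
        f"Server={server};Database={database};"
        f"UID={client_id};PWD={client_secret};"
        "Authentication=ActiveDirectoryServicePrincipal;"
        "Encrypt=yes;"
    )

conn_str = spn_connection_string(
    "<workspace>.datawarehouse.fabric.microsoft.com",
    "MyWarehouse", "<client-id>", "<client-secret>")
print(conn_str)
```

The same string can be passed to pyodbc or embedded in a scheduled job, so T-SQL queries run under the SPN rather than a user identity.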
    • {benefit} granular access control
      • Warehouses can be shared with an SPN through the Fabric portal [1]
        • once shared, administrators can use T-SQL commands to assign specific permissions to SPN [1]
          • allows to control precisely which data and operations an SPN has access to  [1]
            • GRANT SELECT ON <table name> TO <Service principal name>  
      • warehouses' ownership can be changed from an SPN to user, and vice-versa [3]
    • {benefit} improved DevOps and CI/CD Integration
      • SPN can be used to automate the deployment and management of DWH resources [1]
        • ensures faster, more reliable deployment processes while maintaining strong security postures [1]
    • {limitation} default semantic models are not supported for SPN created warehouses [3]
      • ⇒ features such as listing tables in dataset view, creating report from the default dataset don’t work [3]
    • {limitation} SPN for SQL analytics endpoints is not currently supported
    • {limitation} SPNs are currently not supported for COPY INTO error files [3]
      • ⇐ Entra ID credentials are not supported as well [3]
    • {limitation} SPNs are not supported for GIT APIs. SPN support exists only for Deployment pipeline APIs [3]
    • monitoring tools
      • [DMV] sys.dm_exec_sessions.login_name column [3] 
      • [Query Insights] queryinsights.exec_requests_history.login_name [3]
      • Query activity
        • submitter column in Fabric query activity [3]
      • Capacity metrics app: 
        • compute usage for warehouse operations performed by SPN appears as the Client ID under the User column in Background operations drill through table [3]

References:
[1] Microsoft Fabric Updates Blog (2024) Service principal support for Fabric Data Warehouse [link]
[2] Microsoft Fabric Learn (2024) Service principal support in Data Factory [link]
[3] Microsoft Fabric Learn (2024) Service principal in Fabric Data Warehouse [link]
[4] Microsoft Fabric Learn (2024) Register a Microsoft Entra app and create a service principal [link]
[5] Microsoft Fabric Updates Blog (2024) Announcing Service Principal support for Fabric APIs [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
ADF - Azure Data Factory
API - Application Programming Interface
CI/CD - Continuous Integration/Continuous Deployment
DMV - Dynamic Management View
DWH - Data Warehouse
SPN - service principal
SSMS - SQL Server Management Studio

18 April 2024

🏭Data Warehousing: Microsoft Fabric (Part II: Data(base) Mirroring) [New feature]


Microsoft recently announced [4] the preview of a new Fabric feature called Mirroring, a low-cost, low-latency, fully managed service that allows replicating data from various systems into OneLake [1]. Currently only Azure SQL Database, Azure Cosmos DB, and Snowflake are supported, though more database vendors will probably be targeted soon. 

For Microsoft Fabric's data engineers, data scientists and data warehouse professionals this feature is of huge importance, because they no longer need to care about making the data available in Microsoft Fabric, which otherwise involves a considerable amount of work. 

Usually, at least for flexibility, transparency, performance and standardization, data professionals prefer to extract the data 1:1 from the source systems into a landing zone in the data warehouse or data/delta lake, from where the data are further processed as needed. One data pipeline is thus built for every table in scope, which is sometimes a 10–15-minute effort per table when the process is standardized, though in some cases the effort is much higher if troubleshooting (e.g. data type incompatibility or support) or further logic changes are involved. Maintaining such data pipelines can prove to be costly over time, especially when periodic changes are needed. 

Microsoft lists other downsides of the ETL approach - restricted access to data changes, friction between people, processes, and technology, respectively the effort needed to create the pipelines and the time needed for importing the data [1]. There's some truth in each of these points, though everything is relative. For big tables, refreshing all the data overnight can prove to be time-consuming and costly, especially when the data don't lie within the same region, respectively data center. Unless the data can be refreshed incrementally, the night runs can extend into the day, with all the implications that derive from this - not having actual data, which decreases the trust in reports, etc. There are tricks to speed up the process, though there are limits to what can be done. 

With mirroring, the replication of data between the data sources and the analytics platform is handled in the background: after an initial replication, the changes in the source systems are reflected with near real-time latency into OneLake, which is amazing! This allows building near real-time reporting solutions which can help the business in many ways - reviewing (and correcting in the data source) records en masse, a faster overview of what's happening in the organization, a faster basis for decision-making, etc. Moreover, the mechanism is fully managed by Microsoft, which is thus responsible for making sure that the data are correctly synchronized. From this perspective alone, probably 10-20% of the effort of building an analytics solution is reduced.

Mirroring in Microsoft Fabric
Mirroring in Microsoft Fabric (adapted after [2])

According to the documentation, one can replicate a whole database or choose individual regular tables (currently views aren't supported [3]), stop, restart, or remove a table from a mirroring. Moreover, through sharing, users can grant to other users or groups of users access to a mirrored database without giving access to the workspace and the rest of its items [1]. 

Data professionals and citizen users can then write cross-database queries against the mirrored databases, warehouses, and the SQL analytics endpoints of lakehouses, combining data from all these sources into a single T-SQL query. This opens a lot of opportunities, especially concerning the creation of an enterprise semantic model, which should be differentiated from the semantic model created by default by the mirroring together with the SQL analytics endpoint.
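Cross-database querying relies on three-part names. As a small sketch, the helper below composes such a T-SQL join between a mirrored database and a lakehouse SQL analytics endpoint; all object names are hypothetical:

```python
def cross_db_query(left: str, right: str, key: str) -> str:
    # compose a T-SQL join across two databases via three-part names
    # (hypothetical object names, for illustration only)
    return (
        "SELECT o.OrderId, c.CustomerName\n"
        f"FROM {left} o\n"
        f"JOIN {right} c ON o.{key} = c.{key}"
    )

query = cross_db_query("MirroredSales.dbo.Orders",
                       "SalesLakehouse.dbo.Customers", "CustomerId")
print(query)
```

Run against the SQL analytics endpoint, such a query combines mirrored and lakehouse data in a single statement.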

Considering that the data is replicated into delta tables, one can take advantage of all the capabilities available with such tables - data versioning, time travel, interoperability and/or performance, respectively direct consumption in Power BI.


References:
[1] Microsoft Learn - Microsoft Fabric (2024) What is Mirroring in Fabric? (link)
[2] Microsoft Learn - Microsoft Fabric (2024) Mirroring Azure SQL Database [Preview] (link)
[3] Microsoft Learn - Microsoft Fabric (2024) Frequently asked questions for Mirroring Azure SQL Database in Microsoft Fabric [Preview] (link)
[4] Microsoft Fabric Updates Blog (2024) Announcing the Public Preview of Mirroring in Microsoft Fabric, by Charles Webb (link)

06 February 2024

💎🏭SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Table Metadata I (General Information)

In a previous post I've created a delta table called Assets. For troubleshooting and maintenance tasks it would be useful to retrieve a table's metadata - properties, definition, etc. There are different ways to extract the metadata, depending on the layer and method used.

INFORMATION_SCHEMA in SQL Endpoint

Listing all the delta tables available in a Lakehouse can be done using the INFORMATION_SCHEMA via the SQL Endpoint:

-- retrieve the list of tables
SELECT * 
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA  

Output:
TABLE_CATALOG TABLE_SCHEMA TABLE_NAME TABLE_TYPE
Testing dbo city BASE TABLE
Testing dbo assets BASE TABLE

The same schema can be used to list columns' definition:
 
-- retrieve column metadata
SELECT TABLE_CATALOG
, TABLE_SCHEMA
, TABLE_NAME
, COLUMN_NAME
, ORDINAL_POSITION
, DATA_TYPE
, CHARACTER_MAXIMUM_LENGTH
, NUMERIC_PRECISION
, NUMERIC_SCALE
, DATETIME_PRECISION
, CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'assets'
ORDER BY ORDINAL_POSITION

Note:
Given that the INFORMATION_SCHEMA is an ANSI-standard schema for providing metadata about a database's objects (including views and schemata), this is probably the best way to retrieve general metadata such as the above (see [3]). More information about the SQL Endpoint can be obtained by querying the SQL Server metadata directly.

Table's Properties

The Delta Lake documentation reveals that a table's properties can be retrieved via the DESCRIBE command, which can be used in a notebook's cell (see [1] for attributes' definition):

-- describe table's details
DESCRIBE DETAIL Assets;

Output (transposed):
Attribute Value
format delta
id 83e87b3c-28f4-417f-b4f5-842f6ba6f26d
name spark_catalog.testing.assets
description NULL
location abfss://[…]@onelake.dfs.fabric.microsoft.com/[…]/Tables/assets
createdAt 2024-02-05T16:18:54Z
lastModified 2024-02-05T16:29:52Z
partitionColumns
numFiles 1
sizeInBytes 3879
properties [object Object]
minReaderVersion 1
minWriterVersion 2
tableFeatures appendOnly,invariants

One can also export table metadata from sys.tables via the SQL Endpoint, though this returns SQL Server-based metadata:

-- table metadata
SELECT t.object_id
, schema_name(t.schema_id) schema_name
, t.name table_name
, t.type_desc
, t.create_date 
, t.modify_date
, t.durability_desc
, t.temporal_type_desc
, t.data_retention_period_unit_desc
FROM sys.tables t
WHERE name = 'assets'

Output (transposed):
Attribute Value
object_id 1264723558
schema_name dbo
table_name assets
type_desc USER_TABLE
create_date 2024-02-05 16:19:04.810
modify_date 2024-02-05 16:19:04.810
durability_desc SCHEMA_AND_DATA
temporal_type_desc NON_TEMPORAL_TABLE
data_retention_period_unit_desc INFINITE

Note:
1) There thus seem to be two different repositories for storing the metadata, which is reflected also in the different timestamps.

Table's Definition 

A table's definition can be easily exported via the SQL Endpoint in SQL Server Management Studio. Multiple tables' definition can be exported as well via the Object explorer details within the same IDE. 

In Spark SQL you can use the DESCRIBE TABLE command:

-- show table's definition
DESCRIBE TABLE Assets;

Output:
col_name data_type comment
Id int NULL
CreationDate timestamp NULL
Vendor string NULL
Asset string NULL
Model string NULL
Owner string NULL
Tag string NULL
Quantity decimal(13,2) NULL

Alternatively, you can use the EXTENDED keyword with the previous command to show further table information:
 
-- show table's definition (extended)
DESCRIBE TABLE EXTENDED Assets;

Output (only the records that appear under '# Detailed Table Information'):
Label Value
Name spark_catalog.testing.assets
Type MANAGED
Location abfss://[...]@onelake.dfs.fabric.microsoft.com/[...]/Tables/assets
Provider delta
Owner trusted-service-user
Table Properties [delta.minReaderVersion=1,delta.minWriterVersion=2]

Note that the table properties can be listed individually via the SHOW TBLPROPERTIES command:
 
--show table properties
SHOW TBLPROPERTIES Assets;

A column's definition can be retrieved via a DESCRIBE command following the syntax:

-- retrieve column metadata
DESCRIBE Assets Assets.CreationDate;

Output:
info_name info_value
col_name CreationDate
data_type timestamp
comment NULL

In PySpark it's trickier to retrieve a table's definition (code adapted after Dennes Torres' presentation at Toboggan Winter edition 2024):

%%pyspark
import pyarrow.dataset as ds
import os
import re

def show_metadata(file_path, table_name):
    # print a delta table's metadata (key/value pairs stored in the Parquet schema)
    print(f"{table_name}:")
    schema_properties = ds.dataset(file_path).schema.metadata
    if schema_properties:
        for key, value in schema_properties.items():
            print(f"{key.decode('utf-8')}:{value.decode('utf-8')}")
    else:
        print("No properties available!")

# main code
path = "/lakehouse/default/Tables"
tables = [f for f in os.listdir(path) if re.match('asset', f)]

for table in tables:
    show_metadata(f"{path}/{table}", table)

Output:
assets:
org.apache.spark.version:3.4.1
org.apache.spark.sql.parquet.row.metadata:{"type":"struct","fields":[
{"name":"Id","type":"integer","nullable":true,"metadata":{}},
{"name":"CreationDate","type":"timestamp","nullable":true,"metadata":{}},
{"name":"Vendor","type":"string","nullable":true,"metadata":{}},
{"name":"Asset","type":"string","nullable":true,"metadata":{}},
{"name":"Model","type":"string","nullable":true,"metadata":{}},
{"name":"Owner","type":"string","nullable":true,"metadata":{}},
{"name":"Tag","type":"string","nullable":true,"metadata":{}},
{"name":"Quantity","type":"decimal(13,2)","nullable":true,"metadata":{}}]}
com.microsoft.parquet.vorder.enabled:true
com.microsoft.parquet.vorder.level:9

Note:
One can list the metadata for all tables by removing the filter on the 'asset' table:

tables = os.listdir(path)

Notes:
The above views from the INFORMATION_SCHEMA, respectively the sys schema, are available in SQL databases in Microsoft Fabric as well.

Happy coding!

References:
[1] Delta Lake (2023) Table utility commands (link)
[2] Databricks (2023) DESCRIBE TABLE (link)
[3] Microsoft Learn (2023) System Information Schema Views (link)

05 February 2024

💎🏭SQL Reloaded: Microsoft Fabric's Delta Tables in Action - CRUD Operations II (Looking at the Files)

Delta Table's Files

In a previous post, I tested the CRUD operations against a Microsoft Fabric delta table. To view the files behind it, you'll need to navigate to the Lakehouse, right-click on the Assets table and pick 'View files' from the floating menu:

Assets' delta table files in Lakehouse

One can navigate within the folder structure, look at the transaction log files under _delta_log folder, though there seems to be no way to download the respective files. For this purpose, alternatively, you can use the Azure Storage Explorer (downloadable here). After installing the Explorer use the 'ADLSGen2 container or directory' option to connect to the folder where the Assets table was created (see [1]).

Assets' delta table files in Azure Storage Explorer

And, here's the content of the _delta_log from where the files can be downloaded:

Assets' log files in Azure Storage Explorer

The _temporary folder is empty.

It seems that for each cell with a DML/DDL command a log file was created. Therefore, it should make no difference whether you run the cells individually or as a whole, respectively whether the statements are included in a single cell. 
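Since each commit file in _delta_log is a sequence of JSON actions, one per line, a downloaded file can be inspected with a few lines of Python; the sketch below parses an abridged in-memory sample instead of a real file:

```python
import json

def parse_commit(text: str):
    # a _delta_log commit file holds one JSON action per line
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# abridged in-memory sample standing in for a downloaded commit file:
sample = ('{"commitInfo":{"operation":"CREATE TABLE"}}\n'
          '{"protocol":{"minReaderVersion":1}}')
actions = parse_commit(sample)
print([next(iter(a)) for a in actions])  # the action type of each line
```

The same approach works on the actual files shown below, whose top-level keys (commitInfo, protocol, metaData, add, remove) identify the action recorded.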

The first file holds the content generated when creating the table:

{"commitInfo":{"timestamp":1707006399711,"operation":"CREATE TABLE","operationParameters":{"isManaged":"true","description":null,"partitionBy":"[]","properties":"{}"},"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"c3e6f24e-fa78-4941-a29d-28aac3a25926"}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"a9a45d67-bcf9-4bef-8f48-c6b8f7a64f58","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[
{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"CreationDate\",\"type\":\"timestamp\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Vendor\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Asset\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Model\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Owner\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Tag\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1707006399640}}

The second file holds the content generated when inserting the first line in the table:

{"commitInfo":{"timestamp":1707006405806,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputRows":"1","numOutputBytes":"3713"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"ac51fa78-8384-4e53-bb02-40d65632a218"}}
{"add":{"path":"part-00000-b50df7ae-66d8-41fd-a54d-6238e053d269-c000.snappy.parquet","partitionValues":{},"size":3713,"modificationTime":1707006405683,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":
{\"Id\":1,\"CreationDate\":\"2024-03-01T00:00:00.000Z\",\"Vendor\":\"IBM\",\"Asset\":\"Laptop 1\",\"Model\":\"Model 1\",\"Owner\":\"Owner 1\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":1,\"CreationDate\":\"2024-03-01T00:00:00.000Z\",\"Vendor\":\"IBM\",\"Asset\":\"Laptop 1\",\"Model\":\"Model 1\",\"Owner\":\"Owner 1\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}

The third file holds the content generated when inserting multiple lines, while the fourth file contains the content for the UPDATE statement:

{"commitInfo":{"timestamp":1707006417538,"operation":"UPDATE","operationParameters":{"predicate":"[\"(Id#1952 = 6)\"]"},"readVersion":2,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numRemovedBytes":"3870","numCopiedRows":"4","numAddedChangeFiles":"0","executionTimeMs":"2297","scanTimeMs":"1613","numAddedFiles":"1","numUpdatedRows":"1","numAddedBytes":"3870","rewriteTimeMs":"681"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"70f0f718-331a-48fc-8419-9f80599ca889"}}
{"remove":{"path":"part-00000-0bd7b054-1bde-44cd-a852-80332d98ae8b-c000.snappy.parquet","deletionTimestamp":1707006417533,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":3870,"tags":{"VORDER":"true"}}}
{"add":{"path":"part-00000-b5e4e62f-c938-4db9-a83a-d85850f310c0-c000.snappy.parquet","partitionValues":{},"size":3870,"modificationTime":1707006417436,"dataChange":true,"stats":"{\"numRecords\":5,\"minValues\":{\"Id\":2,\"CreationDate\":\"2024-01-01T00:00:00.000Z\",\"Vendor\":\"Dell\",\"Asset\":\"Laptop 2\",\"Model\":\"Model 2\",\"Owner\":\"Owner 2\",\"Tag\":\"DD0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":6,\"CreationDate\":\"2024-02-01T00:00:00.000Z\",\"Vendor\":\"Microsoft\",\"Asset\":\"Laptop 4\",\"Model\":\"Model 4\",\"Owner\":\"Owner 4\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}

The fifth file holds the content related to the DELETE statement:

{"commitInfo":{"timestamp":1707006422147,"operation":"DELETE","operationParameters":{"predicate":"[\"(Id#2665 = 2)\"]"},"readVersion":3,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numRemovedBytes":"3870","numCopiedRows":"4","numAddedChangeFiles":"0","executionTimeMs":"1425","numDeletedRows":"1","scanTimeMs":"908","numAddedFiles":"1","numAddedBytes":"3837","rewriteTimeMs":"517"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"1f61449f-6c4a-40ce-80ed-6ff1a9a834e9"}}
{"remove":{"path":"part-00000-b5e4e62f-c938-4db9-a83a-d85850f310c0-c000.snappy.parquet","deletionTimestamp":1707006422134,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":3870,"tags":{"VORDER":"true"}}}
{"add":{"path":"part-00000-bee2e3a5-7def-4466-94ff-ccd86aa37b8c-c000.snappy.parquet","partitionValues":{},"size":3837,"modificationTime":1707006422055,"dataChange":true,"stats":"{\"numRecords\":4,\"minValues\":{\"Id\":3,\"CreationDate\":\"2024-01-01T00:00:00.000Z\",\"Vendor\":\"Dell\",\"Asset\":\"Laptop 3\",\"Model\":\"Model 3\",\"Owner\":\"Owner 2\",\"Tag\":\"DD0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":6,\"CreationDate\":\"2024-02-01T00:00:00.000Z\",\"Vendor\":\"Microsoft\",\"Asset\":\"Laptop 4\",\"Model\":\"Model 4\",\"Owner\":\"Owner 4\",\"Tag\":\"WX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}

Before looking at each file, note the engineInfo attribute, which records the Spark and Delta Lake versions: "Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8". It's important to check these versions against the documentation when troubleshooting!

In each file, besides the "timestamp", observe the values of "operation", "operationParameters", "operationMetrics", "dataChange" and "VORDER".

The operation attribute indicates the DML or DDL statement that was run, while operationParameters and operationMetrics capture its predicate and its impact (the number of rows and files added or removed). dataChange is true for commits that change the data, though, as the OPTIMIZE entry below shows, it is false for commits that only rewrite files. The VORDER tag indicates that the Parquet files were written with V-Order optimization.
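These attributes can also be read programmatically. A small Python sketch that extracts them from a commitInfo entry (a shortened version of the DELETE entry above):

```python
import json

# shortened commitInfo action from the DELETE commit above
line = ('{"commitInfo":{"timestamp":1707006422147,"operation":"DELETE",'
        '"operationParameters":{"predicate":"[\\"(Id#2665 = 2)\\"]"},'
        '"operationMetrics":{"numRemovedFiles":"1","numDeletedRows":"1","numAddedFiles":"1"},'
        '"tags":{"VORDER":"true"}}}')

info = json.loads(line)["commitInfo"]
print(info["operation"])                           # DELETE
print(info["operationParameters"]["predicate"])    # the filter that was applied
print(info["operationMetrics"]["numDeletedRows"])  # 1
print(info["tags"]["VORDER"])                      # true
```

Note that the metrics are stored as strings, so they need to be converted before any arithmetic.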

The DESCRIBE HISTORY Command

You can retrieve a history of the actions performed on a table (by default for the past 30 days) via the DESCRIBE HISTORY command, which can be run in a notebook cell:

-- describe table's history (30 days of data)
DESCRIBE HISTORY Assets;

The attributes' definitions can be found in [3] (see the 'History schema' section). For big tables, it might be a good idea to limit the command's output to a few records via the LIMIT keyword:

-- describe table's history (30 days of data, last 5 records)
DESCRIBE HISTORY Assets LIMIT 5;

Table Maintenance

Running multiple change commands on a delta table results in many such small files, which impacts read access to both data and metadata. The files can be compacted via table maintenance (see also the notes on delta tables). The OPTIMIZE command consolidates multiple small Parquet files into larger ones:

-- compacting the files
OPTIMIZE Assets;

Alternatively, you can do the same from the Lakehouse: navigate to the Assets table, right-click on it, select Maintenance from the floating menu, and run the job with the provided settings (see also [4]):

Run maintenance commands for a table

Running the command generated a new file:

{"commitInfo":{"timestamp":1707150591870,"operation":"OPTIMIZE","operationParameters":{"predicate":"[]","zOrderBy":"[]","auto":false},"readVersion":4,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"2","numRemovedBytes":"7550","p25FileSize":"3879","numDeletionVectorsRemoved":"0","minFileSize":"3879","numAddedFiles":"1","maxFileSize":"3879","p75FileSize":"3879","p50FileSize":"3879","numAddedBytes":"3879"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"b6466509-9f1a-467d-ad99-a0e95ed694ec"}}
{"add":{"path":"part-00000-453e9218-fe59-4138-a7dc-0e71c94c07d2-c000.snappy.parquet","partitionValues":{},"size":3879,"modificationTime":1707150591746,"dataChange":false,"stats":"{\"numRecords\":5,\"minValues\":{\"Id\":1,\"CreationDate\":\"2024-01-01T00:00:00.000Z\",\"Vendor\":\"Dell\",\"Asset\":\"Laptop 1\",\"Model\":\"Model 1\",\"Owner\":\"Owner 1\",\"Tag\":\"DD0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":6,\"CreationDate\":\"2024-03-01T00:00:00.000Z\",\"Vendor\":\"Microsoft\",\"Asset\":\"Laptop 4\",\"Model\":\"Model 4\",\"Owner\":\"Owner 4\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}
{"remove":{"path":"part-00000-cbaaf2dc-b8d6-44bc-a4e1-97db0ba96074-c000.snappy.parquet","deletionTimestamp":1707150580358,"dataChange":false,"extendedFileMetadata":true,"partitionValues":{},"size":3713,"tags":{"VORDER":"true"}}}
{"remove":{"path":"part-00000-6714ec80-1bd8-41f3-b8b7-5197025ec08f-c000.snappy.parquet","deletionTimestamp":1707150580358,"dataChange":false,"extendedFileMetadata":true,"partitionValues":{},"size":3837,"tags":{"VORDER":"true"}}}
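The current state of the table is obtained by replaying the log: each add action registers a Parquet file as active, each remove retires it. A simplified Python sketch over the sequence of actions seen above (file names abbreviated, and assuming the log lines are already collected in commit order):

```python
import json

# one (shortened) log action per line, in commit order; paths are abbreviated
log_lines = [
    '{"add":{"path":"part-00000-b50d.parquet","dataChange":true}}',     # first INSERT
    '{"add":{"path":"part-00000-0bd7.parquet","dataChange":true}}',     # multi-row INSERT
    '{"remove":{"path":"part-00000-0bd7.parquet","dataChange":true}}',  # UPDATE rewrites...
    '{"add":{"path":"part-00000-b5e4.parquet","dataChange":true}}',     # ...the file
    '{"remove":{"path":"part-00000-b5e4.parquet","dataChange":true}}',  # DELETE rewrites...
    '{"add":{"path":"part-00000-bee2.parquet","dataChange":true}}',     # ...the file again
    '{"remove":{"path":"part-00000-b50d.parquet","dataChange":false}}', # OPTIMIZE compacts
    '{"remove":{"path":"part-00000-bee2.parquet","dataChange":false}}', # the two small files
    '{"add":{"path":"part-00000-453e.parquet","dataChange":false}}',    # into a single one
]

active = set()
for line in log_lines:
    action = json.loads(line)
    if "add" in action:
        active.add(action["add"]["path"])
    elif "remove" in action:
        active.discard(action["remove"]["path"])

print(sorted(active))  # only the compacted file remains active
```

This also makes visible why compaction matters: after only a handful of statements, three files were written and two retired, yet a reader without the log would have no way of knowing which Parquet files still belong to the table.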

Happy coding!

Resources:
[1] Microsoft Azure (2023) Azure Storage Explorer (link)
[2] Sam Debruyn (2023) Exploring OneLake with Microsoft Azure Storage Explorer (link)
[3] Delta Lake (2023) Table utility commands (link)
[4] Microsoft Learn (2023) Use table maintenance feature to manage delta tables in Fabric (link)
