
26 February 2025

💎🤖SQL Reloaded: Copilot Stories (Part II: Create a Schema for Customer Addresses)

Since the first post on a related topic, Copilot has become available in Microsoft 365, which opens the door for prompt engineering experiments. A prompt is a "natural language instruction that tells a large language model (LLM) to perform a task" [1], a process also known as instruction tuning. Prompt engineering is thus "the process of creating and refining the prompt used by the model" [1].

For example, one can use prompt engineering to learn how to perform certain tasks in a programming language like SQL: defining a table, writing queries, creating other database objects, etc. In a first experiment I created a set of prompts for generating the objects needed to store data and perform CRUD (Create, Read, Update, Delete) operations against it. The experiment focused on Customer Addresses, Countries and Regions. Let's look at the prompts and their outputs. The prompts below depict the attempts in the order in which they were considered, so feel free to experiment!

Before Copilot, I used Excel macros or stored procedures to generate similar code, which allowed me to considerably reduce the development time.

Prompt:
Give me the definition of a SQL Server stored procedure for updating the Customers from the dbo schema based on input parameters. 

Output:

CREATE PROCEDURE UpdateCustomer
    @CustomerID INT,
    @CustomerName NVARCHAR(100),
    @ContactName NVARCHAR(100),
    @Country NVARCHAR(50)
AS
BEGIN
    UPDATE dbo.Customers
    SET 
        CustomerName = @CustomerName,
        ContactName = @ContactName,
        Country = @Country
    WHERE 
        CustomerID = @CustomerID;
END;
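
The generated procedure can be tested with a call along these lines (hypothetical values; it assumes a customer with ID 1 exists):

-- hypothetical test call for the generated procedure
EXEC UpdateCustomer
    @CustomerID = 1,
    @CustomerName = N'Contoso Ltd.',
    @ContactName = N'Jane Doe',
    @Country = N'Germany';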

Prompt:
Give me the definition of a SQL Server stored procedure for inserting Customers in the dbo schema based on input parameters.

Output:

CREATE PROCEDURE InsertCustomer
    @CustomerName NVARCHAR(100),
    @ContactName NVARCHAR(100),
    @Country NVARCHAR(50)
AS
BEGIN
    INSERT INTO dbo.Customers (CustomerName, ContactName, Country)
    VALUES (@CustomerName, @ContactName, @Country);
END;

Prompt:
Give me the definition of a SQL Server stored procedure for deleting a Customers from the dbo schema based on Customer Id.

Output:

CREATE PROCEDURE DeleteCustomer
    @CustomerID INT
AS
BEGIN
    DELETE FROM dbo.Customers
    WHERE CustomerID = @CustomerID;
END;

Prompt:
Give me the definition of the SQL Server table used for the above examples.

Output:

CREATE TABLE dbo.Customers (
    CustomerID INT IDENTITY(1,1) PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL,
    ContactName NVARCHAR(100),
    Country NVARCHAR(50)
);

Prompt:
Extend the definition of the previous table with Address information. 

Output:

CREATE TABLE dbo.Customers (
    CustomerID INT IDENTITY(1,1) PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL,
    ContactName NVARCHAR(100),
    Country NVARCHAR(50),
    AddressLine1 NVARCHAR(200),
    AddressLine2 NVARCHAR(200),
    City NVARCHAR(100),
    State NVARCHAR(100),
    PostalCode NVARCHAR(20)
);

Prompt:
Remove the Address information from the above table into a separate Addresses table with the Customer Id as foreign key.

Output:

-- Modified Customers Table:
CREATE TABLE dbo.Customers (
    CustomerID INT IDENTITY(1,1) PRIMARY KEY,
    CustomerName NVARCHAR(100) NOT NULL,
    ContactName NVARCHAR(100),
    Country NVARCHAR(50)
);

-- New Addresses Table:
CREATE TABLE dbo.Addresses (
    AddressID INT IDENTITY(1,1) PRIMARY KEY,
    CustomerID INT,
    AddressLine1 NVARCHAR(200),
    AddressLine2 NVARCHAR(200),
    City NVARCHAR(100),
    State NVARCHAR(100),
    PostalCode NVARCHAR(20),
    FOREIGN KEY (CustomerID) REFERENCES dbo.Customers(CustomerID)
);

Prompt:
Create a Countries table that can be referenced from the Addresses table with Country ISO codes.

Output:

--Create the Countries Table:
CREATE TABLE dbo.Countries (
    CountryCode CHAR(2) PRIMARY KEY, -- ISO 3166-1 alpha-2 code
    CountryName NVARCHAR(100) NOT NULL
);

--Modify the Addresses Table:
CREATE TABLE dbo.Addresses (
    AddressID INT IDENTITY(1,1) PRIMARY KEY,
    CustomerID INT,
    AddressLine1 NVARCHAR(200),
    AddressLine2 NVARCHAR(200),
    City NVARCHAR(100),
    State NVARCHAR(100),
    PostalCode NVARCHAR(20),
    CountryCode CHAR(2),
    FOREIGN KEY (CustomerID) REFERENCES dbo.Customers(CustomerID),
    FOREIGN KEY (CountryCode) REFERENCES dbo.Countries(CountryCode)
);

Prompt:
Extend the Countries table with ISO 3 characters code and ISO Number.

Output:

CREATE TABLE dbo.Countries (
    CountryCode CHAR(2) PRIMARY KEY, -- ISO 3166-1 alpha-2 code
    CountryName NVARCHAR(100) NOT NULL,
    ISO3Code CHAR(3), -- ISO 3166-1 alpha-3 code
    ISONumericCode INT -- ISO 3166-1 numeric code
);

Prompt:
Create a Regions table with Region code and description that allows to group the countries in territories. A country can be assigned to multiple regions.

Output:

-- Create the Regions Table:
CREATE TABLE dbo.Regions (
    RegionCode CHAR(3) PRIMARY KEY,
    RegionDescription NVARCHAR(100) NOT NULL
);

-- Create the Countries Table (including the previous columns):
CREATE TABLE dbo.Countries (
    CountryCode CHAR(2) PRIMARY KEY, -- ISO 3166-1 alpha-2 code
    CountryName NVARCHAR(100) NOT NULL,
    ISO3Code CHAR(3), -- ISO 3166-1 alpha-3 code
    ISONumericCode INT -- ISO 3166-1 numeric code
);

-- Create the CountryRegions Junction Table:
CREATE TABLE dbo.CountryRegions (
    CountryCode CHAR(2),
    RegionCode CHAR(3),
    PRIMARY KEY (CountryCode, RegionCode),
    FOREIGN KEY (CountryCode) REFERENCES dbo.Countries(CountryCode),
    FOREIGN KEY (RegionCode) REFERENCES dbo.Regions(RegionCode)
);

Prompt:
Create a view based on Customers that includes all the above tables. 

Output:

CREATE VIEW CustomerDetails AS
SELECT 
    c.CustomerID,
    c.CustomerName,
    c.ContactName,
    a.AddressLine1,
    a.AddressLine2,
    a.City,
    a.State,
    a.PostalCode,
    co.CountryCode,
    co.CountryName,
    co.ISO3Code,
    co.ISONumericCode,
    r.RegionCode,
    r.RegionDescription
FROM 
    dbo.Customers c
LEFT JOIN 
    dbo.Addresses a ON c.CustomerID = a.CustomerID
LEFT JOIN 
    dbo.Countries co ON a.CountryCode = co.CountryCode
LEFT JOIN 
    dbo.CountryRegions cr ON co.CountryCode = cr.CountryCode
LEFT JOIN 
    dbo.Regions r ON cr.RegionCode = r.RegionCode;

Prompt:
In the above view use 3 letter encoding for table's name. 

Output:

CREATE VIEW CustomerDetails AS
SELECT 
    cus.CustomerID,
    cus.CustomerName,
    cus.ContactName,
    add.AddressLine1,
    add.AddressLine2,
    add.City,
    add.State,
    add.PostalCode,
    cou.CountryCode,
    cou.CountryName,
    cou.ISO3Code,
    cou.ISONumericCode,
    reg.RegionCode,
    reg.RegionDescription
FROM 
    dbo.Customers cus
LEFT JOIN 
    dbo.Addresses add ON cus.CustomerID = add.CustomerID
LEFT JOIN 
    dbo.Countries cou ON add.CountryCode = cou.CountryCode
LEFT JOIN 
    dbo.CountryRegions cr ON cou.CountryCode = cr.CountryCode
LEFT JOIN 
    dbo.Regions reg ON cr.RegionCode = reg.RegionCode;

Unfortunately, the definition uses the reserved word "add" as the alias for the Addresses table. Hence the following prompt:
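
For reference, such an alias would only be accepted if the reserved word were escaped, e.g. with square brackets; renaming the alias, as requested below, is the cleaner fix (a minimal illustration, not part of Copilot's output):

-- escaping the reserved word works, though renaming the alias is cleaner
SELECT [add].AddressLine1, [add].City
FROM dbo.Addresses [add];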

Prompt:
Modify the above view to use 3-letter aliases that aren't as reserved words. 

Output:

CREATE VIEW dbo.CustomerDetails AS
SELECT 
    cus.CustomerID, -- Customer ID
    cus.CustomerName, -- Customer name
    cus.ContactName, -- Contact name
    adr.AddressLine1, -- First line of the address
    adr.AddressLine2, -- Second line of the address
    adr.City, -- City of the address
    adr.State, -- State or region of the address
    adr.PostalCode, -- Postal or ZIP code of the address
    cou.CountryCode, -- Country code
    cou.CountryName, -- Country name
    cou.ISO3Code, -- ISO 3166-1 alpha-3 code
    cou.ISONumericCode, -- ISO 3166-1 numeric code
    reg.RegionCode, -- Region code
    reg.RegionDescription -- Region description
FROM 
    dbo.Customers cus
LEFT JOIN 
    dbo.Addresses adr ON cus.CustomerID = adr.CustomerID
LEFT JOIN 
    dbo.Countries cou ON adr.CountryCode = cou.CountryCode
LEFT JOIN 
    dbo.CountryRegions cr ON cou.CountryCode = cr.CountryCode
LEFT JOIN 
    dbo.Regions reg ON cr.RegionCode = reg.RegionCode;

Prompt:
Modify the schema of the above objects to include the schema name.

Prompt:
Change the definition of the above tables and test whether the tables already exists.
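
No output is reproduced for this prompt; the result typically wraps each definition in an existence check along these lines (a hypothetical sketch for one table, not Copilot's actual output):

IF OBJECT_ID(N'dbo.Countries', N'U') IS NULL
BEGIN
    CREATE TABLE dbo.Countries (
        CountryCode CHAR(2) PRIMARY KEY, -- ISO 3166-1 alpha-2 code
        CountryName NVARCHAR(100) NOT NULL,
        ISO3Code CHAR(3), -- ISO 3166-1 alpha-3 code
        ISONumericCode INT -- ISO 3166-1 numeric code
    );
END;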

Prompt:
Annotate the definition of the above objects with a comment.

Prompt:
Generate the scripts for inserting 10 customers with different addresses in the above tables.
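
Again, no output is reproduced here; such a script typically looks similar to the following sketch (hypothetical values; only one of the 10 customers shown):

-- hypothetical sample data: one customer with its address
INSERT INTO dbo.Countries (CountryCode, CountryName)
VALUES ('DE', N'Germany');

DECLARE @CustomerID INT;

INSERT INTO dbo.Customers (CustomerName, ContactName, Country)
VALUES (N'Contoso Ltd.', N'Jane Doe', N'Germany');

SET @CustomerID = SCOPE_IDENTITY();

INSERT INTO dbo.Addresses (CustomerID, AddressLine1, City, PostalCode, CountryCode)
VALUES (@CustomerID, N'Main Street 1', N'Cologne', N'50667', 'DE');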

Prompt:
Aggregate the update and insert commands into the same stored procedure.

Output:

CREATE PROCEDURE dbo.UpsertCustomer
    @CustomerID INT = NULL, -- Optional parameter for insert
    @CustomerName NVARCHAR(100),
    @ContactName NVARCHAR(100),
    @Country NVARCHAR(50)
AS
BEGIN
    IF EXISTS (SELECT 1 FROM dbo.Customers WHERE CustomerID = @CustomerID)
    BEGIN
        -- Update existing customer
        UPDATE dbo.Customers
        SET 
            CustomerName = @CustomerName,
            ContactName = @ContactName,
            Country = @Country
        WHERE 
            CustomerID = @CustomerID;
    END
    ELSE
    BEGIN
        -- Insert new customer
        INSERT INTO dbo.Customers (CustomerName, ContactName, Country)
        VALUES (@CustomerName, @ContactName, @Country);
    END
END;
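
The following hypothetical calls show the behavior: the first one inserts a new customer (no ID supplied), while the second updates an existing record:

-- insert (CustomerID not provided, so the EXISTS check fails)
EXEC dbo.UpsertCustomer @CustomerName = N'Contoso Ltd.', @ContactName = N'Jane Doe', @Country = N'Germany';

-- update (assuming a customer with ID 1 exists)
EXEC dbo.UpsertCustomer @CustomerID = 1, @CustomerName = N'Contoso Ltd.', @ContactName = N'John Doe', @Country = N'Germany';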

I used to write similar code in the past to consolidate the logic for inserts and updates within the same stored procedure. This reduces the number of objects created in a database and minimizes the impact of some other issues (e.g. code duplication for validation).

I also prefer putting the comma at the beginning of each line where it applies (and not at the end). This preference can be addressed with the following prompt:
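
For clarity, this is the leading-comma style meant here (a minimal illustration, not one of Copilot's outputs):

SELECT 
    cus.CustomerID
    , cus.CustomerName
    , cus.ContactName
FROM dbo.Customers cus;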

Prompt: 
Rewrite the above database objects by putting the comma in front of each line where it applies.

Prompt:
Modify the view so each join condition is written on its own line.

Prompt:
Modify the view with create or alter command.

Prompt:
Modify the view to check for null values.

Comments:
1) The scripts could also be used in a SQL database, though one must pay attention to the order in which the steps are run (e.g. Customers, Countries and Regions before Addresses and CountryRegions, and all tables before the view and the stored procedures).
2) All these are basic steps, so a natural question arises: how far can we go? One can generate further objects, and some knowledge about the data model seems to be available to Copilot, though normal data models are much more complex than this! Probably there's a limit beyond which Copilot will start to be inefficient as long as the data model is not available explicitly. SQL Server Copilot would probably help overcome such limitations, though the feature is still in limited preview.
3) I wonder whether, given a set of best practices, Copilot will be able to use them and create the CRUD objects accordingly. More experiments are needed though.
4) It will be interesting to observe how much the prompts written by other people lead to similar or different outcomes. (In the past I observed that nontechnical prompts generated different outcomes.)
5) The fact that I was able to change some formatting (e.g. the comma before each line) for all objects with just a prompt frankly made my day! I can't stress enough how many hours of work unformatted code required in the past! It will be interesting to check whether this can also be applied to a complex database schema.

Happy coding!


References:
[1] Microsoft 365 (2024) Overview of prompts [link]

Resources:
[R1] Microsoft 365 (2025) Microsoft 365 Copilot release notes [link]
[R2] Microsoft 365 (2025) What’s new in Microsoft 365 Copilot | Jan 2025 [link]
[R3] Microsoft 365 (2025) Copilot is now included in Microsoft 365 Personal and Family [link]

Acronyms:
LLM - large language model
CRUD - Create, Read, Update, Delete

31 March 2024

🏭🗒️Microsoft Fabric: Polaris [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!

Last updated: 31-Mar-2024

[Microsoft Fabric] Polaris

  • {def} cloud-native analytical query engine over the data lake that follows a stateless micro-service architecture and is designed to execute queries in a scalable, dynamic and fault-tolerant way [1], [2]
    • the engine behind the serverless SQL pool [1] and Microsoft Fabric [2]
    • petabyte-scale execution [1]
    • highly-available micro-service architecture
      • data and query processing is packaged into units (aka tasks) [1]
        • can be readily moved across compute nodes and re-started at the task level [1]
    • can run directly over data in HDFS and in managed transactional stores [1]
  • [Azure Synapse] designed initially to execute read-only queries [1]
    • ⇐ the architecture behind serverless SQL pool
    • uses a completely new scale-out framework based on a distributed SQL Server query engine [1]
        • fully compatible with T-SQL
        • leverages SQL Server single-node runtime and QO [1]
  • [Microsoft Fabric] extended with a complete transaction manager that executes general CRUD transactions [2]
    • incl. updates, deletes and bulk loads [2]
    • based on [delta tables] and [delta lake]
      • Delta Lake currently supports only transactions within one table [4]
    • ⇐ the architecture behind lakehouses
  • {goal} converge DWH and big data workloads [1]
    • the query engine scales-out for relational data and heterogeneous datasets stored in DFSs [1]
      • needs a clean abstraction over the underlying data type and format, capturing just what’s needed for efficiently parallelizing data processing
  • {goal} separate compute and state for cloud-native execution [1]
    • all services within a pool are stateless
      • data is stored durably in remote storage and is abstracted via data cells [1]
        • ⇐ data is naturally decoupled from compute nodes
    • the metadata and transactional log state is off-loaded to centralized services [1]
    • multiple compute pools can transactionally access the same logical database [1]
  • {goal} cloud-first [2]
    • {benefit} leverages elasticity
    • transactions need to be resilient to node failures on dynamically changing topologies [2]
      •  ⇒ the storage engine disaggregates the source of truth for execution state (including data, metadata and transactional state) from compute nodes [2]
    • must ensure disaggregation of metadata and transactional state from compute nodes [2]
      • ⇐ to ensure that the life span of a transaction is resilient to changes in the backend compute topology [2]
        • ⇐ can change dynamically to take advantage of the elastic nature of the cloud or to handle node failures [2]
  • {goal} use optimized native columnar, immutable and open storage format [2]
    • uses delta format 
      • ⇐ optimized to handle read-heavy workloads with low contention [2] 
  • {goal} leverage the full potential of vectorized query processing for SQL [2]
  • {goal} support zero-copy data sharing with other services in the lake [2]
  • {goal} support read-heavy workloads with low contention [2]
  • {goal} support lineage-based features [2]
    • by taking advantage of delta table capabilities 
  • {goal} provide full SQL SI transactional support [2]
    • {benefit} all traditional DWH requirements are met [2]
      • incl. multi-table and multi-statement transactions [2]
        • ⇐ Polaris is the only system that supports this [2]
        • the design is optimized for analytics, specifically read- and insert-intensive workloads [2]
        • mixes of transactions are supported as well
  • {objective} no cross-component state sharing [2] 
    • {principle} encapsulation of state within each component to avoid sharing state across nodes [2]
    • SI and the isolation of state across components allows to execute transactions as if they were queries [2]
      • ⇒ makes read and write transactions indistinguishable [2]
        • ⇒ allows to fully leverage its optimized distributed execution framework [2]
  • {objective} support Snapshot Isolation (SI) semantics [2]
    • implemented over versioned data
    • allows reads (R) and writes (W) to proceed concurrently over their own data snapshot 
      • R/W never conflict, and W/W of active transactions only conflict if they modify the same data [2] 
      • ⇐ all W transactions are serializable, leading to a serial schedule in increasing order of log record IDs [4]
        • follows from the commit protocol for write transactions, where only one transaction can write the record with each record ID [4]
      • ⇐  R transactions at the snapshot isolation level create no contention
        •  ⇒  any number of R transactions can run concurrently [4]
    • the immutable data representation in LSTs allows dealing with failures by simply discarding data and metadata files that represent uncommitted changes [2]
      • similar to how temporary tables are discarded during query processing failures [2]
  • {feature} resize live workloads [1]
    • scales resources with the workloads automatically
  • {feature} deliver predictable performance at scale [1]
    • scales computational resources based on workloads' needs
  • {feature} efficiently handle both relational and unstructured data [1]
  • {feature} flexible, fine-grained task monitoring
    • a task is the finest grain of execution 
  • {feature} global resource-aware scheduling
    • enables much better resource utilization and concurrency than traditional DWHs
      • capable of handling partial query restarts
      • maintains a global view of multiple queries
    • it is planned to build autonomous workload management features on top of this global view
  • {feature} multi-layered data caching model
    • leverages 
      • SQL Server buffer pools for caching columnar data
      • SSD caching
    • since the delta table and its log are immutable, they can be safely cached on cluster nodes [4]
  • {feature} tracks data lineage natively
    • the transaction log can also be used for audit logging, based on the commitInfo records [4]
  • {feature} versioning
    • maintain all versions as data is updated [1]
  • {feature} time-travel
    • {benefit} allows users to query point-in-time snapshots
    • {benefit} allows rolling back erroneous updates to the data
  • {feature} table cloning
    • {benefit} allows creating a point-in-time snapshot of the data based on its metadata
  • {concept} state 
    • allows to drive the end-to-end life cycle of a SQL statement with transactional guarantees and top tier performance [1]
    • comprised of 
      • cache
      • metadata
      • transaction logs
      • data
    • [on-premises architecture] all state is in the compute layer
      • relies on small, highly stable and homogenous clusters with dedicated hardware for Tier-1 performance
      • {downside} expensive
      • {downside} hard to maintain
      • {downside} limited scalability
        • cluster capacity is bounded by machine sizes because of the fixed topology
  • {concept} stateful architecture
    • the state of inflight transactions is stored in the compute node and is not hardened into persistent storage until the transaction commits [1]
      • ⇒ when a compute node fails, the state of non-committed transactions is lost [1] 
        •  ⇒ the in-flight transactions fail as well [1]
    • often also couples metadata describing data distributions and mappings to compute nodes [1] 
      • ⇒ a compute node effectively owns responsibility for processing a subset of the data [1] 
        • its ownership cannot be transferred without a cluster restart [1]
    • {downside} resilience to compute node failure and elastic assignment of data to compute are not possible [1]
  • {concept} stateless compute architecture
    • requires that compute nodes hold no state information [1]
      • ⇒ all data, transactional logs and metadata need to be externalized [1]
    • {benefit} allows applications to 
      • partially restart the execution of queries in the event of compute node failures [1] 
      • adapt to online changes of the cluster topology without failing in-flight transactions [1] 
    • caches need to be as close to the compute as possible [1] 
      • since they can be lazily reconstructed from persisted data they don’t necessarily need to be decoupled from compute [1] 
        • the coupling of caches and compute does not make the architecture stateful [1] 
  • {concept} [cloud] decoupling of compute and storage
    • provides more flexible resource scaling
      • the 2 layers can scale up and down independently adapting to user needs [1] 
      • customers pay for the compute needed to query a working subset of the data [1] 
    • is not the same as decoupling compute and state [1] 
      • if any of the remaining state held in compute cannot be reconstructed from external services, then compute remains stateful [1] 

References:
[1] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment PVLDB 13(12)  (link)
[2] Josep Aguilar-Saborit et al (2024), Extending Polaris to Support Transactions (link)
[3] Advancing Analytics (2021) Azure Synapse Analytics - Polaris Whitepaper Deep-Dive (link)
[4] Michael Armbrust et al (2020) Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores, Proceedings of the VLDB Endowment 13(12) (link)

Acronyms:
ADLS - Azure Data Lake Storage
CRUD - Create, Read, Update, Delete
DCP - distributed computation platform 
DFS - Distributed File System
DWH - data warehouse
HDFS - Hadoop DFS
R/W - read/write
SI - Snapshot Isolation
SSD - Solid-State Drive

05 February 2024

💎🏭SQL Reloaded: Microsoft Fabric's Delta Tables in Action - CRUD Operations II (Looking at the Files)

Delta Table's Files

In a previous post, I tested the CRUD operations against a Microsoft Fabric delta table. To view the files behind it, navigate to the Lakehouse, right-click on the Assets table and pick 'View files' from the floating menu:

Assets' delta table files in Lakehouse

One can navigate within the folder structure and look at the transaction log files under the _delta_log folder, though there seems to be no way to download the respective files. Alternatively, you can use Azure Storage Explorer (downloadable here). After installing the Explorer, use the 'ADLSGen2 container or directory' option to connect to the folder where the Assets table was created (see [1]).

Assets' delta table files in Azure Storage Explorer

And, here's the content of the _delta_log from where the files can be downloaded:

Assets' log files in Azure Storage Explorer

The _temporary folder is empty.

It seems that a log file was created for each cell containing a DML/DDL command. Therefore, it should make no difference whether you run the cells individually or all at once, respectively whether the statements are included in a single cell.

The first file holds the content generated when creating the table:

{"commitInfo":{"timestamp":1707006399711,"operation":"CREATE TABLE","operationParameters":{"isManaged":"true","description":null,"partitionBy":"[]","properties":"{}"},"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"c3e6f24e-fa78-4941-a29d-28aac3a25926"}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"a9a45d67-bcf9-4bef-8f48-c6b8f7a64f58","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[
{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"CreationDate\",\"type\":\"timestamp\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Vendor\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Asset\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Model\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Owner\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Tag\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1707006399640}}

The second file holds the content generated when inserting the first line in the table:

{"commitInfo":{"timestamp":1707006405806,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputRows":"1","numOutputBytes":"3713"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"ac51fa78-8384-4e53-bb02-40d65632a218"}}
{"add":{"path":"part-00000-b50df7ae-66d8-41fd-a54d-6238e053d269-c000.snappy.parquet","partitionValues":{},"size":3713,"modificationTime":1707006405683,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":
{\"Id\":1,\"CreationDate\":\"2024-03-01T00:00:00.000Z\",\"Vendor\":\"IBM\",\"Asset\":\"Laptop 1\",\"Model\":\"Model 1\",\"Owner\":\"Owner 1\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":1,\"CreationDate\":\"2024-03-01T00:00:00.000Z\",\"Vendor\":\"IBM\",\"Asset\":\"Laptop 1\",\"Model\":\"Model 1\",\"Owner\":\"Owner 1\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}

The third file holds the content for inserting the remaining records, while the fourth file contains the content for the UPDATE statement:

{"commitInfo":{"timestamp":1707006417538,"operation":"UPDATE","operationParameters":{"predicate":"[\"(Id#1952 = 6)\"]"},"readVersion":2,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numRemovedBytes":"3870","numCopiedRows":"4","numAddedChangeFiles":"0","executionTimeMs":"2297","scanTimeMs":"1613","numAddedFiles":"1","numUpdatedRows":"1","numAddedBytes":"3870","rewriteTimeMs":"681"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"70f0f718-331a-48fc-8419-9f80599ca889"}}
{"remove":{"path":"part-00000-0bd7b054-1bde-44cd-a852-80332d98ae8b-c000.snappy.parquet","deletionTimestamp":1707006417533,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":3870,"tags":{"VORDER":"true"}}}
{"add":{"path":"part-00000-b5e4e62f-c938-4db9-a83a-d85850f310c0-c000.snappy.parquet","partitionValues":{},"size":3870,"modificationTime":1707006417436,"dataChange":true,"stats":"{\"numRecords\":5,\"minValues\":{\"Id\":2,\"CreationDate\":\"2024-01-01T00:00:00.000Z\",\"Vendor\":\"Dell\",\"Asset\":\"Laptop 2\",\"Model\":\"Model 2\",\"Owner\":\"Owner 2\",\"Tag\":\"DD0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":6,\"CreationDate\":\"2024-02-01T00:00:00.000Z\",\"Vendor\":\"Microsoft\",\"Asset\":\"Laptop 4\",\"Model\":\"Model 4\",\"Owner\":\"Owner 4\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}

The fifth file holds the content related to the DELETE statement:

{"commitInfo":{"timestamp":1707006422147,"operation":"DELETE","operationParameters":{"predicate":"[\"(Id#2665 = 2)\"]"},"readVersion":3,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numRemovedBytes":"3870","numCopiedRows":"4","numAddedChangeFiles":"0","executionTimeMs":"1425","numDeletedRows":"1","scanTimeMs":"908","numAddedFiles":"1","numAddedBytes":"3837","rewriteTimeMs":"517"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"1f61449f-6c4a-40ce-80ed-6ff1a9a834e9"}}
{"remove":{"path":"part-00000-b5e4e62f-c938-4db9-a83a-d85850f310c0-c000.snappy.parquet","deletionTimestamp":1707006422134,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":3870,"tags":{"VORDER":"true"}}}
{"add":{"path":"part-00000-bee2e3a5-7def-4466-94ff-ccd86aa37b8c-c000.snappy.parquet","partitionValues":{},"size":3837,"modificationTime":1707006422055,"dataChange":true,"stats":"{\"numRecords\":4,\"minValues\":{\"Id\":3,\"CreationDate\":\"2024-01-01T00:00:00.000Z\",\"Vendor\":\"Dell\",\"Asset\":\"Laptop 3\",\"Model\":\"Model 3\",\"Owner\":\"Owner 2\",\"Tag\":\"DD0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":6,\"CreationDate\":\"2024-02-01T00:00:00.000Z\",\"Vendor\":\"Microsoft\",\"Asset\":\"Laptop 4\",\"Model\":\"Model 4\",\"Owner\":\"Owner 4\",\"Tag\":\"WX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}

Before looking at each file, let's note the engineInfo value, which denotes the Spark and Delta Lake versions: "Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8". It's important to check these versions against the documentation when troubleshooting!

In each file, besides the "timestamp", observe the values of "operation", "operationParameters", "operationMetrics", "dataChange" and "VORDER". 

The operation indicates the DML or DDL statement run. As per my understanding, dataChange indicates whether the commit changed the table's data (true for the DML statements above), while VORDER set to true indicates that V-Order optimization was applied to the written Parquet files.

The DESCRIBE HISTORY Command

You can retrieve the history of the actions performed on a table for the past 30 days via the DESCRIBE HISTORY command, which can be entered in a notebook cell:

-- describe table's history (30 days of data)
DESCRIBE HISTORY Assets;

The attributes' definitions can be found in [3] (see the 'History schema' section). For big tables, it might be a good idea to limit the command's output to just a few records via the LIMIT keyword:

-- describe table's history (30 days of data, last 5 records)
DESCRIBE HISTORY Assets LIMIT 5;
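
The version numbers surfaced by DESCRIBE HISTORY can also be used to query an older snapshot of the table via Delta Lake's time travel syntax (a sketch; it assumes the Spark and Delta Lake versions in use support this syntax):

-- query the table as it was at an earlier version (e.g. right after the first insert)
SELECT *
FROM Assets VERSION AS OF 1;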

Table Maintenance

Running multiple change commands on a delta table results in many such small files, which impacts the read access to data and metadata. The files can be compacted via table maintenance (see also the notes on delta tables). The OPTIMIZE command consolidates multiple small Parquet files into larger files:

-- compacting the files
OPTIMIZE Assets;

Alternatively, you can do the same from the Lakehouse: navigate to the Assets table, right-click on it, select Maintenance from the floating menu and run the job with the provided settings (see also [4]):

Run maintenance commands for a table

Running the command generated a new file:

{"commitInfo":{"timestamp":1707150591870,"operation":"OPTIMIZE","operationParameters":{"predicate":"[]","zOrderBy":"[]","auto":false},"readVersion":4,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"2","numRemovedBytes":"7550","p25FileSize":"3879","numDeletionVectorsRemoved":"0","minFileSize":"3879","numAddedFiles":"1","maxFileSize":"3879","p75FileSize":"3879","p50FileSize":"3879","numAddedBytes":"3879"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"b6466509-9f1a-467d-ad99-a0e95ed694ec"}}
{"add":{"path":"part-00000-453e9218-fe59-4138-a7dc-0e71c94c07d2-c000.snappy.parquet","partitionValues":{},"size":3879,"modificationTime":1707150591746,"dataChange":false,"stats":"{\"numRecords\":5,\"minValues\":{\"Id\":1,\"CreationDate\":\"2024-01-01T00:00:00.000Z\",\"Vendor\":\"Dell\",\"Asset\":\"Laptop 1\",\"Model\":\"Model 1\",\"Owner\":\"Owner 1\",\"Tag\":\"DD0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":6,\"CreationDate\":\"2024-03-01T00:00:00.000Z\",\"Vendor\":\"Microsoft\",\"Asset\":\"Laptop 4\",\"Model\":\"Model 4\",\"Owner\":\"Owner 4\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}
{"remove":{"path":"part-00000-cbaaf2dc-b8d6-44bc-a4e1-97db0ba96074-c000.snappy.parquet","deletionTimestamp":1707150580358,"dataChange":false,"extendedFileMetadata":true,"partitionValues":{},"size":3713,"tags":{"VORDER":"true"}}}
{"remove":{"path":"part-00000-6714ec80-1bd8-41f3-b8b7-5197025ec08f-c000.snappy.parquet","deletionTimestamp":1707150580358,"dataChange":false,"extendedFileMetadata":true,"partitionValues":{},"size":3837,"tags":{"VORDER":"true"}}}

Happy coding!

Resources:
[1] Microsoft Azure (2023) Azure Storage Explorer (link)
[2] Sam Debruyn (2023) Exploring OneLake with Microsoft Azure Storage Explorer (link)
[3] Delta Lake (2023) Table utility commands (link)
[4] Microsoft Learn (2023) Use table maintenance feature to manage delta tables in Fabric (link)

