Showing posts with label OPTIMIZE. Show all posts
Showing posts with label OPTIMIZE. Show all posts

05 February 2024

💎🏭SQL Reloaded: Microsoft Fabric's Delta Tables in Action - CRUD Operations II (Looking at the Files)

Delta Table's Files

In a previous post, I tested the CRUD operations against a Microsoft Fabric's delta table. To view the files behind, you'll need to navigate to the Lakehouse, right click on the Assets table and pick 'View files' from the floating menu:

Assets' delta table files in Lakehouse

One can navigate within the folder structure, look at the transaction log files under _delta_log folder, though there seems to be no way to download the respective files. For this purpose, alternatively, you can use the Azure Storage Explorer (downloadable here). After installing the Explorer use the 'ADLSGen2 container or directory' option to connect to the folder where the Assets table was created (see [1]).

Assets' delta table files in Azure Storage Explorer

And, here's the content of the _delta_log from where the files can be downloaded:

Assets' log files in Azure Storage Explorer

The _temporary folder is empty.

It seems that for each cell with a DML/DDL command a log file was created. Therefore, it should make no difference if you run the cells individually or as a whole, respectively whether the statements are included in only one cell. 

The first file holds the content generated when creating the table:

{"commitInfo":{"timestamp":1707006399711,"operation":"CREATE TABLE","operationParameters":{"isManaged":"true","description":null,"partitionBy":"[]","properties":"{}"},"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"c3e6f24e-fa78-4941-a29d-28aac3a25926"}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"a9a45d67-bcf9-4bef-8f48-c6b8f7a64f58","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[
{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"CreationDate\",\"type\":\"timestamp\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Vendor\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Asset\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Model\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Owner\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Tag\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}},
{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1707006399640}}

The second file holds the content generated when inserting the first line in the table:

{"commitInfo":{"timestamp":1707006405806,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"readVersion":0,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputRows":"1","numOutputBytes":"3713"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"ac51fa78-8384-4e53-bb02-40d65632a218"}}
{"add":{"path":"part-00000-b50df7ae-66d8-41fd-a54d-6238e053d269-c000.snappy.parquet","partitionValues":{},"size":3713,"modificationTime":1707006405683,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":
{\"Id\":1,\"CreationDate\":\"2024-03-01T00:00:00.000Z\",\"Vendor\":\"IBM\",\"Asset\":\"Laptop 1\",\"Model\":\"Model 1\",\"Owner\":\"Owner 1\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":1,\"CreationDate\":\"2024-03-01T00:00:00.000Z\",\"Vendor\":\"IBM\",\"Asset\":\"Laptop 1\",\"Model\":\"Model 1\",\"Owner\":\"Owner 1\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}

The third file holds the content for inserting the multiple lines, while the fourth file contains the content for the UPDATE statement:

{"commitInfo":{"timestamp":1707006417538,"operation":"UPDATE","operationParameters":{"predicate":"[\"(Id#1952 = 6)\"]"},"readVersion":2,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numRemovedBytes":"3870","numCopiedRows":"4","numAddedChangeFiles":"0","executionTimeMs":"2297","scanTimeMs":"1613","numAddedFiles":"1","numUpdatedRows":"1","numAddedBytes":"3870","rewriteTimeMs":"681"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"70f0f718-331a-48fc-8419-9f80599ca889"}}
{"remove":{"path":"part-00000-0bd7b054-1bde-44cd-a852-80332d98ae8b-c000.snappy.parquet","deletionTimestamp":1707006417533,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":3870,"tags":{"VORDER":"true"}}}
{"add":{"path":"part-00000-b5e4e62f-c938-4db9-a83a-d85850f310c0-c000.snappy.parquet","partitionValues":{},"size":3870,"modificationTime":1707006417436,"dataChange":true,"stats":"{\"numRecords\":5,\"minValues\":{\"Id\":2,\"CreationDate\":\"2024-01-01T00:00:00.000Z\",\"Vendor\":\"Dell\",\"Asset\":\"Laptop 2\",\"Model\":\"Model 2\",\"Owner\":\"Owner 2\",\"Tag\":\"DD0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":6,\"CreationDate\":\"2024-02-01T00:00:00.000Z\",\"Vendor\":\"Microsoft\",\"Asset\":\"Laptop 4\",\"Model\":\"Model 4\",\"Owner\":\"Owner 4\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}

The fifth file, holds the content related to the DELETE statement:

{"commitInfo":{"timestamp":1707006422147,"operation":"DELETE","operationParameters":{"predicate":"[\"(Id#2665 = 2)\"]"},"readVersion":3,"isolationLevel":"Serializable","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"1","numRemovedBytes":"3870","numCopiedRows":"4","numAddedChangeFiles":"0","executionTimeMs":"1425","numDeletedRows":"1","scanTimeMs":"908","numAddedFiles":"1","numAddedBytes":"3837","rewriteTimeMs":"517"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"1f61449f-6c4a-40ce-80ed-6ff1a9a834e9"}}
{"remove":{"path":"part-00000-b5e4e62f-c938-4db9-a83a-d85850f310c0-c000.snappy.parquet","deletionTimestamp":1707006422134,"dataChange":true,"extendedFileMetadata":true,"partitionValues":{},"size":3870,"tags":{"VORDER":"true"}}}
{"add":{"path":"part-00000-bee2e3a5-7def-4466-94ff-ccd86aa37b8c-c000.snappy.parquet","partitionValues":{},"size":3837,"modificationTime":1707006422055,"dataChange":true,"stats":"{\"numRecords\":4,\"minValues\":{\"Id\":3,\"CreationDate\":\"2024-01-01T00:00:00.000Z\",\"Vendor\":\"Dell\",\"Asset\":\"Laptop 3\",\"Model\":\"Model 3\",\"Owner\":\"Owner 2\",\"Tag\":\"DD0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":6,\"CreationDate\":\"2024-02-01T00:00:00.000Z\",\"Vendor\":\"Microsoft\",\"Asset\":\"Laptop 4\",\"Model\":\"Model 4\",\"Owner\":\"Owner 4\",\"Tag\":\"WX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}

Before looking at each file, let's note the engineInfo's information that denotes the Spark, respectively the Delta Lake versions: "Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8". It's important to check these versions against the documentation when troubleshooting!

In each file, besides the "timestamp", observe the values of "operation", "operationParameters", "operationMetrics", "dataChange" and "VORDER". 

Operation indicate the DML or DDL statements run. As per my understanding dataChange and VORDER are set to true. 

The DESCRIBE HISTORY Command

You can retrieve a history of the actions performed on a table for the past 30 days via the DESCRIBE command, which can be entered in a notebook's cell:

-- describe table's history (30 days of data)
DESCRIBE HISTORY Assets;

Attributes' definition can be found in [3], see 'History schema' section. For big tables, it might be a good idea to limit command's output only to several records via the LIMIT keyword:

-- describe table's history (30 days of data, last 5 records)
DESCRIBE HISTORY Assets LIMIT 5;

Table Maintenance

Running multiple change commands on a delta table will result in many such small files, which impacts the read access to data and metadata. The files can be compacted via the table maintenance (see also the notes on delta tables). The OPTIMIZE command allows to consolidate multiple small Parquet files into larger files:

-- compacting the files
OPTIMIZE Assets;

Alternatively, you can do the same from the Lakehouse, navigate to the Assets table, right click on it and from the floating menu select Maintenance and run the job with the provided (see also [4]):

Run maintenance commands for a table

Running the command generated a new file:

{"commitInfo":{"timestamp":1707150591870,"operation":"OPTIMIZE","operationParameters":{"predicate":"[]","zOrderBy":"[]","auto":false},"readVersion":4,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"2","numRemovedBytes":"7550","p25FileSize":"3879","numDeletionVectorsRemoved":"0","minFileSize":"3879","numAddedFiles":"1","maxFileSize":"3879","p75FileSize":"3879","p50FileSize":"3879","numAddedBytes":"3879"},"tags":{"VORDER":"true"},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"b6466509-9f1a-467d-ad99-a0e95ed694ec"}}
{"add":{"path":"part-00000-453e9218-fe59-4138-a7dc-0e71c94c07d2-c000.snappy.parquet","partitionValues":{},"size":3879,"modificationTime":1707150591746,"dataChange":false,"stats":"{\"numRecords\":5,\"minValues\":{\"Id\":1,\"CreationDate\":\"2024-01-01T00:00:00.000Z\",\"Vendor\":\"Dell\",\"Asset\":\"Laptop 1\",\"Model\":\"Model 1\",\"Owner\":\"Owner 1\",\"Tag\":\"DD0001\",\"Quantity\":1.00},\"maxValues\":{\"Id\":6,\"CreationDate\":\"2024-03-01T00:00:00.000Z\",\"Vendor\":\"Microsoft\",\"Asset\":\"Laptop 4\",\"Model\":\"Model 4\",\"Owner\":\"Owner 4\",\"Tag\":\"XX0001\",\"Quantity\":1.00},\"nullCount\":{\"Id\":0,\"CreationDate\":0,\"Vendor\":0,\"Asset\":0,\"Model\":0,\"Owner\":0,\"Tag\":0,\"Quantity\":0}}","tags":{"VORDER":"true"}}}
{"remove":{"path":"part-00000-cbaaf2dc-b8d6-44bc-a4e1-97db0ba96074-c000.snappy.parquet","deletionTimestamp":1707150580358,"dataChange":false,"extendedFileMetadata":true,"partitionValues":{},"size":3713,"tags":{"VORDER":"true"}}}
{"remove":{"path":"part-00000-6714ec80-1bd8-41f3-b8b7-5197025ec08f-c000.snappy.parquet","deletionTimestamp":1707150580358,"dataChange":false,"extendedFileMetadata":true,"partitionValues":{},"size":3837,"tags":{"VORDER":"true"}}}

Happy coding!

Resources:
[1] Microsoft Azure (2023) Azure Storage Explorer (link)
[2] Sam Debruyn (2023) Exploring OneLake with Microsoft Azure Storage Explorer (link)
[3] Delta Lake (2023) Table utility commands (link)
[4] Microsoft Learn (2023) Use table maintenance feature to manage delta tables in Fabric (link)

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.