In a previous post I created a delta table called Assets. For troubleshooting and maintenance tasks it would be useful to retrieve a table's metadata - properties, definition, etc. There are different ways to extract the metadata, depending on the layer and method used.
INFORMATION_SCHEMA in SQL Endpoint
Listing all the delta tables available in a Lakehouse can be done using the INFORMATION_SCHEMA via the SQL Endpoint:

-- retrieve the list of tables
SELECT *
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | TABLE_TYPE |
| Testing | dbo | city | BASE TABLE |
| Testing | dbo | assets | BASE TABLE |
-- retrieve column metadata
SELECT TABLE_CATALOG
, TABLE_SCHEMA
, TABLE_NAME
, COLUMN_NAME
, ORDINAL_POSITION
, DATA_TYPE
, CHARACTER_MAXIMUM_LENGTH
, NUMERIC_PRECISION
, NUMERIC_SCALE
, DATETIME_PRECISION
, CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'assets'
ORDER BY ORDINAL_POSITION
Note:
Given that the INFORMATION_SCHEMA is an ANSI-standard schema for providing metadata about a database's objects (including views and schemata), this is probably the best way to retrieve general metadata like the above (see [3]). Further information can be obtained by querying the SQL Server metadata exposed by the SQL Endpoint directly.
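To illustrate the latter, here's a minimal sketch (not from the original queries) that retrieves column metadata via the sys catalog views, assuming the same assets table in the dbo schema:

-- retrieve column metadata via the sys catalog views
SELECT c.column_id
, c.name column_name
, t.name data_type
, c.max_length
, c.precision
, c.scale
, c.is_nullable
FROM sys.columns c
     JOIN sys.types t
       ON c.user_type_id = t.user_type_id
WHERE c.object_id = OBJECT_ID('dbo.assets')
ORDER BY c.column_id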
Table's Properties
The Delta Lake documentation shows that a table's properties can be retrieved via the DESCRIBE command, which can be used in a notebook's cell (see [1] for the attributes' definitions):
-- describe table's details
DESCRIBE DETAIL Assets;
| Attribute | Value |
| format | delta |
| id | 83e87b3c-28f4-417f-b4f5-842f6ba6f26d |
| name | spark_catalog.testing.assets |
| description | NULL |
| location | abfss://[…]@onelake.dfs.fabric.microsoft.com/[…]/Tables/assets |
| createdAt | 2024-02-05T16:18:54Z |
| lastModified | 2024-02-05T16:29:52Z |
| partitionColumns | |
| numFiles | 1 |
| sizeInBytes | 3879 |
| properties | [object Object] |
| minReaderVersion | 1 |
| minWriterVersion | 2 |
| tableFeatures | appendOnly,invariants |
The same kind of table-level metadata is available in the SQL Endpoint by querying the sys.tables catalog view directly:

-- table metadata
SELECT t.object_id
, schema_name(t.schema_id) schema_name
, t.name table_name
, t.type_desc
, t.create_date
, t.modify_date
, t.durability_desc
, t.temporal_type_desc
, t.data_retention_period_unit_desc
FROM sys.tables t
WHERE name = 'assets'
| Attribute | Value |
| object_id | 1264723558 |
| schema_name | dbo |
| table_name | assets |
| type_desc | USER_TABLE |
| create_date | 2024-02-05 16:19:04.810 |
| modify_date | 2024-02-05 16:19:04.810 |
| durability_desc | SCHEMA_AND_DATA |
| temporal_type_desc | NON_TEMPORAL_TABLE |
| data_retention_period_unit_desc | INFINITE |
-- show table's definition
DESCRIBE TABLE Assets;
| col_name | data_type | comment |
| Id | int | NULL |
| CreationDate | timestamp | NULL |
| Vendor | string | NULL |
| Asset | string | NULL |
| Model | string | NULL |
| Owner | string | NULL |
| Tag | string | NULL |
| Quantity | decimal(13,2) | NULL |
-- show table's definition (extended)
DESCRIBE TABLE EXTENDED Assets;
| Label | Value |
| Name | spark_catalog.testing.assets |
| Type | MANAGED |
| Location | abfss://[...]@onelake.dfs.fabric.microsoft.com/[...]/Tables/assets |
| Provider | delta |
| Owner | trusted-service-user |
| Table Properties | [delta.minReaderVersion=1,delta.minWriterVersion=2] |
-- show table's properties
SHOW TBLPROPERTIES Assets;
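A single property can also be retrieved by specifying its key - a small variation on the above, assuming the delta.minReaderVersion property shown earlier:

-- show a single table property by key
SHOW TBLPROPERTIES Assets ('delta.minReaderVersion');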
-- retrieve column metadata
DESCRIBE Assets Assets.CreationDate;
Output:
| info_name | info_value |
| col_name | CreationDate |
| data_type | timestamp |
| comment | NULL |
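When only the column names are needed, Spark SQL's SHOW COLUMNS command offers a terser alternative - a minimal sketch along the same lines:

-- list the table's columns
SHOW COLUMNS FROM Assets;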
In PySpark it's trickier to retrieve a table's definition (code adapted from Dennes Torres' presentation at the Toboggan Winter edition 2024):
%%pyspark
import pyarrow.dataset as ds
import os
import re

def show_metadata(file_path, table_name):
    # print a delta table's metadata from the underlying parquet schema
    print(f"{table_name}:")
    schema_properties = ds.dataset(file_path).schema.metadata
    if schema_properties:
        for key, value in schema_properties.items():
            print(f"{key.decode('utf-8')}:{value.decode('utf-8')}")
    else:
        print("No properties available!")

# main code
path = "/lakehouse/default/Tables"
tables = [f for f in os.listdir(path) if re.match('asset', f)]
for table in tables:
    show_metadata(os.path.join(path, table), table)
assets:
org.apache.spark.version:3.4.1
org.apache.spark.sql.parquet.row.metadata:{"type":"struct","fields":[
{"name":"Id","type":"integer","nullable":true,"metadata":{}},
{"name":"CreationDate","type":"timestamp","nullable":true,"metadata":{}},
{"name":"Vendor","type":"string","nullable":true,"metadata":{}},
{"name":"Asset","type":"string","nullable":true,"metadata":{}},
{"name":"Model","type":"string","nullable":true,"metadata":{}},
{"name":"Owner","type":"string","nullable":true,"metadata":{}},
{"name":"Tag","type":"string","nullable":true,"metadata":{}},
{"name":"Quantity","type":"decimal(13,2)","nullable":true,"metadata":{}}]}
com.microsoft.parquet.vorder.enabled:true
com.microsoft.parquet.vorder.level:9
Note:
One can list the metadata for all tables by removing the filter on table names starting with 'asset':
tables = os.listdir(path)
The above views from the INFORMATION_SCHEMA and sys schemas are available in SQL databases in Microsoft Fabric as well.
References:
[1] Delta Lake (2023) Table utility commands (link)
[2] Databricks (2023) DESCRIBE TABLE (link)
[3] Microsoft Learn (2023) System Information Schema Views (link)
