In a previous post I created a delta table called Assets. For troubleshooting and maintenance tasks it is useful to retrieve a table's metadata (properties, definition, etc.). There are several ways to extract this metadata, depending on the layer and the method used.
INFORMATION_SCHEMA in SQL Endpoint
Listing all the delta tables available in a Lakehouse can be done using the INFORMATION_SCHEMA via the SQL Endpoint:
-- retrieve the list of tables
SELECT *
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA
TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | TABLE_TYPE |
Testing | dbo | city | BASE TABLE |
Testing | dbo | assets | BASE TABLE |
-- retrieve column metadata
SELECT TABLE_CATALOG
, TABLE_SCHEMA
, TABLE_NAME
, COLUMN_NAME
, ORDINAL_POSITION
, DATA_TYPE
, CHARACTER_MAXIMUM_LENGTH
, NUMERIC_PRECISION
, NUMERIC_SCALE
, DATETIME_PRECISION
, CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'assets'
ORDER BY ORDINAL_POSITION
Note:
Given that the INFORMATION_SCHEMA is an ANSI-standard schema for providing metadata about a database's objects (including views and schemata), this is probably the best way to retrieve general metadata like the above (see [3]). More information can be obtained over the SQL Endpoint by querying the SQL Server metadata directly (see the sys.tables query below).
Table's Properties
The Delta Lake documentation reveals that a table's properties can be retrieved via the DESCRIBE DETAIL command, which can be used in a notebook cell (see [1] for the attributes' definitions):
-- describe table's details
DESCRIBE DETAIL Assets;
Attribute | Value |
format | delta |
id | 83e87b3c-28f4-417f-b4f5-842f6ba6f26d |
name | spark_catalog.testing.assets |
description | NULL |
location | abfss://[…]@onelake.dfs.fabric.microsoft.com/[…]/Tables/assets |
createdAt | 2024-02-05T16:18:54Z |
lastModified | 2024-02-05T16:29:52Z |
partitionColumns | |
numFiles | 1 |
sizeInBytes | 3879 |
properties | [object Object] |
minReaderVersion | 1 |
minWriterVersion | 2 |
tableFeatures | appendOnly,invariants |
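The same details can be retrieved programmatically as a DataFrame via the Delta Lake Python API; a minimal sketch, assuming the delta-spark package is available in the notebook's runtime:
%%pyspark
from delta.tables import DeltaTable

# detail() returns a one-row DataFrame with the same attributes as DESCRIBE DETAIL
details = DeltaTable.forName(spark, "Assets").detail()
details.select("name", "location", "numFiles", "sizeInBytes").show(truncate=False)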
Back in the SQL Endpoint, the SQL Server metadata mentioned above can be queried directly, e.g. via sys.tables:
-- table metadata
SELECT t.object_id
, schema_name(t.schema_id) schema_name
, t.name table_name
, t.type_desc
, t.create_date
, t.modify_date
, t.durability_desc
, t.temporal_type_desc
, t.data_retention_period_unit_desc
FROM sys.tables t
WHERE name = 'assets'
Attribute | Value |
object_id | 1264723558 |
schema_name | dbo |
table_name | assets |
type_desc | USER_TABLE |
create_date | 2024-02-05 16:19:04.810 |
modify_date | 2024-02-05 16:19:04.810 |
durability_desc | SCHEMA_AND_DATA |
temporal_type_desc | NON_TEMPORAL_TABLE |
data_retention_period_unit_desc | INFINITE |
Back in the notebook, the table's definition can be displayed as follows:
-- show table's definition
DESCRIBE TABLE Assets;
col_name | data_type | comment |
Id | int | NULL |
CreationDate | timestamp | NULL |
Vendor | string | NULL |
Asset | string | NULL |
Model | string | NULL |
Owner | string | NULL |
Tag | string | NULL |
Quantity | decimal(13,2) | NULL |
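The same definition can be read in PySpark from the DataFrame's schema; a minimal sketch:
%%pyspark
# the table's definition as seen by Spark
df = spark.table("Assets")
df.printSchema()  # tree view with columns, data types and nullability

# alternatively, iterate over the schema's fields programmatically
for field in df.schema.fields:
    print(field.name, field.dataType.simpleString(), field.nullable)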
-- show table's definition (extended)
DESCRIBE TABLE EXTENDED Assets;
Label | Value |
Name | spark_catalog.testing.assets |
Type | MANAGED |
Location | abfss://[...]@onelake.dfs.fabric.microsoft.com/[...]/Tables/assets |
Provider | delta |
Owner | trusted-service-user |
Table Properties | [delta.minReaderVersion=1,delta.minWriterVersion=2] |
-- show table's properties
SHOW TBLPROPERTIES Assets;
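For further processing, the properties can be collected into a Python dictionary; a minimal sketch:
%%pyspark
# SHOW TBLPROPERTIES returns key/value rows that can be collected client-side
rows = spark.sql("SHOW TBLPROPERTIES Assets").collect()
props = {row["key"]: row["value"] for row in rows}
print(props.get("delta.minReaderVersion"), props.get("delta.minWriterVersion"))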
-- retrieve column metadata
DESCRIBE Assets Assets.CreationDate;
Output:
info_name | info_value |
col_name | CreationDate |
data_type | timestamp |
comment | NULL |
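The same column-level information is available from the schema in PySpark; a minimal sketch:
%%pyspark
# retrieve a single column's definition from the table's schema
field = spark.table("Assets").schema["CreationDate"]
print(field.name, field.dataType.simpleString(), field.nullable)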
In PySpark it's trickier to retrieve the metadata stored in a table's underlying parquet files (code adapted after Dennes Torres' presentation at Toboggan Winter edition 2024):
%%pyspark
import os
import re
import pyarrow.dataset as ds

def show_metadata(file_path, table_name):
    # prints the metadata stored in a delta table's parquet files
    print(f"\\{table_name}:")
    schema_properties = ds.dataset(file_path).schema.metadata
    if schema_properties:
        for key, value in schema_properties.items():
            print(f"{key.decode('utf-8')}:{value.decode('utf-8')}")
    else:
        print("No properties available!")

# main code
path = "/lakehouse/default/Tables"
tables = [f for f in os.listdir(path) if re.match('asset', f)]
for table in tables:
    show_metadata(os.path.join(path, table), table)
\assets:
org.apache.spark.version:3.4.1
org.apache.spark.sql.parquet.row.metadata:{"type":"struct","fields":[
{"name":"Id","type":"integer","nullable":true,"metadata":{}},
{"name":"CreationDate","type":"timestamp","nullable":true,"metadata":{}},
{"name":"Vendor","type":"string","nullable":true,"metadata":{}},
{"name":"Asset","type":"string","nullable":true,"metadata":{}},
{"name":"Model","type":"string","nullable":true,"metadata":{}},
{"name":"Owner","type":"string","nullable":true,"metadata":{}},
{"name":"Tag","type":"string","nullable":true,"metadata":{}},
{"name":"Quantity","type":"decimal(13,2)","nullable":true,"metadata":{}}]}
com.microsoft.parquet.vorder.enabled:true
com.microsoft.parquet.vorder.level:9
Note:
One can list the metadata for all tables by removing the regular expression filter on the table name:
tables = os.listdir(path)
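Alternatively, the tables can be enumerated via the Spark catalog instead of the file system; a minimal sketch:
%%pyspark
# list the tables registered in the Spark catalog
for table in spark.catalog.listTables():
    print(table.name, table.tableType, table.isTemporary)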
References:
[1] Delta Lake (2023) Table utility commands (link)
[2] Databricks (2023) DESCRIBE TABLE (link)
[3] Microsoft Learn (2023) System Information Schema Views (link)