In a previous post I created a delta table called Assets. For troubleshooting and maintenance tasks it is useful to retrieve a table's metadata (properties, definition, etc.). There are several ways to extract this metadata, depending on the layer and the method used.
INFORMATION_SCHEMA in SQL Endpoint
Listing all the delta tables available in a Lakehouse can be done using the INFORMATION_SCHEMA via the SQL Endpoint:
-- retrieve the list of tables
SELECT *
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA
TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | TABLE_TYPE |
Testing | dbo | city | BASE TABLE |
Testing | dbo | assets | BASE TABLE |
-- retrieve column metadata
SELECT TABLE_CATALOG
, TABLE_SCHEMA
, TABLE_NAME
, COLUMN_NAME
, ORDINAL_POSITION
, DATA_TYPE
, CHARACTER_MAXIMUM_LENGTH
, NUMERIC_PRECISION
, NUMERIC_SCALE
, DATETIME_PRECISION
, CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'assets'
ORDER BY ORDINAL_POSITION
Note:
Given that the INFORMATION_SCHEMA is an ANSI-standard schema for providing metadata about a database's objects (including views and schemata), this is probably the best way to retrieve general metadata like the above (see [3]). More information can be obtained over the SQL Endpoint by querying the SQL Server metadata directly (see the sys.tables query below).
Table's Properties
The Delta Lake documentation reveals that a table's properties can be retrieved via the DESCRIBE DETAIL command, which can be used in a notebook cell (see [1] for the attributes' definitions):
-- describe table's details
DESCRIBE DETAIL Assets;
Attribute | Value |
format | delta |
id | 83e87b3c-28f4-417f-b4f5-842f6ba6f26d |
name | spark_catalog.testing.assets |
description | NULL |
location | abfss://[…]@onelake.dfs.fabric.microsoft.com/[…]/Tables/assets |
createdAt | 2024-02-05T16:18:54Z |
lastModified | 2024-02-05T16:29:52Z |
partitionColumns | |
numFiles | 1 |
sizeInBytes | 3879 |
properties | [object Object] |
minReaderVersion | 1 |
minWriterVersion | 2 |
tableFeatures | appendOnly,invariants |
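The same details can be retrieved programmatically as a DataFrame via the Delta Lake Python API; a minimal sketch, assuming the delta-spark package is available in the notebook's runtime:
%%pyspark
from delta.tables import DeltaTable

# detail() returns a one-row DataFrame with the same attributes as DESCRIBE DETAIL
details = DeltaTable.forName(spark, "Assets").detail()
details.select("name", "location", "numFiles", "sizeInBytes").show(truncate=False)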
Back in the SQL Endpoint, the SQL Server metadata mentioned above can be queried directly, e.g. via sys.tables:
-- table metadata
SELECT t.object_id
, schema_name(t.schema_id) schema_name
, t.name table_name
, t.type_desc
, t.create_date
, t.modify_date
, t.durability_desc
, t.temporal_type_desc
, t.data_retention_period_unit_desc
FROM sys.tables t
WHERE name = 'assets'
Attribute | Value |
object_id | 1264723558 |
schema_name | dbo |
table_name | assets |
type_desc | USER_TABLE |
create_date | 2024-02-05 16:19:04.810 |
modify_date | 2024-02-05 16:19:04.810 |
durability_desc | SCHEMA_AND_DATA |
temporal_type_desc | NON_TEMPORAL_TABLE |
data_retention_period_unit_desc | INFINITE |
Back in the notebook, the table's definition can be displayed as follows:
-- show table's definition
DESCRIBE TABLE Assets;
col_name | data_type | comment |
Id | int | NULL |
CreationDate | timestamp | NULL |
Vendor | string | NULL |
Asset | string | NULL |
Model | string | NULL |
Owner | string | NULL |
Tag | string | NULL |
Quantity | decimal(13,2) | NULL |
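The same definition can be read in PySpark from the DataFrame's schema; a minimal sketch:
%%pyspark
# the table's definition as seen by Spark
df = spark.table("Assets")
df.printSchema()  # tree view with columns, data types and nullability

# alternatively, iterate over the schema's fields programmatically
for field in df.schema.fields:
    print(field.name, field.dataType.simpleString(), field.nullable)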
-- show table's definition (extended)
DESCRIBE TABLE EXTENDED Assets;
Label | Value |
Name | spark_catalog.testing.assets |
Type | MANAGED |
Location | abfss://[...]@onelake.dfs.fabric.microsoft.com/[...]/Tables/assets |
Provider | delta |
Owner | trusted-service-user |
Table Properties | [delta.minReaderVersion=1,delta.minWriterVersion=2] |
-- show table's properties
SHOW TBLPROPERTIES Assets;
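For further processing, the properties can be collected into a Python dictionary; a minimal sketch:
%%pyspark
# SHOW TBLPROPERTIES returns key/value rows that can be collected client-side
rows = spark.sql("SHOW TBLPROPERTIES Assets").collect()
props = {row["key"]: row["value"] for row in rows}
print(props.get("delta.minReaderVersion"), props.get("delta.minWriterVersion"))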
-- retrieve column metadata
DESCRIBE Assets Assets.CreationDate;
Output:
info_name | info_value |
col_name | CreationDate |
data_type | timestamp |
comment | NULL |
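The same column-level information is available from the schema in PySpark; a minimal sketch:
%%pyspark
# retrieve a single column's definition from the table's schema
field = spark.table("Assets").schema["CreationDate"]
print(field.name, field.dataType.simpleString(), field.nullable)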
In PySpark it's trickier to retrieve the metadata stored in a table's underlying parquet files (code adapted after Dennes Torres' presentation at Toboggan Winter edition 2024):
%%pyspark
import os
import re
import pyarrow.dataset as ds

def show_metadata(file_path, table_name):
    # prints the metadata stored in a delta table's parquet files
    print(f"\\{table_name}:")
    schema_properties = ds.dataset(file_path).schema.metadata
    if schema_properties:
        for key, value in schema_properties.items():
            print(f"{key.decode('utf-8')}:{value.decode('utf-8')}")
    else:
        print("No properties available!")

# main code
path = "/lakehouse/default/Tables"
tables = [f for f in os.listdir(path) if re.match('asset', f)]
for table in tables:
    show_metadata(os.path.join(path, table), table)
\assets:
org.apache.spark.version:3.4.1
org.apache.spark.sql.parquet.row.metadata:{"type":"struct","fields":[
{"name":"Id","type":"integer","nullable":true,"metadata":{}},
{"name":"CreationDate","type":"timestamp","nullable":true,"metadata":{}},
{"name":"Vendor","type":"string","nullable":true,"metadata":{}},
{"name":"Asset","type":"string","nullable":true,"metadata":{}},
{"name":"Model","type":"string","nullable":true,"metadata":{}},
{"name":"Owner","type":"string","nullable":true,"metadata":{}},
{"name":"Tag","type":"string","nullable":true,"metadata":{}},
{"name":"Quantity","type":"decimal(13,2)","nullable":true,"metadata":{}}]}
com.microsoft.parquet.vorder.enabled:true
com.microsoft.parquet.vorder.level:9
Note:
One can list the metadata for all tables by removing the regular expression filter on the table name:
tables = os.listdir(path)
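Alternatively, the tables can be enumerated via the Spark catalog instead of the file system; a minimal sketch:
%%pyspark
# list the tables registered in the Spark catalog
for table in spark.catalog.listTables():
    print(table.name, table.tableType, table.isTemporary)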
References:
[1] Delta Lake (2023) Table utility commands (link)
[2] Databricks (2023) DESCRIBE TABLE (link)
[3] Microsoft Learn (2023) System Information Schema Views (link)