In a previous post I created a delta table called Assets. For troubleshooting and maintenance tasks it would be useful to retrieve a table's metadata - properties, definition, etc. There are different ways to extract the metadata, depending on the layer and method used.
INFORMATION_SCHEMA in SQL Endpoint
Listing all the delta tables available in a Lakehouse can be done using the INFORMATION_SCHEMA via the SQL Endpoint:

-- retrieve the list of tables
SELECT *
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA
| TABLE_CATALOG | TABLE_SCHEMA | TABLE_NAME | TABLE_TYPE |
| Testing | dbo | city | BASE TABLE |
| Testing | dbo | assets | BASE TABLE |
-- retrieve column metadata
SELECT TABLE_CATALOG
, TABLE_SCHEMA
, TABLE_NAME
, COLUMN_NAME
, ORDINAL_POSITION
, DATA_TYPE
, CHARACTER_MAXIMUM_LENGTH
, NUMERIC_PRECISION
, NUMERIC_SCALE
, DATETIME_PRECISION
, CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'assets'
ORDER BY ORDINAL_POSITION
Note:
Given that the INFORMATION_SCHEMA is an ANSI-standard schema for providing metadata about a database's objects (including views and schemata), this is probably the best way to retrieve general metadata like the above (see [3]). Further information can be obtained by querying the SQL Server metadata exposed by the SQL Endpoint directly.
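To illustrate the latter, here's a minimal sketch (not from the original queries) that retrieves column metadata via the sys catalog views, assuming the same assets table in the dbo schema:

-- retrieve column metadata via the sys catalog views
SELECT c.column_id
, c.name column_name
, t.name data_type
, c.max_length
, c.precision
, c.scale
, c.is_nullable
FROM sys.columns c
     JOIN sys.types t
       ON c.user_type_id = t.user_type_id
WHERE c.object_id = OBJECT_ID('dbo.assets')
ORDER BY c.column_id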
Table's Properties
The Delta Lake documentation shows that a table's properties can be retrieved via the DESCRIBE command, which can be used in a notebook's cell (see [1] for the attributes' definitions):
-- describe table's details
DESCRIBE DETAIL Assets;
| Attribute | Value |
| format | delta |
| id | 83e87b3c-28f4-417f-b4f5-842f6ba6f26d |
| name | spark_catalog.testing.assets |
| description | NULL |
| location | abfss://[…]@onelake.dfs.fabric.microsoft.com/[…]/Tables/assets |
| createdAt | 2024-02-05T16:18:54Z |
| lastModified | 2024-02-05T16:29:52Z |
| partitionColumns | |
| numFiles | 1 |
| sizeInBytes | 3879 |
| properties | [object Object] |
| minReaderVersion | 1 |
| minWriterVersion | 2 |
| tableFeatures | appendOnly,invariants |
The same kind of table-level metadata is available in the SQL Endpoint by querying the sys.tables catalog view directly:

-- table metadata
SELECT t.object_id
, schema_name(t.schema_id) schema_name
, t.name table_name
, t.type_desc
, t.create_date
, t.modify_date
, t.durability_desc
, t.temporal_type_desc
, t.data_retention_period_unit_desc
FROM sys.tables t
WHERE name = 'assets'
| Attribute | Value |
| object_id | 1264723558 |
| schema_name | dbo |
| table_name | assets |
| type_desc | USER_TABLE |
| create_date | 2024-02-05 16:19:04.810 |
| modify_date | 2024-02-05 16:19:04.810 |
| durability_desc | SCHEMA_AND_DATA |
| temporal_type_desc | NON_TEMPORAL_TABLE |
| data_retention_period_unit_desc | INFINITE |
-- show table's definition
DESCRIBE TABLE Assets;
| col_name | data_type | comment |
| Id | int | NULL |
| CreationDate | timestamp | NULL |
| Vendor | string | NULL |
| Asset | string | NULL |
| Model | string | NULL |
| Owner | string | NULL |
| Tag | string | NULL |
| Quantity | decimal(13,2) | NULL |
-- show table's definition (extended)
DESCRIBE TABLE EXTENDED Assets;
| Label | Value |
| Name | spark_catalog.testing.assets |
| Type | MANAGED |
| Location | abfss://[...]@onelake.dfs.fabric.microsoft.com/[...]/Tables/assets |
| Provider | delta |
| Owner | trusted-service-user |
| Table Properties | [delta.minReaderVersion=1,delta.minWriterVersion=2] |
-- show table's properties
SHOW TBLPROPERTIES Assets;
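A single property can also be retrieved by specifying its key - a small variation on the above, assuming the delta.minReaderVersion property shown earlier:

-- show a single table property by key
SHOW TBLPROPERTIES Assets ('delta.minReaderVersion');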
-- retrieve column metadata
DESCRIBE Assets Assets.CreationDate;
Output:
| info_name | info_value |
| col_name | CreationDate |
| data_type | timestamp |
| comment | NULL |
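When only the column names are needed, Spark SQL's SHOW COLUMNS command offers a terser alternative - a minimal sketch along the same lines:

-- list the table's columns
SHOW COLUMNS FROM Assets;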
In PySpark it's trickier to retrieve a table's definition (code adapted from Dennes Torres' presentation at the Toboggan Winter edition 2024):
%%pyspark
import pyarrow.dataset as ds
import os
import re

def show_metadata(file_path, table_name):
    # print a delta table's metadata from the underlying parquet schema
    print(f"{table_name}:")
    schema_properties = ds.dataset(file_path).schema.metadata
    if schema_properties:
        for key, value in schema_properties.items():
            print(f"{key.decode('utf-8')}:{value.decode('utf-8')}")
    else:
        print("No properties available!")

# main code
path = "/lakehouse/default/Tables"
tables = [f for f in os.listdir(path) if re.match('asset', f)]
for table in tables:
    show_metadata(os.path.join(path, table), table)
assets:
org.apache.spark.version:3.4.1
org.apache.spark.sql.parquet.row.metadata:{"type":"struct","fields":[
{"name":"Id","type":"integer","nullable":true,"metadata":{}},
{"name":"CreationDate","type":"timestamp","nullable":true,"metadata":{}},
{"name":"Vendor","type":"string","nullable":true,"metadata":{}},
{"name":"Asset","type":"string","nullable":true,"metadata":{}},
{"name":"Model","type":"string","nullable":true,"metadata":{}},
{"name":"Owner","type":"string","nullable":true,"metadata":{}},
{"name":"Tag","type":"string","nullable":true,"metadata":{}},
{"name":"Quantity","type":"decimal(13,2)","nullable":true,"metadata":{}}]}
com.microsoft.parquet.vorder.enabled:true
com.microsoft.parquet.vorder.level:9
Note:
One can list the metadata for all tables by removing the filter on table names starting with 'asset':
tables = os.listdir(path)
The above views from the INFORMATION_SCHEMA and sys schemas are available in SQL databases in Microsoft Fabric as well.
References:
[1] Delta Lake (2023) Table utility commands (link)
[2] Databricks (2023) DESCRIBE TABLE (link)
[3] Microsoft Learn (2023) System Information Schema Views (link)
