Showing posts with label SQL Reloaded. Show all posts
Showing posts with label SQL Reloaded. Show all posts

10 April 2024

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Import data from CSV Files into Delta Table

Microsoft provides a set of labs and exercises that can be used to learn working with data in Fabric, however the real learning comes when one considers an example that introduces something new. As I've downloaded some time ago an archive with several datasets on Sales forecast from the Kaggle website, I tried to import the Features dataset in different ways and see how it goes.

Store,Date,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,IsHoliday
1,2010-02-05,42.31,2.572,NA,NA,NA,NA,NA,211.0963582,8.106,FALSE
1,2010-02-12,38.51,2.548,NA,NA,NA,NA,NA,211.2421698,8.106,TRUE
1,2010-02-19,39.93,2.514,NA,NA,NA,NA,NA,211.2891429,8.106,FALSE
1,2010-02-26,46.63,2.561,NA,NA,NA,NA,NA,211.3196429,8.106,FALSE
1,2010-03-05,46.5,2.625,NA,NA,NA,NA,NA,211.3501429,8.106,FALSE

Within an existing lakehouse, one can import the CSV as it is via 'Files/Upload' and once the data is imported, once can navigate to the file and use 'Load to Tables/New Table' to important the data into a managed table. Unfortunately, because some of the numeric fields include also literal values "NA" for the columns with NULLs, their data type is considered as varchar(8000), which is not ideal for calculations:

-- table created via Load to Tables
CREATE TABLE [dbo].[walmart_features](
	[Store] [int] NULL,
	[Date] [date] NULL,
	[Temperature] [float] NULL,
	[Fuel_Price] [float] NULL,
	[MarkDown1] [varchar](8000) NULL,
	[MarkDown2] [varchar](8000) NULL,
	[MarkDown3] [varchar](8000) NULL,
	[MarkDown4] [varchar](8000) NULL,
	[MarkDown5] [varchar](8000) NULL,
	[CPI] [varchar](8000) NULL,
	[Unemployment] [varchar](8000) NULL,
	[IsHoliday] [bit] NULL
) ON [PRIMARY]

This could be fixed by replacing the NA values with an empty value, which I did and used this version for the next steps. 

I tried then using Spark to import the data, though then all the fields are defined as varchar(8000).

-- table created via Spark
CREATE TABLE [dbo].[walmart_features2](
	[Store] [varchar](8000) NULL,
	[Date] [varchar](8000) NULL,
	[Temperature] [varchar](8000) NULL,
	[Fuel_Price] [varchar](8000) NULL,
	[MarkDown1] [varchar](8000) NULL,
	[MarkDown2] [varchar](8000) NULL,
	[MarkDown3] [varchar](8000) NULL,
	[MarkDown4] [varchar](8000) NULL,
	[MarkDown5] [varchar](8000) NULL,
	[CPI] [varchar](8000) NULL,
	[Unemployment] [varchar](8000) NULL,
	[IsHoliday] [varchar](8000) NULL
) ON [PRIMARY]
GO

So, is needed to define the schema explicitly, however I had to import the IsHoliday as string and cast the value explicitly to a Boolean using a second data frame (see alternatives):

from pyspark.sql.types import *
from pyspark.sql.functions import *

#define schema
featuresSchema = StructType([
      StructField("Store", IntegerType())
    , StructField("Date", DateType())
    , StructField("Temperature",  DecimalType(13,2))
    , StructField("Fuel_Price", DecimalType(13,2))
    , StructField("MarkDown1", DecimalType(13,2))
    , StructField("MarkDown2", DecimalType(13,2))
    , StructField("MarkDown3", DecimalType(13,2))
    , StructField("MarkDown4", DecimalType(13,2))
    , StructField("MarkDown5", DecimalType(13,2))
    , StructField("CPI", DecimalType(18,6))
    , StructField("Unemployment", DecimalType(13,2))
    , StructField("IsHoliday", StringType())
])

# Load a file into a dataframe
df = spark.read.load('Files/OpenSource/features2.csv'
    , format='csv'
    , schema = featuresSchema
    , header=True)

# do the conversion for isHoliday
df2 = df.withColumn("IsHoliday", df.IsHoliday.cast(BooleanType())) # Save the dataframe as a delta table df2.write.format("delta").saveAsTable("walmart_features3")

Now, table's definition looks much better:

-- table created via Spark with explicit schema
CREATE TABLE [dbo].[walmart_features3](
	[Store] [int] NULL,
	[Date] [date] NULL,
	[Temperature] [decimal](13, 2) NULL,
	[Fuel_Price] [decimal](13, 2) NULL,
	[MarkDown1] [decimal](13, 2) NULL,
	[MarkDown2] [decimal](13, 2) NULL,
	[MarkDown3] [decimal](13, 2) NULL,
	[MarkDown4] [decimal](13, 2) NULL,
	[MarkDown5] [decimal](13, 2) NULL,
	[CPI] [decimal](18, 6) NULL,
	[Unemployment] [decimal](13, 2) NULL,
	[IsHoliday] [bit] NULL
) ON [PRIMARY]
GO

Comments:
(1) I tried to apply the schema change directly on the initial data frame, though the schema didn't change:

df.withColumn("IsHoliday", df.IsHoliday.cast(BooleanType()))

(2) For the third method one could have left the NA in because by the conversion a NULL will be considered. Conversely, it might be needed to check if there are other values that fail the conversion. 

(3) The following warning in the Notebook when running the above code is a hint that something went wrong during the conversion (e.g. decimals were cut): 

"Your file(s) might include corrupted records"

(4) Especially for the transformed values it makes sense to look at the values (at last when the dataset isn't too big):

-- validating the values for the Boolean data field
SELECT IsHoliday
, count(*) NoRecords
FROM dbo.walmart_features3
GROUP BY IsHoliday
ORDER BY 1

(5) The tables can be deleted directly in the lakehouse or via PySpark (observe the catalog.table_name):

#dropping the table
spark.sql('DROP TABLE IF EXISTS UAT.walmart_features3') 

At the beginning probably it makes sense to remove "IF EXISTS" to make sure that the table is available.

(6) For those who run into a similar issue, the SQL Endpoint for the lakehouse is read only, therefore attempting to delete a table via SSMS will result in such an error:

Drop failed for Table 'dbo.walmart_features2'.  (Microsoft.SqlServer.Smo)
The external policy action 'Microsoft.Sql/Sqlservers/Databases/Schemas/Tables/Drop' was denied on the requested resource.
Cannot drop the table 'walmart_features2', because it does not exist or you do not have permission.

Happy coding!

09 April 2024

SQL Server New Features: Copilot in Azure SQL Database is in Private Preview (SQL Reloaded)

Following the news on Copilot and its adoption in the various areas, probably you asked yourself when it will become available in SQL Server. The good news is that the Copilot in Azure SQL Database is already in Private Preview (see [1], [2]). The bad news is that for the early adoption program, Microsoft looks only for customers that can provide feedback based on real-world use cases, and one will need a valid Tenant ID for signing up for the PP.

The new feature comes with two Azure portal experiences:

  • Natural language to SQL: within the Azure portal query editor the natural language prompts are translated into SQL;
  • Azure Copilot integration: provides customers with self-guided assistance for managing databases and troubleshooting. 
In the Data Exposed webcast with Anna Hoffman [1], the invitee, Joe Sack, showcases the use of both experiences. In what concerns the first experience, besides simple joins, Joe also showcased the use of a PIVOT based on the prompt:

"Show me a pivot summary table that displays the total number of properties sold in each year from 2020 to 2023".

-- output from Copilot
SELECT *
FROM 
(
	SELECT YEAR(SaleDate) AS Year, COUNT(*) AS TotalPropertiesSold
	FROM [PropertySearch].[dbo].[PropertyHistory]
	WHERE YEAR(SaleDate) BETWEEN 2020 AND 2023
	GROUP BY YEAR(SalesDate)
) AS PivotDate
PIVOT
(
	SUM(TotalPropertiesSold)
	FOR Year IN ([2020], [2021], [2022], [2023]
) AS PivotTable;

Even if this is a modest example, the great thing is that it generates the skeleton on which the logic can be built, which can really help beginners, though it can be also useful for more seasoned data professionals as it saves time. Probably, in a first phase this will be the main benefit of Copilot - to generate skeletons or templates on which further logic can be built. Hopefully in time it will do much more than that.

I wonder how complex the prompts can become and how can the logic be structured to create a multistep scenario. The Copilot versions from other areas showed that complex prompts give results, the question is whether Copilot can devise the steps in an optimum manner, much like a seasoned data professional does. 

The feature utilizes for the moment the table and view names, column names, primary key, and foreign key metadata to generate T-SQL code. Probably, it makes sense to also use index and statistics information, query hints and all the arsenal usually needed by data professionals to optimize a query. Conversely, maybe the second experience could be used for optimizing and troubleshooting the query. 

I'd really love to test this feature, though probably I'll need to wait until it becomes publicly available. In the meanwhile, one can play with the GitHub Copilot [3] or install Copilot in Azure Data Studio [4].

References:
[1] Data Exposed (2024) Introducing Copilot in Azure SQL Database (Private Preview) (link)
[2] Azure SQL Blog (2024) Microsoft Copilot in Azure extends capabilities to Azure SQL Database (Private Preview) by Joe Sack (link)
[3] Azure SQL Blog (2023) GitHub Copilot for SQL Developers: Turbocharge your SQL Development, by Subhojit Basak (link)
[4] Microsoft Developer (2023) Copilot is now in Azure Data Studio and this is how it can help you! (link)

05 April 2024

SQL Reloaded: SQL Antipatterns (Part I: JOINs, UNIONs & DISTINCT)

Introduction

SQL antipatterns refer in general to common mistakes made when developing SQL code, though the term can refer also to situations in which even if the code is syntactically and logically correct, it's either suboptimal, unclear or even incorrect. Therefore "mistake" can cover a wide range of scenarios, some that can be ignored, while others need to be addressed accordingly. 

In this post I consider a few antipatterns observed especially in data warehouses (DWHs). Let's look at the below code created to exemplify several scenarios:

-- Products in open orders (initial query)
SELECT DISTINCT ITM.ProductId                                   -- (1) use of DISTINCT
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM
    LEFT JOIN (							-- (5) use of JOIN instead of EXISTS
	-- Open Purchase orders 
	SELECT DISTINCT POL.ProductId
	, 'POs' Source                                          -- (7) use columns not needed in output
	FROM Purchasing.PurchaseOrderDetail POL                 
	     LEFT JOIN Purchasing.PurchaseOrderHeader POH       -- (2) use of LEFT JOIN instead of FULL JOIN
		  ON POL.PurchaseOrderID = POH.PurchaseOrderID
	WHERE POH.Status = 1 -- pending 
	UNION					                -- (3) use of UNION
	-- Open Sales orders 
	SELECT DISTINCT SOL.ProductId
	, 'SOs' Source
	FROM Sales.SalesOrderDetail SOL
	    LEFT JOIN Sales.SalesOrderHeader SOH
		  ON SOL.SalesOrderID = SOH.SalesOrderID
	WHERE SOH.Status = 1 -- in process		        -- (4) use of OR instead of IN
	   OR SOH.Status = 2 -- approved
	) DAT
	ON ITM.ProductID = DAT.ProductID
WHERE DAT.ProductID IS NOT NULL 
ORDER BY ITM.ProductNumber			                -- (6) using too many columns in ORDER BY
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size

(1) Use of DISTINCT 

DISTINCT is a dirty way to remove the duplicates from a dataset. Sometimes it makes sense to use it to check something fast, though it should be avoided into code intended for a production environment because it can lead to unexpected behavior especially when selecting all the columns using the "*" (SELECT DISTINCT *).

I saw tools and developers adding a DISTINCT in almost each step, independently on whether it was necessary or not. One can thus but wonder whether the DISTINCT was added to fix a bigger issue with the data in the DWH, to remove special duplicates imposed by the logic or just as poor practice. Unfortunately, when it's used frequently, it can become challenging to investigate its use and discover the actual issues from the DWH.

There are several approaches to eliminate DISTINCTs from the code: GROUP BY, ranking functions or upon case also code rewrites.

(2)  Use of LEFT JOIN Instead of FULL JOIN

When refreshing a DWH there can be the case that related data are out of synch. It would be the case of Purchase or Sales orders where the headers and lines are out of synch, headers existing without lines and/or vice versa. A common practice is to use a FULL JOIN and thus to eliminate such exceptions, though there are also entitled uses of a LEFT JOIN. This antipattern resumes however to the cases in which logically should be used a FULL JOIN, though a LEFT JOIN is used instead.

In the above example there are two distinct occurrences of this pattern: the relationship between the header and lines in the inner query, respectively the LEFT JOIN with a NOT NULL constraint in the outer query. The latter use is useful when during testing one wants to see all the Products, though bringing this further into production may rise some eyebrows even if it's not necessarily wrong. Anyway, the database engine should be smart enough to recognize such a scenario. However, for the header vs lines case, upon case the plan generated might be suboptimal. 

One of the best practices when writing SQL queries is to state one's intent clearly in what the logic concerns. Using a LEFT JOIN instead of a FULL JOIN can make people raise questions about the actual need. When the need is not properly documented, some developer may even go and change the joins. There can be for example business cases that are not cover by the current data, but as soon as case appears it will lead to incorrect logic!

Similarly, splitting a piece of logic into two or more steps unnecessarily can create confusion. There can be however also entitled s situations (e.g. query optimization), which ideally should be documented.

(3) Use of UNION

When a UNION is used, the values returned by the first query will be checked against the values of the second query, and thus unnecessary comparisons occur even if they are not needed. This depends also on the business context, which might not be easily to identify from the query (especially when the reviewer doesn't know the business case). 

The misuse of a UNION will not make a big difference when the volume of data is small, though the more data are processed by the query, the higher the impact. 

Besides the proper use of the UNION, there are also situations in which a query rewrite can eliminate the need for a UNION (see the rewritten query below).

(4) use of OR instead of IN

One can occasionally find queries in which a OR was used for 10 to 50 distinct values as in the example above. Even if the database engine generate in both cases the same query plan, it's easier to read and maintain a query that used IN. However, if the number of values go beyond a certain value, other techniques should be used to improve the performance.

The only benefit I see for a OR is to document meaning's values or remove during testing one of the values, especially when the list is a selection coming from the user. Frankly, it's not the appropriate way for documenting logic (even if I'm doing it sometimes in ad-hoc queries).

There's a more extreme scenario in which distinct subqueries are written for each or a set of ORs (e.g. the distinction between open vs closed vs. invoices orders), which can make sense sometimes (e.g. the logic is completely different). Therefore, an antipattern can be dependent also of the context or use case. 

(5) use of JOIN instead of EXISTS

When there are no values returned from the subquery, quite often it makes sense to the EXISTS or not EXISTS operators in the queries (see the rewritten query below). This might not be indicated however for distributed environments like serverless SQL pool in which the distribution of the processing across multiple tasks might benefit when the pieces of the logic distributed don't require heavy reshuffles. 

(6) Using too Many Columns in ORDER BY

The columns specified in an ORDER BY clause need to make sense, otherwise they just add extra burden on the database engine, at least from the perspective of the checks that need to be performed. In the above query, at least the Name doesn't make sense.

It helps also if the columns can use existing indexes, though this depends also on query specifics. 

Another antipattern scenario not exemplified above is the use of ordinals to refer to the columns, which should be avoided in production environments (because the order of the columns can be changed accidentally or even :

-- using ordinals instead of number columns (not recommended)
SELECT ITM.ProductId                                  
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM                           
ORDER BY 2, 4, 5, 6

(7) Use Columns Not  Needed in Output

Besides the fact that each column included unnecessarily in the query can increase the size of the data processed (unless the database engine is smart to remove them), there can be also performance issues and/or optimizations involved. For example, if all the other columns are part of a covering index, the database engine might opt for a suboptimal index compared to the case in which the unnecessary columns are removed. 

Conversely, some columns are helpful to troubleshoot the logic (and that's why the Source column was considered) even if they aren't considered in the final output or the logic. It doesn't make sense to bring the query version with the respective fields into production, even if this would mean to have maybe a second version of the query used only for troubleshooting needs. Commenting the unnecessary columns could be a better choice, even if it's not recommended in general as too many such comments can obfuscate the case. 

Rewriting the Query

With the above input the query can be rewritten as follows:

-- Products in open orders (modified query)
SELECT ITM.ProductId                                  
, ITM.ProductNumber
, ITM.Name 
, ITM.Color 
, ITM.Style 
, ITM.Size
FROM Production.Product ITM
WHERE EXISTS (										            
	-- Open Purchase orders 
	SELECT POL.ProductId
	FROM Purchasing.PurchaseOrderDetail POL                 
	     JOIN Purchasing.PurchaseOrderHeader POH      
		  ON POL.PurchaseOrderID = POH.PurchaseOrderID
	WHERE POH.Status = 1 
	  AND ITM.ProductID = POL.ProductID
	)
 OR EXISTS (				                                    
	-- Open Sales orders 
	SELECT SOL.ProductId
	FROM Sales.SalesOrderDetail SOL
	     JOIN Sales.SalesOrderHeader SOH
		  ON SOL.SalesOrderID = SOH.SalesOrderID
	WHERE SOH.Status IN (1, 2)
	  AND ITM.ProductID = SOL.ProductID
	)	                           
ORDER BY ITM.ProductNumber			                           
, ITM.Color 
, ITM.Style 
, ITM.Size

Please note that in general to each rule there are also exceptions which should be considered against one's common sense. When the benefit of addressing an antipattern is neglectable compared with the effort involved and the logic doesn't create any issues, probably it's better to let the code as it. One can still reconsider the antipatterns later with the next refactoring opportunity. 

There are zealous data professionals who treat minor inconveniences (e.g. not using upper case for SQL reserved words, alternate code formatting, alternative writing of words, especially function names, different indentation, the use of "--" for commenting within a query, etc.) as antipatterns. Use your common sense and evaluate the effort against the benefits or risks and, not less important, be patient with others' mistakes!

Happy coding!

07 February 2024

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Views and other Data Objects

One reads in the training material that the SQL Endpoint provides a read-only experience [1], meaning that no data can be written back to the delta lake tables. Playing with the metadata available in Spark SQL via Notebooks and the SQL Endpoint (see post), I realized that there is more to the statement! Even if one can query via the SQL Endpoint only delta tables, this doesn't mean that one can't build a semantic model on top of it, much like one was able to do via the Serverless SQL pool in Azure Synapse.

In Spark one can create via SQL, PySpark and the other supported languages views and functions, though they will not be available to the SQL Endpoint! To use the data generated in the process, the respective data needs to be saved to delta tables. Conversely, one can still create views, functions and stored procedures via the SQL Endpoint though the objects won't be available in Spark SQL! 

This has important implications, though in this post let's focus on the syntax and create several objects for testing purposes in the two environments. I'll use the Assets delta table created in a previous post. The Spark SQL code should be run in a notebook (e.g. one cell per group of statements), while the code for the SQL Endpoint should be run in SQL Server Management Studio.

Views

/* test view's creation in the SQL Endpoint */

-- drop the test view 
DROP VIEW IF EXISTS dbo.vAssets_Microsoft2;
GO

-- create the test view
CREATE VIEW dbo.vAssets_Microsoft2
AS
--Microsoft assets
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets
WHERE Vendor = 'Microsoft';
GO

-- test the viwe
SELECT *
FROM dbo.vAssets_Microsoft2;

/* test view's creation in Spark SQL */

-- drop test view 
DROP VIEW IF EXISTS vAssets_Microsoft;

-- create test view
CREATE VIEW vAssets_Microsoft COMMENT 'Microsoft assets in scope (view)'
AS
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM assets
WHERE Vendor = 'Microsoft';

-- review data
SELECT *
FROM vAssets_Microsoft;

Table-Valued Functions

/* test function's creation in the SQL Endpoint */

-- drop the test function 
DROP FUNCTION IF EXISTS dbo.fAssets_Microsoft2;
GO

-- create the test function
CREATE FUNCTION dbo.fAssets_Microsoft2(
    @Vendor nvarchar(max))
RETURNS TABLE 
AS 
RETURN (
    SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
    FROM dbo.Assets
    WHERE Vendor = @Vendor
);
GO

-- test the function
SELECT *
FROM dbo.fAssets_Microsoft2('Microsoft');

SELECT *
FROM dbo.fAssets_Microsoft2('Dell');

Unfortunately, the Spark SQL code doesn't seem to work, its execution returning a PARSE_SYNTAX_ERROR error no matter how simple the code was (see also [2]).
 
/* test function's creation in Spark SQL */

-- drop test function 
DROP FUNCTION IF EXISTS fAssets_Microsoft;

-- create test function
CREATE FUNCTION fAssets_Microsoft(
    pVendor string)
RETURNS TABLE
AS 
RETURN 
    SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
    FROM assets
    WHERE Vendor = pVendor;

-- review data
SELECT *
FROM fAssets_Microsoft('Microsoft');

Stored Procedure

Stored procedures aren't available in Spark SQL, though this doesn't mean that we can't test the code in the SQL Endpoint:

/* test procedure's creation in the SQL Endpoint */

-- drop the test procedure 
DROP PROCEDURE IF EXISTS dbo.spAssets_Microsoft2;
GO

-- create the test procedure
CREATE PROCEDURE dbo.spAssets_Microsoft2(
@Vendor nvarchar(max) = NULL)
AS
--Microsoft assets
SELECT Id, CreationDate, Vendor, Asset, Model, Owner, Tag, Quantity
FROM dbo.Assets
WHERE Vendor = IsNull(@Vendor, Vendor);
GO

-- test the procedure
EXEC dbo.spAssets_Microsoft2 'Microsoft';
EXEC dbo.spAssets_Microsoft2 'Dell';
EXEC dbo.spAssets_Microsoft2;

Notes:
1) I observed in documentation and some presentations that the common practice of prefixing data objects based on their type is seldom considered. I still find it useful when building solutions, even if object's type can be derived from the context and/or metadata. 
2) The examples were chosen to test the minimal functionality so that the differences between the two platforms are minimal - using the dbo schema and the GO command in the SQL Endpoint, COMMENT in Spark SQL. However, as soon specific functionality is used, extra code is needed to mitigate the differences.
3) The names between environments were kept different, just in case one needs to test objects' availability between platforms.

Happy coding!

Resources:
[1] Microsoft Learn (2023) Work with Delta Lake tables in Microsoft Fabric (link)
[2] Databricks (2023) CREATE FUNCTION (SQL and Python) (link)

06 February 2024

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Data Objects Metadata

There seem to be four main sources for learning about the functionality available in the Delta Lake especially in what concerns the SQL dialect used by the Spark SQL: Databricks [1], Delta Lake [2], Azure Databricks [3], respectively the Data Engineering documentation in Microsoft Fabric [4] and the afferent certification material. Unfortunately, the latter focuses more on PySpark. So, until Microsoft addresses the gap, one can consult the other sources, check what's working and built thus the required knowledge for handling the various tasks. 

First of all, it's important to understand which the data objects available in Microsoft Fabric are. Based on [5] I could identify the following hierarchy:


According to the same source [5] the metastore contains all of the metadata that defines data objects in the lakehouse, while the catalog is the highest abstraction in the lakehouse. The database, called also a schema, keeps its standard definition - a collection of data objects, such as tables or views (aka relations) and functions. The tables, views and functions keep their standard definitions. Except the metastore, these are also the (securable) objects on which permissions can be set. 

One can explore the structure in Spark SQL by using the SHOW command:

-- explore the data objects from the lakehouse
SHOW CATALOGS;

SHOW DATABASES;

SHOW SCHEMAS;

SHOW CATALOGS;

SHOW VIEWS;

SHOW TABLES;

SHOW FUNCTIONS;

Moreover, one can list only the objects from an object from the parent (e.g. the tables existing in a database):
 
-- all tables from a database
SHOW TABLES FROM Testing;

-- all tables from a database matching a pattern
SHOW TABLES FROM Testing LIKE 'Asset*';

-- all tables from a database matching multiple patterns
SHOW TABLES FROM Testing LIKE 'Asset*|cit*';

Notes:
1) Same syntax applies for views and functions, respectively for the other objects in respect to their parents (from the hierarchy). 
2) Instead of FROM one can use the IN keyword, though I'm not sure what's the difference given that they seem to return same results.
3) For databases and tables one can use also the SQL Server to export the related metadata. Unfortunately, it's not the case for views and functions because when the respective objects are created in Spark SQL, they aren't visible over the SQL Endpoint, and vice versa.

4) Given that there are multiple environments that deal with delta tables, to minimize the confusion that might result from their use, it makes sense to use database as term instead of schema. 

References:
[1] Databricks (2024) SQL language reference (link)
[2] Delta Lake (2023) Delta Lake documentation (link)
[3] Microsoft Learn (2023) Azure Databricks documentation (link)
[4] Microsoft Learn (2023) Data Engineering documentation in Microsoft Fabric (link)
[5] Databricks (2023) Data objects in the Databricks lakehouse (link)

SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Table Metadata II (Updating a table's COMMENT attributes)

While using the DESCRIBE TABLE metadata I observed that its output shown also a Comment attribute which was NULL for all the columns. Browsing through the Databricks documentation (see [1]), I found that the Comment can be provided in a delta table's definition as follows (code to be run in a notebook):

-- drop the table (if already exists)
DROP TABLE IF EXISTS Assets2;

--create the table
CREATE TABLE Assets2 (
 Id int NOT NULL COMMENT 'Asset UID',
 CreationDate timestamp NOT NULL COMMENT 'Creation Date',
 Vendor string NOT NULL COMMENT 'Vendors name',
 Asset string NOT NULL COMMENT 'Asset type',
 Model string NOT NULL COMMENT 'Asset model',
 Owner string NOT NULL COMMENT 'Current owner',
 Tag string NOT NULL COMMENT 'Assets tag',
 Quantity decimal(13, 2) NOT NULL COMMENT 'Quantity'
) 
USING DELTA
COMMENT 'Vendor assets';

-- show table's definition
DESCRIBE TABLE Assets2;

-- show table's details
DESCRIBE DETAIL Assets;

So, I thought there must be a way to update Assets table's definition as well. And here's the solution split into two steps, as different syntax is required for modifying the columns, respectively the table:

-- modify columns' COMMENT for an existing table
ALTER TABLE Assets ALTER COLUMN ID COMMENT 'Asset UID';
ALTER TABLE Assets ALTER COLUMN CreationDate COMMENT 'Creation Date';
ALTER TABLE Assets ALTER COLUMN Vendor COMMENT 'Vendors name';
ALTER TABLE Assets ALTER COLUMN Asset COMMENT 'Asset type';
ALTER TABLE Assets ALTER COLUMN Model COMMENT 'Asset model';
ALTER TABLE Assets ALTER COLUMN Owner COMMENT 'Current owner';
ALTER TABLE Assets ALTER COLUMN Tag COMMENT 'Assets tag';
ALTER TABLE Assets ALTER COLUMN Quantity COMMENT 'Quantity';

-- show table's definition
DESCRIBE TABLE Assets;

The operation generated a log file with the following content:

{"commitInfo":{"timestamp":1707180621522,"operation":"CHANGE COLUMN","operationParameters":{"column":"{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}"},"readVersion":13,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"42590d18-e04a-4fb2-bbfa-94414b23fb07"}}
{"metaData":{"id":"83e87b3c-28f4-417f-b4f5-842f6ba6f26d","description":"Vendor assets","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[
{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset UID\"}},
{\"name\":\"CreationDate\",\"type\":\"timestamp\",\"nullable\":false,\"metadata\":{\"comment\":\"Creation Date\"}},
{\"name\":\"Vendor\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Vendors name\"}},
{\"name\":\"Asset\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset type\"}},
{\"name\":\"Model\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset model\"}},
{\"name\":\"Owner\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Current owner\"}},
{\"name\":\"Tag\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Assets tag\"}},
{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}]}","partitionColumns":[],"configuration":{},"createdTime":1707149934697}}

And the second step (see [2]):

-- modify a table's COMMENT
COMMENT ON TABLE Assets IS 'Vendor assets';

-- describe table's details
DESCRIBE DETAIL Assets;

    The operation generated a log file with the following content:

{"commitInfo":{"timestamp":1707180808138,"operation":"SET TBLPROPERTIES","operationParameters":{"properties":"{\"comment\":\"Vendor assets\"}"},"readVersion":14,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"21c315b5-a81e-4107-8a19-6256baee7bd5"}}
{"metaData":{"id":"83e87b3c-28f4-417f-b4f5-842f6ba6f26d","description":"Vendor assets","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[
{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset UID\"}}
,{\"name\":\"CreationDate\",\"type\":\"timestamp\",\"nullable\":false,\"metadata\":{\"comment\":\"Creation Date\"}}
,{\"name\":\"Vendor\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Vendors name\"}}
,{\"name\":\"Asset\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset type\"}}
,{\"name\":\"Model\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset model\"}}
,{\"name\":\"Owner\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Current owner\"}}
,{\"name\":\"Tag\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Assets tag\"}}
,{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}]}","partitionColumns":[],"configuration":{},"createdTime":1707149934697}}

Notes:
1) Don't forget to remove the Assets2 table!
2) Updating the COMMENT for a single delta table is simple, though for updating the same for a list of tables might be task for a PySpark job. It makes sense to provide the respective metadata from the start, when that's possible. Anyway, the information should be available in lakehouse's data dictionary.
3) One can reset the COMMENT to NULL or to an empty string.

Happy coding!

Resources:
[1] Databaricks (2023) ALTER TABLE (link)
[2] Databaricks (2023) COMMENT ON (link)
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.