SQL Troubles: SQL

Showing posts with label SQL. Show all posts

24 April 2025

💎🏭SQL Reloaded: Microsoft Fabric's Lakehouses at Work (Part I: Proof-of-Concept)

Introduction

One way to work with the data files existing in organization is to import them into a lakehouse and build a data model based on them that can be reused in the various solutions (incl. Power BI). As a reminder, a lakehouse is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location.

The tutorials available on lakehouses are pretty useful for getting an idea how to start, though probably each seasoned professional has his/her way of doing things, at least for testing the capabilities before building a proper solution. The target is thus, to create the minimum for testing the capabilities needed for a proof-of-concept solution.

The files used in this post are available on Microsoft's GitHub. Download the files and store them to be easily available for the next steps. The following files were considered for the current post: customers.csv, orders.csv and products.csv.

Create the Lakehouse

It's recommended to have a naming convention in place for the various items created in a workspace, e.g. a combination between item type (lakehouse, warehouse), system type (Prod, UAT, Dev, PoC) and eventually department (e.g. FIN, SCM, HR, etc.). One should try to balance between meaning and usefulness. Ideally, one should use 2 maximum 4 letters for each part encoded in the name. For example, the following scripts were created in the LH_SCM_PoC lakehouse.

More complex naming conventions can include the system (e.g. D365, CRM, EBS) or the company. The target is to easily identify the systems, independently how complex the rules are. Given that it can become challenging to rename the schemas afterwards, ideally the naming convention should be available from the early stages.

Create the Schema

A lakehouse comes with a dbo schema available by default, though it's recommended to create meaningful schema(s) as needed. The schemas should ideally reflect the domain of the data (e.g. departments or other key areas) and the schemas shouldn't change when the objects are deployed between the different environments. Upon case, one should consider creating multiple schemas that reflect the model's key areas. The names should be simple and suggestive.

-- create schema
CREATE Schema Orders

Create a Staging Area

The next step is to create a staging area where the files in scope can be made available and then further loaded in the lakehouse. One needs to compromise between creating a deep hierarchical structure that reflects the business structure and the need to easily identify, respectively manage the files. An hierarchical structure with 1-2 level could provide the needed compromise, though each additional level tends to increase the complexity.

One should also consider rules for archiving or even deleting the files.

Upload the Files

Microsoft Fabric allows users to upload multiple files together into a single step. Ideally the files should have proper names for each column, otherwise overheads deriving from this may appear later in the process.

When the files are available in multiple folders in a complex hierarchical structure, a set of shortcuts could help in their management.

Load the Data

A file's data can be loaded into the lakehouse on the fly by providing a valid table name:
Files >> SCM_Orders >> (select file) >> Load to Tables >> new table >> Load file to new table >> (provide information) >> Load

Load file to new table

Of course, the table's name must be unique within the Schema and the further properties must define files' definition.

One should consider loading first a couple of tables, performing a rough validation of the data imported, and only after that the remaining tables can be imported. This allows to identify the issues that typically lead to reimports of the data (wrong formatting, invalid column names, duplicated files, etc.) or rework.

If the files have different characteristics (e.g. delimiters, number of attributes/records, special data types), one should consider this in the initial scope and have at least one example from each category.

Review the Metadata

Once the files were made available, there's the tendency to start directly with the development without analyzing the data, or equally important, the metadata available. To review the metadata of the tables newly created, one can use the objects from the standard INFORMATION_SCHEMA (see post):

-- retrieve the list of tables
SELECT * 
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'orders'

ORDER BY TABLE_SCHEMA

Further on, one can review columns' definition:

-- retrieve column metadata
SELECT TABLE_CATALOG
, TABLE_SCHEMA
, TABLE_NAME
, COLUMN_NAME
, ORDINAL_POSITION
, DATA_TYPE
, CHARACTER_MAXIMUM_LENGTH
, NUMERIC_PRECISION
, NUMERIC_SCALE
, DATETIME_PRECISION
, CHARACTER_SET_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'orders'
ORDER BY ORDINAL_POSITION

It's a good idea to save the metadata to a file and use it later for reviews, respectively for metadata management, when no other solution is in place for the same (e.g. Purview). That's useful also for the people with limited or no access to the workspace.

Alternatively, one can use a notebook with the following SQL-based commands:

%%sql

DESCRIBE TABLE LH_SCM_PoC.orders.sales;

DESCRIBE TABLE EXTENDED LH_SCM_PoC.orders.sales;

One can even provide meaningful descriptions for each table and its attributes via scripts like the ones below, however it might be a good idea to do this in the later phases of the PoC, when the logic become stable:

%%sql

-- modify a table's COMMENT
COMMENT ON TABLE LH_SCM_PoC.orders.sales IS 'Customer sales orders';

-- modify columns' COMMENT for an existing table
ALTER TABLE LH_SCM_DWH.orders.sales  
ALTER COLUMN SalesOrderNumber COMMENT 'Sales Order Number';

Data Validation

Before diving into building any business logic, besides identifying the primary, foreign keys and further attributes used in bringing the data together, it's recommended to get an overview of data's intrinsic and extrinsic characteristics relevant to the analysis. Some of the rules used typically for studying the quality of data apply to some extent also in here, though one needs to prioritize accordingly, otherwise one replicates the effort that's typically part of the Data Management initiatives.

In addition, it's important to check how much the identified issues impact the business logic, respectively on whether the issues can be corrected to match the expectations. Often, no logic can compensate for major data quality issues, and this can also affect PoC's results as soon as the outcomes are validated against the expectations!

Data Understanding

Further on, it makes sense to get a high-level understanding of the data by looking at the distribution of values, respectively at the records participating in the joins. Of course, more similar queries can be built, though again, one should try to focus on the most important aspects!

The analysis could for example consider the following points:

/* validation of Products */

-- review duplicated product numbers (should be 0)
SELECT ProductName
, count(*) RecordCount
FROM orders.products
GROUP BY ProductName
HAVING count(*)>1

-- review most (in)expensive products
SELECT top 100 ProductID
, ProductName
, Category
, ListPrice 
FROM orders.products
ORDER BY ListPrice DESC --ASC

-- review category distribution
SELECT Category
, count(*) RecordCount 
FROM orders.products
GROUP BY Category
ORDER BY RecordCount DESC

-- review price ranges (
SELECT Len(floor(ListPrice)) RangeCount
, count(*) RecordCount 
FROM orders.products
GROUP BY Len(floor(ListPrice)) 
ORDER BY RangeCount DESC

/* validation of Customers */

-- duplicated email address 
SELECT CST.CustomerID
, CST.FirstName
, CST.LastName 
, CST.EmailAddress 
, DUP.RecordCount
FROM (-- duplicates
	SELECT EmailAddress
	, count(*) RecordCount 
	FROM orders.customers 
	GROUP BY EmailAddress 
	HAVING count(*)>1
	) DUP
	JOIN orders.customers CST
	   ON DUP.EmailAddress = CST.EmailAddress
ORDER BY DUP.RecordCount DESC
, DUP.EmailAddress 

-- duplicated Customer names (not necessarily duplicates)
SELECT CST.CustomerID
, CST.FirstName
, CST.LastName 
, CST.EmailAddress 
, DUP.RecordCount
FROM (-- duplicates
	SELECT FirstName
	, LastName
	, count(*) RecordCount 
	FROM orders.customers 
	GROUP BY FirstName
	, LastName 
	HAVING count(*)>1
	) DUP
	JOIN orders.customers CST
	   ON DUP.FirstName = CST.FirstName
      AND DUP.LastName = CST.LastName
ORDER BY DUP.RecordCount DESC
, DUP.FirstName
, DUP.LastName

/* validation of Orders */

-- review a typical order
SELECT SalesOrderID
, OrderDate
, CustomerID
, LineItem
, ProductID
, OrderQty
, LineItemTotal
FROM orders.orders
WHERE SalesOrderID = 71780
ORDER BY SalesOrderID 
, LineItem

-- review orders' distribution by month
SELECT Year(OrderDate) Year
, Month(OrderDate) Month
, count(*) RecordCount
FROM orders.orders
GROUP BY Year(OrderDate) 
, Month(OrderDate) 
ORDER BY Year
, Month

-- checking for duplicates
SELECT SalesOrderID
, LineItem
, count(*) RecordCount
FROM orders.orders ord 
GROUP BY SalesOrderID
, LineItem
HAVING count(*)>1

-- checking for biggest orders
SELECT SalesOrderID
, count(*) RecordCount
FROM orders.orders ord 
GROUP BY SalesOrderID
HAVING count(*) > 10
ORDER BY NoRecords DESC

-- checking for most purchased products
SELECT ProductID
, count(*) NoRecords
FROM orders.orders ord 
GROUP BY ProductID
HAVING count(*) > 8
ORDER BY NoRecords DESC

-- checking for most active customers
SELECT CustomerID
, count(*) RecordCount
FROM orders.orders ord 
GROUP BY CustomerID
HAVING count(*) > 10
ORDER BY RecordCount DESC

/* join checks */

-- Prders without Product (should be 0)
SELECT count(*) RecordCount
FROM orders.orders ord 
	 LEFT JOIN orders.products prd
	   ON ord.ProductID = prd.ProductID
WHERE prd.ProductID IS NULL

-- Prders without Customer (should be 0)
SELECT count(*) RecordCount
FROM orders.orders ORD 
	 LEFT JOIN orders.customers CST
	   ON ORD.CustomerID = CST.CustomerID
WHERE CST.CustomerID IS NULL

-- Products without Orders (153 records)
SELECT count(*) RecordCount
FROM orders.products prd
	 LEFT JOIN orders.orders ord 
	   ON prd.ProductID = ord.ProductID 
WHERE ord.ProductID IS NULL


-- Customers without Orders (815 records)
SELECT count(*) RecordCount
FROM orders.customers CST
	 LEFT JOIN orders.orders ORD
	   ON ORD.CustomerID = CST.CustomerID
WHERE ORD.CustomerID IS NULL

The more tables are involved, the more complex the validation logic can become. One should focus on the most important aspects.

Building the Logic

Once one has an acceptable understanding of the data entities involved and the relation between them, it's time to build the needed business logic by joining the various tables at the various levels of detail. One can focus on the minimum required, respectively attempt to build a general model that can address a broader set of requirements. For the PoC it's usually recommended to start small by addressing the immediate requirements, though some flexibility might be needed for exploring the data and preparing the logic for a broader set of requirements. Independently of the scope, one should consider a set of validations.

Usually, it makes sense to encapsulate the logic in several views or table-valued functions that reflect the logic for the main purposes and which allow a high degree of reuse (see [1]). Of course, one can use the standard approach for modelling the bronze, silver, respectively the gold layers adopted by many professionals. For a PoC, even if that's not mandatory, it might still be a good idea to make steps in the respective direction.

In this case, dealing with only three tables - a fact table and two dimensions table - there are several perspectives that can be built:

a) all records from fact table + dimension records

The following view provides the lowest level of details for the fact table, allowing thus to look at the data from different perspectives as long as focus is only the values used is Sales Orders:

-- create the view
CREATE OR ALTER VIEW orders.vSalesOrders
-- Sales Orders with Product & Customer information
AS
SELECT ORD.SalesOrderID
, ORD.OrderDate
, ORD.CustomerID
, CST.FirstName 
, CST.LastName
, CST.EmailAddress
, ORD.LineItem
, ORD.ProductID
, PRD.ProductName 
, PRD.Category
, ORD.OrderQty
, ORD.LineItemTotal
, PRD.ListPrice 
, ORD.OrderQty * PRD.ListPrice ListPriceTotal
FROM orders.orders ORD 
	 JOIN orders.products PRD
	   ON ORD.ProductID = PRD.ProductID
	 JOIN orders.customers CST
	   ON ORD.CustomerID = CST.CustomerID

-- test the view   
SELECT *
FROM orders.vSalesOrders
WHERE SalesOrderID = 71780

One can use full joins unless some of the references dimensions are not available.

b) aggregated data for all dimension combinations

The previous view allows to aggregate the data at the various levels of details:

-- Sales volume by Customer & Product
SELECT ORD.EmailAddress
, ORD.ProductName 
, ORD.Category
, SUM(ORD.OrderQty) OrderQty
, SUM(ORD.LineItemTotal) LineItemTotal
FROM orders.vSalesOrders ORD 
WHERE ORD.OrderDate >= '2022-06-01'
  AND ORD.OrderDate < '2022-07-01'
GROUP BY ORD.EmailAddress
, ORD.ProductName 
, ORD.Category
ORDER BY ORD.EmailAddress
, ORD.ProductName

One can comment out the dimensions not needed. The query can be included in a view as well.

c) all records from each dimension table + aggregated fact records

Sometimes, it's useful to look at the data from a dimension's perspective, though it might be needed to create such an object for each dimension, like in the below examples. For the maximum of flexibility the logic can be included in a table-valued function:

-- create the user-defined function
CREATE OR ALTER FUNCTION orders.tvfProductsSalesVolume(
    @StartDate date NULL,
    @EndDate date NULL
)
RETURNS TABLE
-- Sales volume by Product
AS
RETURN (
SELECT PRD.ProductID
, PRD.ProductName 
, PRD.Category
, ORD.FirstOrderDate
, ORD.LastOrderDate 
, IsNull(ORD.TotalSalesQty, 0) TotalSalesQty 
, IsNull(ORD.TotalSalesValue, 0) TotalSalesValue
, IsNull(ORD.OrderCount, 0) OrderCount
, IsNull(ORD.LineCount, 0) LineCount
FROM orders.products PRD
     OUTER APPLY (
		SELECT Min(ORD.OrderDate) FirstOrderDate
		, Max(ORD.OrderDate) LastOrderDate 
		, SUM(ORD.OrderQty) TotalSalesQty
		, SUM(ORD.LineItemTotal) TotalSalesValue
		, count(DISTINCT SalesOrderID) OrderCount
		, count(*) LineCount
		FROM orders.orders ORD 
		WHERE ORD.ProductID = PRD.ProductID
		  AND ORD.OrderDate >= @StartDate 
		  AND ORD.OrderDate < @EndDate 
	 ) ORD
);

-- test the user-defined function
SELECT *
FROM orders.tvfProductsSalesVolume('2022-06-01','2022-07-01') PRD
WHERE TotalSalesValue <> 0
ORDER BY TotalSalesValue DESC
, LastOrderDate DESC


-- create the user-defined function
CREATE OR ALTER FUNCTION orders.tvfCustomersSalesVolume(
    @StartDate date NULL,
    @EndDate date NULL
)
RETURNS TABLE
-- Sales volume by Customer
AS
RETURN (
SELECT CST.CustomerID
, CST.FirstName 
, CST.LastName
, CST.EmailAddress
, ORD.FirstOrderDate
, ORD.LastOrderDate 
, IsNull(ORD.TotalSalesValue, 0) TotalSalesValue
, IsNull(ORD.OrderCount, 0) OrderCount
, IsNull(ORD.LineCount, 0) LineCount
FROM orders.customers CST
     OUTER APPLY (
		SELECT Min(ORD.OrderDate) FirstOrderDate
		, Max(ORD.OrderDate) LastOrderDate 
		, SUM(ORD.LineItemTotal) TotalSalesValue
		, count(DISTINCT SalesOrderID) OrderCount
		, count(*) LineCount
		FROM orders.orders ORD 
		WHERE ORD.CustomerID = CST.CustomerID
		  AND ORD.OrderDate >= @StartDate 
		  AND ORD.OrderDate < @EndDate 
	 ) ORD
);

-- test the user-defined function
SELECT *
FROM orders.tvfCustomersSalesVolume('2022-06-01','2022-07-01') PRD
WHERE TotalSalesValue <> 0
ORDER BY TotalSalesValue DESC
, LastOrderDate DESC

When restructuring the queries in similar ways, there's always a compromise between the various factors: (re)usability, performance or completeness.

Further Comments

The above database objects should allow users to address most of the requirements, though, as usual, there can be also exceptions, especially when the data needs to be aggregated at a different level of detail that requires the query to be structured differently.

The number of perspectives can increase also with the number of fact tables used to model a certain entity (e.g. Sales order headers vs. lines). For example,

In theory, one can also find ways to automate the process of creating database objects, though one must choose the relevant attributes, respectively include logic that makes sense only within a certain perspective.

No matter the data, respectively systems used as source, expect surprises and test your assumptions! For example, in the file used to create the orders.customers table, there seem to be duplicated entities with the same name and email address. One must clarify how such entities must be handled in data analysis, respectively in data modeling. For example, a person can appear twice because of the roles associated with the name or can be other entitled reasons.

The files in scope of this post are small compared with the files existing in organizations. In many scenarios files' size could range from GB to TB and thus require partitioning and different other strategies.

|>> Next Post

References
[1] sql-troubles (2023) Architecture Part IV: Building a Modern Data Warehouse with Azure Synapse [link]

Resources
[1] Microsoft Learn (2024) Fabric: Lakehouse and Delta Lake tables [link]

16 March 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part XI: Database and Server Properties)

When taking over a SQL Server, respectively database, one of the first checks I do focuses on the overall configuration, going through the UI available for admins to see if I can find anything that requires further investigation. If no documentation is available on the same, I run a few scripts and export their output as baseline.

Especially when documenting the configuration, it's useful to export the database options and properties defined at database level. Besides the collation and probably the recovery mode, typically the rest of the configuration is similar, though in exceptional cases one should expect also surprises that require further investigation!

The following query retrieves in a consolidated way all the options and properties of a SQL database in Microsoft Fabric.

-- database settings/properties 
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') Collation
--, DATABASEPROPERTYEX(DB_NAME(), 'ComparisonStyle')  ComparisonStyle
, DATABASEPROPERTYEX(DB_NAME(), 'Edition') Edition
--, DATABASEPROPERTYEX(DB_NAME(), 'IsAnsiNullDefault') IsAnsiNullDefault
--, DATABASEPROPERTYEX(DB_NAME(), 'IsAnsiNullsEnabled') IsAnsiNullsEnabled
--, DATABASEPROPERTYEX(DB_NAME(), 'IsAnsiPaddingEnabled') IsAnsiPaddingEnabled
--, DATABASEPROPERTYEX(DB_NAME(), 'IsAnsiWarningsEnabled') IsAnsiWarningsEnabled
--, DATABASEPROPERTYEX(DB_NAME(), 'IsArithmeticAbortEnabled') IsArithmeticAbortEnabled
--, DATABASEPROPERTYEX(DB_NAME(), 'IsAutoClose') IsAutoClose
, DATABASEPROPERTYEX(DB_NAME(), 'IsAutoCreateStatistics') IsAutoCreateStatistics
--, DATABASEPROPERTYEX(DB_NAME(), 'IsAutoCreateStatisticsIncremental') IsAutoCreateStatisticsIncremental
--, DATABASEPROPERTYEX(DB_NAME(), 'IsAutoShrink') IsAutoShrink
, DATABASEPROPERTYEX(DB_NAME(), 'IsAutoUpdateStatistics') IsAutoUpdateStatistics
--, DATABASEPROPERTYEX(DB_NAME(), 'IsClone') IsClone
--, DATABASEPROPERTYEX(DB_NAME(), 'IsCloseCursorsOnCommitEnabled') IsCloseCursorsOnCommitEnabled
--, DATABASEPROPERTYEX(DB_NAME(), 'IsDatabaseSuspendedForSnapshotBackup') IsDatabaseSuspendedForSnapshotBackup
, DATABASEPROPERTYEX(DB_NAME(), 'IsFulltextEnabled') IsFulltextEnabled
--, DATABASEPROPERTYEX(DB_NAME(), 'IsInStandBy') IsInStandBy
--, DATABASEPROPERTYEX(DB_NAME(), 'IsLocalCursorsDefault') IsLocalCursorsDefault
--, DATABASEPROPERTYEX(DB_NAME(), 'IsMemoryOptimizedElevateToSnapshotEnabled') IsMemoryOptimizedElevateToSnapshotEnabled
--, DATABASEPROPERTYEX(DB_NAME(), 'IsMergePublished') IsMergePublished
--, DATABASEPROPERTYEX(DB_NAME(), 'IsNullConcat') IsNullConcat
--, DATABASEPROPERTYEX(DB_NAME(), 'IsNumericRoundAbortEnabled') IsNumericRoundAbortEnabled
--, DATABASEPROPERTYEX(DB_NAME(), 'IsParameterizationForced') IsParameterizationForced
--, DATABASEPROPERTYEX(DB_NAME(), 'IsQuotedIdentifiersEnabled') IsQuotedIdentifiersEnabled
--, DATABASEPROPERTYEX(DB_NAME(), 'IsPublished') IsPublished
--, DATABASEPROPERTYEX(DB_NAME(), 'IsRecursiveTriggersEnabled') IsRecursiveTriggersEnabled
--, DATABASEPROPERTYEX(DB_NAME(), 'IsSubscribed') IsSubscribed
--, DATABASEPROPERTYEX(DB_NAME(), 'IsSyncWithBackup') IsSyncWithBackup
--, DATABASEPROPERTYEX(DB_NAME(), 'IsTornPageDetectionEnabled') IsTornPageDetectionEnabled
--, DATABASEPROPERTYEX(DB_NAME(), 'IsVerifiedClone') IsVerifiedClone
--, DATABASEPROPERTYEX(DB_NAME(), 'IsXTPSupported') IsXTPSupported
, DATABASEPROPERTYEX(DB_NAME(), 'LastGoodCheckDbTime') LastGoodCheckDbTime
, DATABASEPROPERTYEX(DB_NAME(), 'LCID') LCID
--, DATABASEPROPERTYEX(DB_NAME(), 'MaxSizeInBytes') MaxSizeInBytes
, DATABASEPROPERTYEX(DB_NAME(), 'Recovery') Recovery
--, DATABASEPROPERTYEX(DB_NAME(), 'ServiceObjective') ServiceObjective
--, DATABASEPROPERTYEX(DB_NAME(), 'ServiceObjectiveId') ServiceObjectiveId
, DATABASEPROPERTYEX(DB_NAME(), 'SQLSortOrder') SQLSortOrder
, DATABASEPROPERTYEX(DB_NAME(), 'Status') Status
, DATABASEPROPERTYEX(DB_NAME(), 'Updateability') Updateability
, DATABASEPROPERTYEX(DB_NAME(), 'UserAccess') UserAccess
, DATABASEPROPERTYEX(DB_NAME(), 'Version') Version
--, DATABASEPROPERTYEX(DB_NAME(), 'ReplicaID') ReplicaID

Output:

Collation

Edition

IsAutoCreateStatistics

IsAutoUpdateStatistics

IsFulltextEnabled

LastGoodCheckDbTime

LCID

Recovery

SQLSortOrder

Status

Updateability

UserAccess

Version

SQL_Latin1_General_CP1_CI_AS

FabricSQLDB

12/31/1899

1033

FULL

ONLINE

READ_WRITE

MULTI_USER

981

The query can be run also against the SQL analytics endpoints available for warehouses in Microsoft Fabric.

Output:

Collation

Edition

IsAutoCreateStatistics

IsAutoUpdateStatistics

IsFulltextEnabled

LastGoodCheckDbTime

LCID

Recovery

SQLSortOrder

Status

Updateability

UserAccess

Version

Latin1_General_100_BIN2_UTF8

DataWarehouse

12/31/1899

1033

SIMPLE

ONLINE

READ_WRITE

MULTI_USER

987

Respectively, for lakehouses:

Collation

Edition

IsAutoCreateStatistics

IsAutoUpdateStatistics

IsFulltextEnabled

LastGoodCheckDbTime

LCID

Recovery

SQLSortOrder

Status

Updateability

UserAccess

Version

Latin1_General_100_BIN2_UTF8

LakeWarehouse

12/31/1899

1033

SIMPLE

ONLINE

READ_WRITE

MULTI_USER

987

A similar output is obtained if one runs the query against SQL database's SQL analytics endpoint:

Output:

Collation

Edition

IsAutoCreateStatistics

IsAutoUpdateStatistics

IsFulltextEnabled

LastGoodCheckDbTime

LCID

Recovery

SQLSortOrder

Status

Updateability

UserAccess

Version

Latin1_General_100_BIN2_UTF8

LakeWarehouse

12/31/1899

1033

SIMPLE

ONLINE

READ_WRITE

MULTI_USER

987

SQL databases seem to inherit the collation from the earlier versions of SQL Server.

Another meaningful value for SQL databases is MaxSizeInBytes, which in my environment had a value of 3298534883328 bytes ÷ 1,073,741,824 = 3,072 GB.

There are however also server properties. Here's the consolidated overview:

-- server properties
SELECT --SERVERPROPERTY('BuildClrVersion') BuildClrVersion
 SERVERPROPERTY('Collation') Collation
--, SERVERPROPERTY('CollationID') CollationID
, SERVERPROPERTY('ComparisonStyle') ComparisonStyle
--, SERVERPROPERTY('ComputerNamePhysicalNetBIOS') ComputerNamePhysicalNetBIOS
, SERVERPROPERTY('Edition') Edition
--, SERVERPROPERTY('EditionID') EditionID
, SERVERPROPERTY('EngineEdition') EngineEdition
--, SERVERPROPERTY('FilestreamConfiguredLevel') FilestreamConfiguredLevel
--, SERVERPROPERTY('FilestreamEffectiveLevel') FilestreamEffectiveLevel
--, SERVERPROPERTY('FilestreamShareName') FilestreamShareName
--, SERVERPROPERTY('HadrManagerStatus') HadrManagerStatus
--, SERVERPROPERTY('InstanceDefaultBackupPath') InstanceDefaultBackupPath
, SERVERPROPERTY('InstanceDefaultDataPath') InstanceDefaultDataPath
--, SERVERPROPERTY('InstanceDefaultLogPath') InstanceDefaultLogPath
--, SERVERPROPERTY('InstanceName') InstanceName
, SERVERPROPERTY('IsAdvancedAnalyticsInstalled') IsAdvancedAnalyticsInstalled
--, SERVERPROPERTY('IsBigDataCluster') IsBigDataCluster
--, SERVERPROPERTY('IsClustered') IsClustered
, SERVERPROPERTY('IsExternalAuthenticationOnly') IsExternalAuthenticationOnly
, SERVERPROPERTY('IsExternalGovernanceEnabled') IsExternalGovernanceEnabled
, SERVERPROPERTY('IsFullTextInstalled') IsFullTextInstalled
--, SERVERPROPERTY('IsHadrEnabled') IsHadrEnabled
--, SERVERPROPERTY('IsIntegratedSecurityOnly') IsIntegratedSecurityOnly
--, SERVERPROPERTY('IsLocalDB') IsLocalDB
--, SERVERPROPERTY('IsPolyBaseInstalled') IsPolyBaseInstalled
--, SERVERPROPERTY('IsServerSuspendedForSnapshotBackup') IsServerSuspendedForSnapshotBackup
--, SERVERPROPERTY('IsSingleUser') IsSingleUser
--, SERVERPROPERTY('IsTempDbMetadataMemoryOptimized') IsTempDbMetadataMemoryOptimized
, SERVERPROPERTY('IsXTPSupported') IsXTPSupported
, SERVERPROPERTY('LCID') LCID
, SERVERPROPERTY('LicenseType') LicenseType
, SERVERPROPERTY('MachineName') MachineName
, SERVERPROPERTY('NumLicenses') NumLicenses
, SERVERPROPERTY('PathSeparator') PathSeparator
--, SERVERPROPERTY('ProcessID') ProcessID
, SERVERPROPERTY('ProductBuild') ProductBuild
--, SERVERPROPERTY('ProductBuildType') ProductBuildType
--, SERVERPROPERTY('ProductLevel') ProductLevel
--, SERVERPROPERTY('ProductMajorVersion') ProductMajorVersion
--, SERVERPROPERTY('ProductMinorVersion') ProductMinorVersion
--, SERVERPROPERTY('ProductUpdateLevel') ProductUpdateLevel
--, SERVERPROPERTY('ProductUpdateReference') ProductUpdateReference
--, SERVERPROPERTY('ProductUpdateType') ProductUpdateType
, SERVERPROPERTY('ProductVersion') ProductVersion
, SERVERPROPERTY('ResourceLastUpdateDateTime') ResourceLastUpdateDateTime
, SERVERPROPERTY('ResourceVersion') ResourceVersion
, SERVERPROPERTY('ServerName') ServerName
, SERVERPROPERTY('SqlCharSet') SqlCharSet
, SERVERPROPERTY('SqlCharSetName') SqlCharSetName
, SERVERPROPERTY('SqlSortOrder') SqlSortOrder
, SERVERPROPERTY('SqlSortOrderName') SqlSortOrderName
, SERVERPROPERTY('SuspendedDatabaseCount') SuspendedDatabaseCount

Output (consolidated):

Property	SQL database	Warehouse	Lakehouse
Collation	SQL_Latin1_General_CP1_CI_AS	SQL_Latin1_General_CP1_CI_AS	SQL_Latin1_General_CP1_CI_AS
ComparisonStyle	196609	196609	196609
Edition	SQL Azure	SQL Azure	SQL Azure
EngineEdition	12	11	11
InstanceDefaultDataPath	NULL	NULL	NULL
IsAdvancedAnalyticsInstalled	1	1	1
IsExternalAuthenticationOnly	1	0	0
IsExternalGovernanceEnabled	1	1	1
IsFullTextInstalled	1	0	0
IsXTPSupported	1	1	1
LCID	1033	1033	1033
LicenseType	DISABLED	DISABLED	DISABLED
MachineName	NULL	NULL	NULL
NumLicenses	NULL	NULL	NULL
PathSeparator	\	\	\
ProductBuild	2000	502	502
ProductVersion	12.0.2000.8	12.0.2000.8	12.0.2000.8
ResourceLastUpdateDateTime	11/6/2024 3:41:27 PM	3/5/2025 12:05:50 PM	3/5/2025 12:05:50 PM
ResourceVersion	16.00.5751	17.00.502	17.00.502
ServerName	...	....	....
SqlCharSet	1	1	1
SqlCharSetName	iso_1	iso_1	iso_1
SqlSortOrder	52	52	52
SqlSortOrderName	nocase_iso	nocase_iso	nocase_iso
SuspendedDatabaseCount	NULL	0	0

It's interesting that all three instances have the same general collation, while the Engine Edition of SQL databases is not compatible with the others [2]. The Server Names has been removed manually from the output from obvious reasons. The warehouse and lakehouse are in the same environment (SQL Azure instance, see sys.databases), and therefore the same values are shown (though this might happen independently of the environments used).

The queries were run in a trial Microsoft Fabric environment. Other environments can have upon case different properties. Just replace the "--" from the commented code to get a complete overview.

The queries should run also in the other editions of SQL Server. If DATABASEPROPERTYEX is not supported, one should try DATABASEPROPERTY instead.

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) SQL Server 2022: DATABASEPROPERTYEX (Transact-SQL) [link]
[2] Microsoft Learn (2024) SQL Server 2022: SERVERPROPERTY (Transact-SQL) [link]

06 March 2025

💎🤖SQL Reloaded: Copilot Stories (Part IX: Working with the Stuff Function)

One of the underused string functions in SQL Server is Stuff, which returns part of a character, binary, text, or image expression. This post evaluates the answers provided by Microsoft 365 Copilot in the use of the respective function.

Disclaimer: The below code was generated by Copilot via prompts and is used for demonstration purposes!

Prompt:
Give me 10 scenarios with examples in which the Stuff function can be used in SQL Server.

Output:

/* Example 1: Inserting a Substring */

-- Insert a substring into a specific position within a string
SELECT STUFF('Hello World', 6, 0, 'SQL ') AS Result; -- Result: 'Hello SQL World'

/* Example 2: Replacing a Substring */

-- Replace a part of a string with another substring.
SELECT STUFF('Hello World', 7, 5, 'SQL Server') AS Result; -- Result: 'Hello SQL Server'

/* Example 3: Removing Characters */

-- Remove a specific number of characters from a string
SELECT STUFF('Hello World', 6, 5, '') AS Result; -- Result: 'Hello'

/* Example 4: Masking Sensitive Data */

-- Mask a credit card number
SELECT STUFF('1234-5678-9012-3456', 6, 9, '****-****') AS MaskedCard; -- Result: '1234-****-****-3456'

/* Example 5: Formatting Phone Numbers */

-- insert formatting characters into a phone number.
SELECT STUFF(STUFF('1234567890', 4, 0, '-'), 8, 0, '-') AS FormattedPhone; -- Result: '123-456-7890'

/* Example 6: Creating Initials */

-- extract initials from a full name.
SELECT STUFF(STUFF('John Doe', 2, 3, '.'), 5, 3, '.') AS Initials; -- Result: 'J.D.'

/* Example 7: Dynamic SQL Generation */

-- modify parts of a dynamic SQL query.
DECLARE @sql NVARCHAR(MAX) = 'SELECT * FROM Table WHERE Column = Value';
SELECT STUFF(@sql, CHARINDEX('Value', @sql), LEN('Value'), 'NewValue') AS ModifiedSQL; -- Result: 'SELECT * FROM Table WHERE Column = NewValue'

/* Example 8: URL Manipulation */

-- Insert or replace parts of a URL.
SELECT STUFF('http://example.com/page', 8, 0, 'www.') AS ModifiedURL; -- Result: 'http://www.example.com/page'

/* Example 9: Handling CSV Data */

-- Modify parts of a CSV string
SELECT STUFF('1,2,3,4,5', 3, 1, '10') AS ModifiedCSV; -- Result: '1,10,3,4,5'

/* Example 10: Custom String Formatting */ 

-- Insert custom formatting into a string
SELECT STUFF('20250305', 5, 0, '-') AS FormattedDate; -- Result: '2025-0305'

The first 6 examples are typical usages of the Stuff function while the other might raise the eyebrows of some developers.

Example 1: The example is not entirely correct (counting the characters starts at 1).

Example 2: In exchange, this example is correct.

Example 3: The example is unfortunately incorrect (same issue like in the first example). The technique can be useful when only one contiguous region is to be replaced once. Otherwise, one should use the Replace function.

Example 4: Starting with SQL Server 2016 one can use dynamic masking for similar purposes, which should provide more flexibility in many cases.

Example 5: Formatting phone number can be more complex than this because there can be different formats, country codes or any other special formats.

Example 6: Names can be quite complex, each part having different size.

Example 7: The example might work in certain cases, though it's enough for a field to contain the keyword "value" for the whole logic to generate errors.

Example 8: The example is valid, though web addresses can use also other transfer protocols (ftp, ftps, sftp, scp, etc.).

Example 9: A more reliable method is splitting the string in parts and replacing the value from the nth position then inserting a text when the length of the values is not known.

Example 10: Even if the technique is valid, the example isn't entirely valid even if it might be met in practice. See 20th example below for a better approach.

Prompt:
Give me 10 more scenarios with examples in which the Stuff function can be used in SQL Server.

Output:

/* Example 11: Reversing a string */

-- Reverse a string by replacing characters in a specific order.
SELECT STUFF(STUFF(STUFF('abcde', 1, 1, 'e'), 2, 1, 'd'), 3, 1, 'c') AS ReversedString; -- Result: 'edcba'

/* Example 12: Generating unique identifiers */

-- Modify parts of a GUID to create a custom identifier.
SELECT STUFF('123e4567-e89b-12d3-a456-426614174000', 1, 8, 'abcdef12') AS CustomGUID;-- Result: 'abcdef12-e89b-12d3-a456-426614174000'

/* Example 13: Combining columns */

-- Combine multiple columns into a single formatted string.
SELECT STUFF(FirstName + ' ' + LastName, LEN(FirstName) + 1, 0, ', ') AS FullName
FROM Employees;-- Result: 'John, Doe'

/* Example 14: Trimming strings */

-- Remove leading or trailing characters from a string.
SELECT STUFF('   Hello World   ', 1, 3, '') AS TrimmedString;-- Result: 'Hello World   '

/* Example 15: Creating abbreviations */

-- Generate abbreviations from a full string.
SELECT STUFF('North Atlantic Treaty Organization', 6, 22, 'ATO') AS Abbreviation;-- Result: 'NATO'

/* Example 16: Modifying JSON Data */

-- Insert or replace parts of a JSON string 
SELECT STUFF('{"name":"John","age":30}', 9, 4, 'Doe') AS ModifiedJSON;-- Result: '{"name":"Doe","age":30}'

/* Example 17: Handling XML Data */

-- Modify parts of an XML string.
SELECT STUFF('<name>John</name>', 7, 4, 'Doe') AS ModifiedXML;-- Result: '<name>Doe</name>'

/* Example 18: Generating random strings */

-- Insert random characters into a string.
SELECT STUFF('abcdef', 3, 0, CHAR(65 + ABS(CHECKSUM(NEWID())) % 26)) AS RandomString;-- Result: 'abXcdef' (X is a random character)

/* Example 19: Creating custom error messages  */

-- Insert dynamic values into error messages.
DECLARE @ErrorMessage NVARCHAR(100) = 'Error: Value {0} is not valid.';
SELECT STUFF(@ErrorMessage, CHARINDEX('{0}', @ErrorMessage), 3, '123') AS CustomErrorMessage;-- Result: 'Error: Value 123 is not valid.'

/* Example 20: Formatting dates */

-- Insert separators into a date string.
SELECT STUFF(STUFF('20250305', 5, 0, '-'), 8, 0, '-') AS FormattedDate; -- Result: '2025-03-05'

Example 11: Copilot got the use partially wrong given that each character except the pivoting character needs to be replaced accordingly. So for 6 characters one needs to replace the Stuff function 5 times!

Example 12: Implementing custom GUID is a process more complex than this as one needs to take care of not generating duplicates.

Example 13: This is an example on how to handle changes dynamically.

Example 14: One should use the Trim function whenever possible, respectively the combination LTrim and RTrim, if Trim is not available (it was introduced in SQL 2017).

Example 15: The example is incorrect. One should consider the length of the string in the formula.

Example 16: One must know the values in advance, otherwise the example doesn't hold. Moreover, the same issue like in the first example occurs.

Example 17: The example in not dynamic.

Example 18: It's an interesting technique for generating "random" characters given that unique values are generated across a dataset.

Example 19: One can write error messages with multiple placeholders, though the Replace function is simpler to use.

Example 20: It's easier to cast the value as datetime and apply the required formatting accordingly. Not testing whether the value is a date can lead to curious results.

I met some of the usages exemplified above, though I used the Stuff function seldom (see a previous post), when no other functions were available. Frankly, Copilot could prove to be a useful tool for learning SQL or other programming language in similar ways.

Happy coding!

Previous Post <<||>> Next Post

SQL Troubles

Pages

24 April 2025

💎🏭SQL Reloaded: Microsoft Fabric's Lakehouses at Work (Part I: Proof-of-Concept)

16 March 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part XI: Database and Server Properties)

06 March 2025

💎🤖SQL Reloaded: Copilot Stories (Part IX: Working with the Stuff Function)

About Me