Showing posts with label UDFs. Show all posts
Showing posts with label UDFs. Show all posts

21 October 2023

🧊Data Warehousing: Architecture V (Dynamics 365, the Data Lakehouse and the Medallion Architecture)

Data Warehousing
Data Warehousing Series

An IT architecture is built and functions under a set of constraints that derive from architecture’s components. Usually, if we want flexibility or to change something in one area, this might have an impact in another area. This rule applies to the usage of the medallion architecture as well! 

In Data Warehousing the medallion architecture considers a multilayered approach in building a single source of truth, each layer denoting the quality of data stored in the lakehouse [1]. For the moment are defined 3 layers - bronze for raw data, silver for validated data, and gold for enriched data. The concept seems sound considering that a Data Lake contains all types of raw data of different quality that needs to be validated and prepared for reporting or other purposes.

On the other side there are systems like Dynamics 365 that synchronize the data in near-real-time to the Data Lake through various mechanisms at table and/or data entity level (think of data entities as views on top of other tables or views). The databases behind are relational and in theory the data should be of proper quality as needed by business.

The greatest benefit of serverless SQL pool is that it can be used to build near-real-time data analytics solutions on top of the files existing in the Data Lake and the mechanism is quite simple. On top of such files are built external tables in serverless SQL pool, tables that reflect the data model from the source systems. The external tables can be called as any other tables from the various database objects (views, stored procedures and table-valued functions). Thus, can be built an enterprise data model with dimensions, fact-like and mart-like entities on top of the synchronized filed from the Data Lake. The Data Lakehouse (= Data Warehouse + Data Lake) thus created can be used for (enterprise) reporting and other purposes.

As long as there are no special requirements for data processing (e.g. flattening hierarchies, complex data processing, high-performance, data cleaning) this approach allows to report the data from the data sources in near-real time (10-30 minutes), which can prove to be useful for operational and tactical reporting. Tapping into this model via standard Power BI and paginated reports is quite easy. 

Now, if it's to use the data medallion approach and rely on pipelines to process the data, unless one is able to process the data in near-real-time or something compared with it, a considerable delay will be introduced, delay that can span from a couple of hours to one day. It's also true that having the data prepared as needed by the reports can increase the performance considerably as compared to processing the logic at runtime. There are advantages and disadvantages to both approaches. 

Probably, the most important scenario that needs to be handled is that of integrating the data from different sources. If unique mappings between values exist, unique references are available in one system to the records from the other system, respectively when a unique logic can be identified, the data integration can be handled in serverless SQL pool.

Unfortunately, when compared to on-premise or Azure SQL functionality, the serverless SQL pool has important constraints - it's not possible to use scalar UDFs, tables, recursive CTEs, etc. So, one needs to work around these limitations and in some cases use the Spark pool or pipelines. So, at least for exceptions and maybe for strategic reporting a medallion architecture can make sense and be used in parallel. However, imposing it on all the data can reduce flexibility!

Bottom line: consider the architecture against your requirements!

Previous Post <<||>>> Next Post

[1] What is the medallion lakehouse architecture?
https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion

01 February 2023

💎SQL Reloaded: Alternatives for Better Code Maintainability in SQL Server & Azure Synapse I

Introduction

Queries can become quite complex and with the increased complexity they'll be harder to read and/or maintain. Since the early days of SQL Server, views and table-valued user-defined functions (UDFs) are the standard ways of encapsulating logic for reuse in queries, allowing to minimize the duplication of logic. Then came the common table expressions (CTEs), which allow a further layer of structuring the code, independently whether a view or UDF was used. 

These are the main 3 options that can be combined in various ways to better structure the code. On the other side, also a temporary table or table variable could be used for the same purpose, though they have further implications.

To exemplify the various approaches, let's consider a simple query based on two tables from the AdventureWorks database. For the sake of simplicity, further business rules have been left out.

Inline Subqueries

-- products with open purchase orders
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN ( -- cumulated open purchase orders by product
		SELECT POL.ProductId 
		, SUM(POL.OrderQty) PurchQty
		FROM Purchasing.PurchaseOrderDetail POL
		WHERE OrderQty - (ReceivedQty - RejectedQty)>0
		GROUP BY POL.ProductId 
	) POL
	ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

As can be seen, the logic for the "Open purchase orders" result set is built within an inline subquery (aka inline view). As its logic becomes more complex, the simplest way to handle this is to move it into a CTE.

Common Table Expressions (CTEs)

A common table expression can be thought of as a temporary result set defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE or CREATE VIEW statement [1]. Thus, the CTE can't be reused between queries.

The inline query is moved at the beginning within a WITH statement to which is given a proper name that allows easier identification later:

-- products with open purchase orders (common table expression)
WITH OpenPOs
AS (-- cumulated open purchase orders by product
	SELECT POL.ProductId 
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
	GROUP BY POL.ProductId 
)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Thus, this allows us to rewrite the JOIN as if it were between two tables. Multiple CTEs can be used as well, with or without any dependencies between them. Moreover, CTEs allow building recursive queries (see example).

There is no performance gain or loss by using a CTE. It's important to know that the result set is not cached, therefore, if the same CTE is called multiple times (within a query), it will be also "executed" for the same number of times. Except the cases in which the database engine uses a spool operator to save intermediate query results for a CTE, there will be created no work table in tempdb for CTEs.

If the inline query needs to be reused in several queries, defining a view is a better alternative.

Views

A view is a database object used to encapsulate a query and that can be referenced from other queries much like a table. In fact, it's also referred as a "virtual table". A view can't be execute by itself (as stored procedures do. No data, only the definition of the view is stored, and the various actions that can be performed on database objects can be performed on views as well.

-- creating the view
CREATE VIEW dbo.vOpenPurchaseOrders
AS
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- testing the view
SELECT top 10 *
FROM dbo.vOpenPurchaseOrders

Once the view is created, it can be called from any query:

-- products with open purchase orders (table-valued function)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN dbo.vOpenPurchaseOrders POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Besides the schema binding, there are no additional costs for using views. However, views have several limitations (see [2]). Moreover, it's not possible to use parameters with views, scenarios in which tabled-valued UDFs can help.

Indexed Views 

Starting with SQL Server 2015, it's possible to materialize the data in a view, storing the results of the view in a clustered index on the disk in same way a table with a clustered index is stored. This type of view is called an indexed view (aka materialized view, though the concept is used slightly different in Azure Synapse) and for long-running queries can provide considerable performance gains. In case the view contains a GROUP BY is present, its definition must contain COUNT_BIG(*) and must not contain HAVING.

-- dropping the view
--DROP VIEW IF EXISTS Purchasing.vOpenPOs

-- create view
CREATE VIEW Purchasing.vOpenPOs
WITH SCHEMABINDING
AS
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
, COUNT_BIG(*) Count
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 
GO

--Create an index on the view.
CREATE UNIQUE CLUSTERED INDEX IDX_vOpenPOs
   ON Purchasing.vOpenPOs (ProductId);

--testing the view
SELECT top 100 *
FROM Purchasing.vOpenPOs

-- products with open purchase orders (indexed view)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN Purchasing.vOpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

When an indexed view is defined on a table, the query optimizer may use it to speed up the query execution even if it wasn't referenced in the query. Besides the restriction of the view to be deterministic, further limitations apply (see [6]).

Table-Valued Functions

A table-valued function is a user-defined function in which returns a table as a result, as opposed to a single data value, as scalar functions do.

Let's support that we need to restrict base the logic based on a time interval. We'd need then to provide the StartDate & EndDate as parameters. Compared with other UDFs table-valued functions, as their name implies, need to return a table:

-- creating the UDF function 
CREATE FUNCTION dbo.tvfOpenPurchaseOrdersByProduct( 
  @StartDate date 
, @EndDate date) 
RETURNS TABLE 
AS RETURN ( 
	SELECT POL.ProductId 
	, SUM(POL.OrderQty) PurchQty
	FROM Purchasing.PurchaseOrderDetail POL
	WHERE OrderQty - (ReceivedQty - RejectedQty)>0
	  AND POL.DueDate BETWEEN @StartDate AND @EndDate
	GROUP BY POL.ProductId 
)

-- testing the UDF
SELECT top 10 *
FROM dbo.tvfOpenPurchaseOrdersByProduct('2014-01-01', '2014-12-31')

A table-valued function can be used as a "table with parameters" in JOINs:

-- products with open purchase orders (table-valued function)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM Production.Product ITM
     JOIN dbo.tvfOpenPurchaseOrdersByProduct('2014-01-01', '2014-12-31') POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

The parameters are optional, though in such cases using a view might still be a better idea. Table-valued functions used to have poor performance in the past compared with views and in certain scenarios they might still perform poorly. Their benefit resides in allowing to pass and use parameters in the logic, which can make them irreplaceable. Moreover, multi-statement table-valued functions can be built as well (see example)!

Notes:
1) When evaluating table-valued functions for usage consider their limitations as well (see [3])!
2) Scalar UDFs can be used to simplify the code as well, though they apply only to single values, therefore they are not considered in here!

Temporary Tables 

A temporary table is a base table that is stored and managed in tempdb as any other table. It exists only while the database session in which it was created is active. Therefore, it can be called multiple times, behaving much like a standard table:

-- create the temp table
CREATE TABLE dbo.#OpenPOs (
  ProductId int NOT NULL
, PurchQty decimal(8,2) NOT NULL
)

-- insert the cumulated purchase orders
INSERT INTO #OpenPOs
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- products with open purchase orders (table-valued function)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN dbo.#OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

-- drop the table (cleaning)
-- DROP TABLE IF EXISTS dbo.#OpenPOs;

Being created in the tempdb, system database shared by several databases, temporary table's performance relies on tempdb's configuration and workload. Moreover, the concurrent creation of temporary tables from many sessions can lead to tempdb metadata contention, as each session attempts updating metadata information in the system based tables.

Temporary tables are logged, which adds more burden on the database engine, however being able to create indexes on them and use statistics can help processing result sets more efficiently, especially when called multiple times. 

Also, a temporary table might be cached (see [1]) and not deleted when its purpose ends, which allows operations that drop and create the objects to execute very quickly and reduces page allocation contention.

Table Variables

A table variable is a variable of type TABLE and can be used in functions, stored procedures, and batches. The construct is similar to the temp table and is stored as well in the tempdb and cached under certain scenarios, however they are scoped to the batch or routine in which they are defined and destroyed after that. 

-- create the table variable
DECLARE @OpenPOs TABLE (
  ProductId int NOT NULL
, PurchQty decimal(8,2) NOT NULL
)

-- insert the cumulated purchase orders
INSERT INTO @OpenPOs
SELECT POL.ProductId 
, SUM(POL.OrderQty) PurchQty
FROM Purchasing.PurchaseOrderDetail POL
WHERE OrderQty - (ReceivedQty - RejectedQty)>0
GROUP BY POL.ProductId 

-- products with open purchase orders (table variable)
SELECT ITM.ProductNumber
, ITM.Name
, POL.PurchQty
FROM [Production].[Product] ITM
     JOIN @OpenPOs POL
	   ON ITM.ProductId = POL.ProductId
ORDER BY ITM.ProductNumber

Table variables don’t participate in transactions or locking, while DML operations done on them are not logged. There are also no statistics maintained and any data changes impacting the table variable will not cause recompilation. Thus, they are usually faster than temporary variables, especially when their size is small, though their performance depends also on how they are used. On the other side, for big result sets and/or when several calls are involved, a temporary table could prove to be more efficient. 

Important!!! Temporary tables and table variables are means of improving the performance of long-running queries. Being able to move pieces of logic around helps in maintaining the code and it also provides a logical structure of the steps, however they shouldn't be used if the performance gain is not the target! Overusing them as technique can considerably decrease the performance of tempdb, which can have impact in other areas!

Azure Synapse

Moving to Azure Synapse there are several important limitations in what concerns the above (see [4]). Even if some features are supported, further limitations might apply. What's important to note is that materialized views act like indexed view in standard SQL Server and that CETAS (Create External Table as SELECT) are available to import/export data to the supported file formats in Hadoop, Azure storage blob or Azure Data Lake Storage Gen2.

FeatureDedicatedServerlessSQL Server
CTEsYesYesYes (2015+)
Recursive CTEsNoNoYes (2015+)
ViewsYesYesYes
Indexed viewsNoNoYes
Materialized viewsYesNoNo
Table-valued functions (single statement)NoYesYes
Table-valued functions (multi-statement)NoNoYes
Scalar UDFs YesNoYes
TablesYesNoYes
Temporary tables (local)YesLimitedYes
Temporary tables (global)NoNoYes
Table variablesYesYesYes
CETASYesLimitedYes (2022+)

Notes:
1) CETAS have two important limitations in serverless SQL Pool
    a) once the data were exported to a file, they can't be overwritten via the same syntax;
    b) logic based on temporary tables can't be exported via pipelines.
2) Temporary tables can be used to replace cursors (see example).

Previous Post  <<||>>  Next Post

Resources:
[1] Microsoft Learn (2012) Capacity Planning for tempdb (link)
[2] Microsoft Learn (2023) CREATE View (link)
[3] Microsoft Learn (2023) CREATE Function (link)
[4] Microsoft Learn (2023) Transact-SQL features supported in Azure Synapse SQL (link)
[5] Redgate (2018) Choosing Between Table Variables and Temporary Tables (ST011, ST012), by Phil Factor (link)
[6] Microsoft Learn (2023) Create indexed views (link)
[7] Microsoft Learn (2023) CREATE MATERIALIZED VIEW AS SELECT (Transact-SQL) (link)
[8] Microsoft Learn (2023) CETAS with Synapse SQL (link)

19 November 2022

💎🏭SQL Reloaded: Tricks with Strings via STRING_SPLIT, PATINDEX and TRANSLATE

Searching for a list of words within a column can be easily achieved by using the LIKE operator:

-- searching for several words via LIKE (SQL Server 2000+)
SELECT * 
FROM Production.Product 
WHERE Name LIKE '%chain%'
   OR Name LIKE '%lock%'
   OR Name LIKE '%rim%'
   OR Name LIKE '%spindle%'

The search is quite efficient, if on the column is defined an index, a clustered index scan being more likely chosen.

If the list of strings to search upon becomes bigger, the query becomes at least more difficult to maintain. Using regular expressions could be a solution. Unfortunately, SQL Server has its limitations in working with patterns. For example, it doesn't have a REGEXP_LIKE function, which is used something like (not tested):

-- Oracle 
SELECT * 
FROM Production.Product 
WHERE REGEXP_LIKE(lower(Name), 'chain|lock|rim|spindle')

However, there's a PATINDEX function which returns the position of a pattern within a string, and which uses the same wildcards that can be used with the LIKE operator:

-- searching for a value via PATINDEX (SQL Server 2000+)
SELECT * 
FROM [Production].[Product] 
WHERE PATINDEX('%rim%', Name)>0

Even if together with the Name can be provided only one of the values, retrieving the values from a table or a table-valued function (TVF) would do the trick. If the values need to be reused in several places, they can be stored in a table or view. If needed only once, a common table expression is more indicated:

-- filtering for several words via PATHINDEX (SQL Server 2008+)
WITH CTE 
AS (
  -- table from list of values (SQL Server 2008+)
SELECT * FROM (VALUES ('chain') , ('lock') , ('rim') , ('spindle')) DAT(words) ) SELECT * FROM Production.Product PRD WHERE EXISTS ( SELECT * FROM CTE WHERE PATINDEX('%'+ CTE.words +'%', PRD.Name)>0 )

The query should return the same records as above in the first query!

Besides own's UDFs (see SplitListWithIndex or SplitList), starting with SQL Server 2017 can be used the STRING_SPLIT function to return the same values as a TVF:

-- filtering for several words via PATHINDEX & STRING_SPLIT (SQL Server 2017+)
SELECT * 
FROM Production.Product PRD
WHERE EXISTS (
	SELECT *
	FROM STRING_SPLIT('chain|lock|rim|spindle', '|') SPL
	WHERE PATINDEX('%'+ SPL.value +'%', PRD.Name)>0
	)

A dynamic list of values can be built as well. For example, the list of words can be obtained from a table and the STRING_SPLIT function:

-- listing the words appearing in a column (SQL Server 2017+)
SELECT DISTINCT SPL.value
FROM Production.Product PRD
     CROSS APPLY STRING_SPLIT(Name, ' ') SPL
ORDER BY SPL.value

One can remove the special characters, the numeric values, respectively the 1- and 2-letters words:

-- listing the words appearing in a column (SQL Server 2017+)
SELECT DISTINCT SPL.value
FROM Production.Product PRD
     CROSS APPLY STRING_SPLIT(Replace(Replace(Replace(Replace(Name, '-', ' '), ',', ' '), '/', ' '), '''', ' '), ' ') SPL
WHERE IsNumeric(SPL.value) = 0 -- removing numbers
  AND Len(SPL.value)>2 -- removing single/double letters
ORDER BY SPL.value

The output looks better, though the more complex the text, the more replacements need to be made. An alternative to a UDF (see ReplaceSpecialChars) is the TRANSLATE function, which replaces a list of characters with another. One needs to be careful and have a 1:1 mapping, the REPLICATE function doing the trick:

-- replacing special characters via TRANSLATE (SQL Server 2017+)
SELECT TRANSLATE(Name, '-,/''', Replicate(' ', 4))
FROM Production.Product PRD

Now the query becomes:

-- listing the words appearing in a column using TRANSLATE (SQL Server 2017+)
SELECT DISTINCT SPL.value
FROM Production.Product PRD
     CROSS APPLY STRING_SPLIT(TRANSLATE(Name, '-,/''', Replicate(' ', 4)), ' ') SPL
WHERE IsNumeric(SPL.value) = 0 -- removing numbers
  AND Len(SPL.value)>2 -- removing single/double letters
ORDER BY SPL.value

Notes:
1) The SQL Server-based queries work also in a SQL databases in Microsoft Fabric. Just replace the Production with SalesLT schema (see post, respectively GitHub repository with the changed code).

Happy coding!

18 October 2022

💎SQL Reloaded: Successive Price Increases/Discounts via Windowing Functions and CTEs

I was trying today to solve a problem that apparently requires recursive common table expressions, though they are not (yet) available in Azure Synapse serverless SQL pool. The problem can be summarized in the below table definition, in which given a set of Products with an initial Sales price, is needed to apply Price Increases successively for each Cycle. The cumulated increase is simulated in the last column for each line. 

Unfortunately, there is no SQL Server windowing function that allows multiplying incrementally the values of a column (similar as the running total works). However, there’s a mathematical trick that can be used to transform a product into a sum of elements by applying the Exp (exponential) and Log (logarithm) functions (see Solution 1), and which frankly is more elegant than applying CTEs (see Solution 2). 

-- create table with test data
SELECT *
INTO dbo.ItemPrices
FROM (VALUES ('ID001', 1000, 1, 1.02, '1.02')
, ('ID001', 1000, 2, 1.03, '1.02*1.03')
, ('ID001', 1000, 3, 1.03, '1.02*1.03*1.03')
, ('ID001', 1000, 4, 1.04, '1.02*1.03*1.03*1.04')
, ('ID002', 100, 1, 1.02, '1.02')
, ('ID002', 100, 2, 1.03, '1.02*1.03')
, ('ID002', 100, 3, 1.04, '1.02*1.03*1.04')
, ('ID002', 100, 4, 1.05, '1.02*1.03*1.04*1.05')
) DAT (ItemId, SalesPrice, Cycle, PriceIncrease, CumulatedIncrease)

-- reviewing the data
SELECT *
FROM dbo.ItemPrices

-- Solution 1: new sales prices with log & exp
SELECT ItemId
, SalesPrice
, Cycle
, PriceIncrease
, EXP(SUM(Log(PriceIncrease)) OVER(PARTITION BY Itemid ORDER BY Cycle)) CumulatedIncrease
, SalesPrice * EXP(SUM(Log(PriceIncrease)) OVER(PARTITION BY Itemid ORDER BY Cycle)) NewSalesPrice
FROM dbo.ItemPrices

-- Solution 2: new sales prices with recursive CTE
;WITH CTE 
AS (
-- initial record
SELECT ITP.ItemId
, ITP.SalesPrice
, ITP.Cycle
, ITP.PriceIncrease
, cast(ITP.PriceIncrease as decimal(38,6)) CumulatedIncrease
FROM dbo.ItemPrices ITP
WHERE ITP.Cycle = 1
UNION ALL
-- recursice part
SELECT ITP.ItemId
, ITP.SalesPrice
, ITP.Cycle
, ITP.PriceIncrease
, Cast(ITP.PriceIncrease * ITO.CumulatedIncrease as decimal(38,6))  CumulatedIncrease
FROM dbo.ItemPrices ITP
    JOIN CTE ITO
	  ON ITP.ItemId = ITO.ItemId
	 AND ITP.Cycle-1 = ITO.Cycle
)
-- final result
SELECT ItemId
, SalesPrice
, Cycle
, PriceIncrease
, CumulatedIncrease
, SalesPrice * CumulatedIncrease NewSalesPrice
FROM CTE
ORDER BY ItemId
, Cycle


-- validating the cumulated price increases (only last ones)
SELECT 1.02*1.03*1.03*1.04 
, 1.02*1.03*1.04*1.05

-- cleaning up
DROP TABLE IF EXISTS dbo.ItemPrices

Notes:
1. The logarithm in SQL Server’s implementation works only with positive numbers!
2. For simplification I transformed percentages (e.g. 1%) in values that are easier to multitply with (e.g. 1.01). The solution can be easily modified to consider discounts.
3. When CTEs are not available, one is forced to return to the methods used in SQL Server 2000 (I've been there) and thus use temporary tables or table variables with loops. Moreover, the logic can be encapsulated in multi-statement table-valued functions (see example), unfortunately, another feature not (yet) supported by serverless SQL pools. 
4. Unfortunately, STRING_AGG, which concatenates values across rows, works only with a GROUP BY clause. Anyway, its result is useless without the availability of a Eval function in SQL (see example), however the Expr function available in data flows could be used as workaround.
4. Even if I thought about the use of logarithms for transforming the product into a sum, I initially ignored the idea, thinking that the solution would be too complex to implement. So, the credit goes to another blogpost. Kudos!
5. The queries work also in a SQL databases in Microsoft Fabric. Just replace the Sales with SalesLT schema (see post, respectively GitHub repository with the changed code).

Happy coding!

13 September 2020

🎓Knowledge Management: Definitions II (What's in a Name)

Knowledge Management

Browsing through the various books on databases and programming appeared over the past 20-30 years, it’s probably hard not to notice the differences between the definitions given even for straightforward and basic concepts like the ones of view, stored procedure or function. Quite often the definitions lack precision and rigor, are circular and barely differentiate the defined term (aka concept) from other terms. In addition, probably in the attempt of making the definitions concise, important definitory characteristics are omitted.

Unfortunately, the same can be said about other non-scientific books, where the lack of appropriate definitions make the understanding of the content and presented concepts more difficult. Even if the reader can arrive in time to an approximate understanding of what is meant, one might have the feeling that builds castles in the air as long there is no solid basis to build upon – and that should be the purpose of a definition – to offer the foundation on which the reader can build upon. Especially for the readers coming from the scientific areas this lack of appropriateness and moreover, the lack of definitions, feels maybe more important than for the professional who already mastered the respective areas.

In general, a definition of a term is a well-defined descriptive statement which serves to differentiate it from related concepts. A well-defined definition should be meaningful, explicit, concise, precise, non-circular, distinct, context-dependent, relevant, rigorous, and rooted in common sense. In addition, each definition needs to be consistent through all the content and when possible, consistent with the other definitions provided. Ideally the definitions should cover as much of possible from the needed foundation and provide a unitary consistent multilayered non-circular and hierarchical structure that facilitates the reading and understanding of the given material.

Thus, one can consider the following requirements for a definition:

Meaningful: the description should be worthwhile and convey the required meaning for understanding the concept.

Explicit: the description must state clearly and provide enough information/detail so it can leave no room for confusion or doubt.

Context-dependent: the description should provide upon case the context in which the term is defined.

Concise: the description should be as succinct as possible – obtaining the maximum of understanding from a minimum of words.

Precise: the description should be made using unambiguous words that provide the appropriate meaning individually and as a whole.

Intrinsic non-circularity: requires that the term defined should not be used as basis for definitions, leading thus to trivial definitions like “A is A”.

Distinct: the description should provide enough detail to differentiate the term from other similar others.

Relevant: the description should be closely connected or appropriate to what is being discussed or presented.

Rigorous: the descriptions should be the result of a thorough and careful thought process in which the multiple usages and forms are considered.  

Extrinsic non-circularity: requires that the definitions of two distinct terms should not be circular (e.g. term A’s definition is based on B, while B’s definition is based on A), situation usually met occasionally in dictionaries.

Rooted in common sense: the description should not deviate from the common-sense acceptance of the terms used, typically resulted from socially constructed or dictionary-based definitions.

Unitary consistent multilayered hierarchical structure: the definitions should be given in an evolutive structure that facilitates learning, typically in the order in which the concepts need to be introduced without requiring big jumps in understanding. Even if concepts have in general a networked structure, hierarchies can be determined, especially based on the way concepts use other concepts in their definitions. In addition, the definitions must be consistent – hold together – respectively be unitary – form a whole.

21 May 2020

💎🏭SQL Reloaded: Functions Useful in Data Migrations

Left and Right-Padding

Oracle has the LPAD and RPAD functions which return and expression, left-padded, respectively right-padded with the specified characters. The functions are useful in formatting unique identifiers to fit a certain format (e.g. 0000012345 instead of 12345). Here’re similar implementations of the respective functions:

-- left padding function 
CREATE FUNCTION dbo.LeftPadding( 
  @str varchar(50) 
, @length int  
, @padchar varchar(1))
RETURNS varchar(50)
AS 
BEGIN 
  RETURN CASE  
    WHEN LEN(@str)<@length THEN Replicate(@padchar, @length-LEN(@str)) + @str       
    ELSE @str  
    END 
END  

-- example left padding 
SELECT dbo.LeftPadding('12345', '10', '0')

-- right padding function 
CREATE FUNCTION dbo.RightPadding( 
  @str varchar(50) 
, @length int  
, @padchar varchar(1))
RETURNS varchar(50)
AS 
BEGIN 
  RETURN CASE  
    WHEN LEN(@str)<@length THEN @str + Replicate(@padchar, @length-LEN(@str))      
	ELSE @str  
  END
END  

-- example right padding 
SELECT dbo.RightPadding('12345', '10', '0')

Left and Right Side

When multiple pieces of data are stored within same same attribute, it’s helpful to get the left, respectively the right part of the string based on a given delimiter, where the reverse flag tells the directions in which the left is applied:

-- left part function 
CREATE FUNCTION dbo.CutLeft( 
  @str varchar(max) 
, @delimiter varchar(1)
, @reverse bit = 0)
RETURNS varchar(max)
AS 
BEGIN 
  RETURN CASE  
     WHEN CharIndex(@delimiter, @str)>0 THEN 
       CASE @reverse 
         WHEN 0 THEN Left(@str, CharIndex(@delimiter, @str)-1)    
         ELSE Left(@str, Len(@str)-CharIndex(@delimiter, Reverse(@str))) 
       END
	 ELSE @str  
  END
END  

-- example left part 
SELECT dbo.CutLeft('12345,045,000', ',', 0)
, dbo.CutLeft('12345,045,000', ',', 1)



-- right part function 
CREATE FUNCTION dbo.CutRight( 
  @str varchar(max) 
, @delimiter varchar(1)
, @reverse bit = 0)
RETURNS varchar(max)
AS 
BEGIN 
  RETURN CASE  
    WHEN CharIndex(@delimiter, @str)>0 THEN 
      CASE @reverse 
        WHEN 0 THEN Right(@str, CharIndex(@delimiter, Reverse(@str))-1) 
        ELSE Right(@str, Len(@str)-CharIndex(@delimiter, @str)) 
      END
   ELSE @str  
  END
END  

-- example right part 
SELECT dbo.CutRight('12345,045,000', ',', 0)
, dbo.CutRight('12345,045,000', ',', 1)

Replacing Special Characters

Special characters can prove to be undesirable in certain scenarios (e.g. matching values, searching). Until a “Replace from” function will be made available, the solution is to include the replacements in a user-defined function similar with the one below:

DROP FUNCTION IF EXISTS [dbo].[ReplaceSpecialChars]

CREATE FUNCTION [dbo].[ReplaceSpecialChars](
@string nvarchar(max)
, @replacer as nvarchar(1) = '-'
) RETURNS nvarchar(max)
-- replaces special characters with a given character (e.g. an empty string, space)
AS
BEGIN   
  IF CharIndex('*', @string) > 0  
     SET @string = replace(@string, '*', @replacer)	
    
  IF CharIndex('#', @string) > 0  
     SET @string = replace(@string, '#', @replacer)	
    
  IF CharIndex('$', @string) > 0  
     SET @string = replace(@string, '$', @replacer)	
    
  IF CharIndex('%', @string) > 0  
     SET @string = replace(@string, '%', @replacer)	
    
  IF CharIndex('&', @string) > 0  
     SET @string = replace(@string, '&', @replacer)	
    
  IF CharIndex(';', @string) > 0  
     SET @string = replace(@string, ';', @replacer)	
    
  IF CharIndex('/', @string) > 0  
     SET @string = replace(@string, '/', @replacer)	
    
  IF CharIndex('?', @string) > 0  
     SET @string = replace(@string, '?', @replacer)	
    
  IF CharIndex('\', @string) > 0  
     SET @string = replace(@string, '\', @replacer)	
    
  IF CharIndex('(', @string) > 0  
     SET @string = replace(@string, '(', @replacer)	
    
  IF CharIndex(')', @string) > 0  
     SET @string = replace(@string, ')', @replacer)	
    
  IF CharIndex('|', @string) > 0  
     SET @string = replace(@string, '|', @replacer)	
    
  IF CharIndex('{', @string) > 0  
     SET @string = replace(@string, '{', @replacer)	
    
  IF CharIndex('}', @string) > 0  
     SET @string = replace(@string, '}', @replacer)	
    
  IF CharIndex('[', @string) > 0  
     SET @string = replace(@string, '[', @replacer)	
    
  IF CharIndex(']', @string) > 0  
     SET @string = replace(@string, ']', @replacer)	
                                
  RETURN (LTrim(RTrim(@string)))
END

SELECT [dbo].[ReplaceSpecialChars]('1*2#3$4%5&6;7/8?9\10(11)12|13{14}15[16]', '')
SELECT [dbo].[ReplaceSpecialChars]('1*2#3$4%5&6;7/8?9\10(11)12|13{14}15[16]', ' ')

Other type of special characters are the umlauts (e.g. ä, ß, ö, ü from German language):

DROP FUNCTION IF EXISTS [dbo].[ReplaceUmlauts]

CREATE FUNCTION [dbo].[ReplaceUmlauts](
@string nvarchar(max)
) RETURNS nvarchar(max)
-- replaces umlauts with their equivalent
AS
BEGIN   
  IF CharIndex('ä', @string) > 0  
     SET @string = replace(@string, 'ä', 'ae')	
    
  IF CharIndex('ö', @string) > 0  
     SET @string = replace(@string, 'ö', 'oe')	
    
  IF CharIndex('ß', @string) > 0  
     SET @string = replace(@string, 'ß', 'ss')	
    
  IF CharIndex('%', @string) > 0  
     SET @string = replace(@string, 'ü', 'ue')	
    
                                
  RETURN Trim(@string)
END

SELECT [dbo].[ReplaceUmlauts]('Hr Schrötter trinkt ein heißes Getränk')

Handling Umlauts

Another type of specials characters are the letter specific to certain languages that deviate from the Latin characters (e.g. umlauts in German, accents in French), the usage of such characters introducing further challenges in handling the characters, especially when converting the data between characters sets. A common scenario is the one in which umlauts in ISO-8891-1 are encoded using two character sets. Probably the easiest way to handle such characters is to write a function as follows:

-- create the function 
CREATE FUNCTION [dbo].[ReplaceCodes2Umlauts](
  @string nvarchar(max)
) RETURNS nvarchar(max)
-- replaces ISO 8859-1 characters with the corresponding encoding 
AS
BEGIN   
  IF CharIndex('ß', @string) > 0  
     SET @string = replace(@string, 'ß', 'ß')	

  IF CharIndex('ö', @string) > 0  
     SET @string = replace(@string, 'ö', 'ö')	

  IF CharIndex('Ö', @string) > 0  
     SET @string = replace(@string, 'Ö', 'Ö')	
    
  IF CharIndex('ü', @string) > 0  
     SET @string = replace(@string, 'ü', 'ü')	

  IF CharIndex('Ü', @string) > 0  
     SET @string = replace(@string, 'Ãœ', 'Ü')	

  IF CharIndex('ä', @string) > 0  
     SET @string = replace(@string, 'ä', 'ä')

  IF CharIndex('Ä', @string) > 0  
     SET @string = replace(@string, 'Ä', 'Ä')
                                
  RETURN (LTrim(RTrim(@string)))
END


--test the function 
SELECT [dbo].[ReplaceCodes2Umlauts]('Falsches Ãœben von Xylophonmusik quält jeden größeren Zwerg')

In the inverse scenario, at least for the German language is possible to replace the umlauts with a set the corresponding transliterations:

-- drop the function 
DROP FUNCTION IF EXISTS [dbo].[ReplaceUmlauts]

-- create the function 
CREATE FUNCTION [dbo].[ReplaceUmlauts](
  @string nvarchar(max)
) RETURNS nvarchar(max)
-- replaces umlauts with corresponding transliterations 
AS
BEGIN   
  IF CharIndex('ß', @string) > 0  
     SET @string = replace(@string, 'ß', 'ss')	

  IF CharIndex('ü', @string) > 0  
     SET @string = replace(@string, 'ü', 'ue')	

  IF CharIndex('ä', @string) > 0  
     SET @string = replace(@string, 'ä', 'ae')	
    
  IF CharIndex('ö', @string) > 0  
     SET @string = replace(@string, 'ö', 'oe')	
                                
  RETURN (LTrim(RTrim(@string)))
END

--test the function 
SELECT [dbo].[ReplaceUmlauts]('Falsches üben von Xylophonmusik quält jeden größeren Zwerg')

Notes:
The queries work also in SQL databases in Microsoft Fabric.

Happy coding!

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.