27 January 2025

🗄️🗒️Data Management: Data Quality Dimensions [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. 

Last updated: 27-Jan-2025

[Data Management] Data quality dimensions

  • {def} features of data that can be measured or assessed against defined standards to determine the quality of data
    • captures a specific aspect of general data quality
      • can refer to data values or to their schema
  • {type} hard dimensions
    • dimensions that can be measured
  • {type} soft dimensions
    • dimensions that can be measured only indirectly
      • ⇐ through interviews with data users or through any other kind of communication with users
    • dimensions whose measurement depends on the perception of the users of the data
  • {dimension} uniqueness [post]
    • the degree to which a value or set of values is unique within a dataset
      • can be determined based on a set of values supposed to be unique across the whole dataset
        • some systems have an artificial, respectively a natural, unique identifier
      • measured in terms of either (see the T-SQL sketch at the end of this list)
        • the percentage of unique values available in a dataset 
        • the percentage of duplicate values available in a dataset
    • the impossibility of identifying whether a value is unique increases the chances for it to be duplicated
    • it can have broader implications
      • aggregated information is not shown correctly
        • ⇐ split across different entities
      • can lead to further duplicates in other areas
    • {recommendation} enforce uniqueness by design, if possible
    • {recommendation} check the data regularly for duplicates and disable or delete the duplicated records
      • ⇐ one should make sure that the records can't be further reused in business processes or analytics workloads
  • {dimension} completeness [post]
    • the extent to which data is missing in a dataset
      • ⇐ reflected in the number of missing values
        • measured as the percentage of missing values relative to the total number of values
    • determined by the presence of NULL values  
      • {type} attribute completeness 
        • the number of NULLs in a specific attribute
      • {type} tuple completeness 
        • the number of unknown values of the attributes in a tuple
      • {type} relation completeness 
        • the number of tuples with unknown attribute values in the relation
      • {type} value completeness
        • makes sense for complex, semi-structured columns such as XML data type columns
          • e.g. a complete element or attribute can be missing
    • considered in relation to 
      • mandatory attributes
        • attributes that need a non-NULL value for each record
      • optional attributes
        • attributes that don't necessarily need to be provided
      • inapplicable attributes
        • attributes not applicable (relevant) for certain scenarios by design
  • {dimension} conformity (aka format compliance) [post]
    • {def} the extent data are in the expected format
      • dependent on the data type and its definition
    • can be associated with a set of metadata 
      • data type
        • e.g. text, numeric, alphanumeric, positive, date
      • length
      • precision
      • scale
      • formatting patterns 
        • e.g. phone number, decimal and digit grouping symbols
        • different formatting might apply based on various business rules 
        • can use delimiters
    • {recommendation} define the data type and further constraints to enforce the various characteristics of the element
    • {recommendation} make sure that the delimiters don't overlap with other uses
  • {dimension} accuracy [post]
    • {def} the extent data is correct, respectively matches reality with an acceptable level of approximation
    • stricter than just conforming to business rules
    • can be measured at column and table level
      • [discrete data values] 
        • use frequency distribution of values
          • a value with very low frequency is probably incorrect
      • [alphanumeric values]
        • use string length distribution
          • a string with a very atypical length is potentially incorrect
        • try to find patterns and then create a pattern distribution
          • patterns with low frequency probably denote wrong values
      • [continuous attributes]
        • use descriptive statistics
          • just by looking at minimal and maximal values, you can easily spot potentially problematic data
  • {dimension} consistency [post]
    • {def} the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component
      • {type} notational consistency
        • the extent (data) values are consistent in notation
      • {type} semantic consistency
        • the degree to which data has unique meaning
        • is more restrictive than the notational consistency
    • measures the equivalence of information stored in various repositories
    • involves comparing values with a predefined set of possible values
      •  from the same or from different systems
    • can be measured at column and table level
    • can have different scopes
      • cross-system consistencies
        • among systems or data repositories
      • cross-record consistency
        • within the same repository
      • temporal consistency
        •  within the same record at different points in time
  • {dimension} timeliness [post]
    • tells the degree to which data is current and available when needed
      • there is always some delay between change in the real world and the moment when this change is entered into a system
    • data that is no longer current is referred to as stale or obsolete data
  • {dimension} structuredness [post]
    • the degree to which a data structure or model possesses a definite pattern of organization of its interdependent parts
    • allows the categorization of data as
      • structured data [def]
        • refers to data whose structure can be easily perceived or is known upfront, leaving no doubt about its delimitations
      • unstructured data [def]
        • refers to textual data and media content (video, sound, images), in which the structural patterns, even if they exist, are hard to discover or not predefined
      • semi-structured data [def]
        • refers to islands of structured data stored with unstructured data, or vice versa
      • ⇐ the more structured the data, the easier it is to process
  • {dimension} referential integrity [post]
    • {def} the degree to which the values of a key in one table (aka reference value) match the values of a key in a related table (aka the referenced value)
    • it's an architectural concept of the database 
    • {recommendation} keep the referential integrity of a system by design
      • some systems build logic for assuring the referential integrity in the applications and not in the database
  • {dimension} currency (aka actuality)
    • the extent to which data is actual (current)
    • can be considered as a special type of accuracy 
      • ⇐ when the data is not actual, it doesn’t reflect reality anymore
  • {dimension} ease of use
    • the extent to which data can be used for a given purpose
      • usually it refers to whether the data can be processed as needed
      • depends on the application or on the user interface
  • {dimension} fitness of use
    • the degree to which the data is fit for use
      • the data may have good quality for a given purpose but 
        • not be usable for other purposes
        • can be used as a substitute for other data
          • e.g. use phone area codes instead of ZIP codes to locate customers approximately
  • {dimension} trustfulness [post]
    • the degree to which the data can be trusted
      • is a matter of perception 
        • ask users whether they trust the data and which are the reasons
          • if the users don’t trust the data 
            • they will create their own solutions
            • they will not use applications
  • {dimension} entropy
    • {def} the average amount of information conveyed
      • ⇐ quantification of information in a system
      • ⇐ the more dispersed the values and the more the frequency distribution of a discrete column is equally spread among the values, the more information is available [1]
      • ⇐ can tell whether your data is suitable for analysis or not
    • can be measured at column and table level
  • {dimension} presentation quality 
    • applicable to applications that present data
      • format and appearance should support the appropriate use of data
      • depends on the UI used
  • {recommendation} have a dedicated system for maintaining the master data and broadcast the data to the subscribers as needed
    • the data should be exclusively managed through the management system
    • {anti-pattern} data is modified in the subscribers and the changes aren't always reflected back to the source system
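The hard dimensions can typically be measured directly with queries against the dataset. Below is a minimal T-SQL sketch for the uniqueness and completeness dimensions, based on a hypothetical dbo.Customers table with EmailAddress, PhoneNumber and BirthDate attributes (the names are used only for illustration and need to be adapted to the actual data model):

-- sketch: uniqueness - percentage of records with duplicated email addresses
SELECT 100.0 * SUM(CASE WHEN CNT.number_records > 1 THEN CNT.number_records ELSE 0 END) 
    / NULLIF(SUM(CNT.number_records), 0) duplicate_percentage
FROM (
    SELECT EmailAddress
    , COUNT(*) number_records
    FROM dbo.Customers
    GROUP BY EmailAddress
) CNT;

-- sketch: completeness - percentage of missing (NULL) values per attribute
SELECT 100.0 * SUM(CASE WHEN PhoneNumber IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0) missing_phone_percentage
, 100.0 * SUM(CASE WHEN BirthDate IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0) missing_birthdate_percentage
FROM dbo.Customers;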
Previous Post <<||>>  Next Post

References:
[1] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

26 January 2025

🧭Business Intelligence: Perspectives (Part XXV: Grounding the Roots)

Business Intelligence Series

When building something that is supposed to last, one needs a solid foundation on which the artifact can be built. That’s valid for castles, houses, IT architectures, and, probably most importantly, for BI infrastructures. There are so many tools out there that allow building a dashboard, report or other types of BI artifacts with a few drag-and-drops, moving things around, adding formatting and shiny things. In many cases all these steps are followed to create a prototype for a set of ideas or more formalized requirements, keeping the overall process to a minimum. 

Rapid prototyping, the process of building a proof-of-concept by focusing at a high level on the most important design and functional aspects, is helpful and sometimes a mandatory step in eliciting and addressing the requirements properly. It provides a fast road from an idea to the actual concept; however, the prototype, still in its early stages, can rapidly become the actual solution that unfortunately continues to haunt the dreams of its creator(s). 

Especially in the BI area, there are many solutions that started as a prototype and gained mass until they started to disturb many things around them, with implications for security, performance, data quality, and many other aspects. Moreover, the mass becomes critical in time, to the degree that it pulls more attention and effort than intended, with positive and negative impact altogether. It’s like building an artificial sun that suddenly becomes a danger for the nearby planet(s) and other celestial bodies. 

When building such artifacts, it’s important to define which goals the end-result must meet and which would only be nice to have, differentiating clearly between them, respectively when it is time to stop and properly address the aspects mandatory in transitioning from the prototype to an actual solution that follows the best practices in scope. It’s also the point when one should decide upon the solution’s feasibility, the needed quality acceptance criteria, and broader aspects like supporting processes, human resources, data, and the various aspects that have an impact. Unfortunately, many solutions gain inertia without the proper foundation and in extremis succumb under the various forces.

Developing software artifacts of any type is a balancing act between all these aspects, often under suboptimal circumstances. Therefore, one must be able to set priorities right, react and change direction (and gear) according to the changing context. Many wish all this to be a straight sequential road, when in reality it looks more like mountain climbing, with many peaks, valleys and change of scenery. The more exploration is needed, the slower the progress.

All these aspects require additional time, effort, resources and planning, which can easily increase the overall complexity of projects to the degree that it leads to (exponential) effort and, more importantly, waste. Moreover, the complexity pushes back, leading to more effort, and with it to higher costs. On top of this comes the iterative character of BI topics, multiple iterations being needed from the initial concept to the final solution(s); sometimes many steps are discarded in the process, corners are cut, with all the further implications following from this. 

Somewhere in the middle, between the minimum and the broad, overextending complexity, lies the sweet spot that drives the most impact with a minimum of effort. For some organizations, respectively professionals, reaching and remaining in that zone will be quite a challenge, though it’s not impossible. It’s important to be aware of all the aspects that drive and sustain the quality of artifacts, data and processes. There’s a lot to learn from successful as well as from failed endeavors, and the various aspects should be reflected in the lessons learned. 

24 January 2025

🧭Business Intelligence: Perspectives (Part XXIV: Building Castles in the Air)

Business Intelligence Series

Business users have mainly three means of visualizing data – reports, dashboards and more recently notebooks, the latter being a mix between reports and dashboards. Given that all three types of display can be a mix of tabular representations and visuals/visualizations, the difference between them is often neglectable to the degree that the terms are used interchangeably. 

For example, in Power BI a report is a "multi-perspective view into a single semantic model, with visualizations that represent different findings and insights from that semantic model" [1], while a dashboard is "a single page, often called a canvas, that uses visualizations to tell a story" [1], a dashboard’s visuals coming from one or more reports [2]. Despite this clear delimitation, the two concepts continue to be mixed and misused in conversations even by data-related professionals. This happens also because in other tools the vendors designate as dashboards what is called a report in Power BI. 

Given the limited terminology, it’s easy to generalize that dashboards are useless, poorly designed, bad for business users, and so on. As Stephen Few recognized almost two decades ago, "most dashboards fail to communicate efficiently and effectively, not because of inadequate technology (at least not primarily), but because of poorly designed implementations" [3]. Therefore, when people say that "dashboards are bad", they refer to the result of poor implementations, of which some of them were part, which frankly is a different topic! Unfortunately, BI implementations reflect probably more than any other area how easy it is to fail!

Frankly, it is not necessarily the poor implementation of a project management methodology that is at fault here, which quite often happens, but the way requirements are defined, understood, documented and implemented. Even if these last aspects are part of the methodologies, they are merely a reflection of how people understand the business. The outcomes of BI implementations are rooted in other areas, and it starts with how the strategic goals and objectives are defined, and how the elements that need oversight are considered in the broader perspectives. The dashboards thus become the end-result of a chain of failures, of failing to build the business-related foundation on which the reporting infrastructure should be based. It’s so much easier to shift the blame to what’s perceptible than to what’s missing!

Many dashboards are built because people need a sense of what’s happening in the business. It starts with some ideas based on the problems identified in organizations, one or more dashboards are built, and sometimes a lot of time is invested in the process. Then, some important progress is made, and everything comes to a standstill if the numbers don’t reveal something new, important, or whatever the users’ perception is. Some might regard this as failure, though as long as the initial objectives were met, something was learned in the process and a difference was made, one can’t equate this with failure!

It’s more important to recognize the temporary character of dashboards, respectively of the requirements that lead to them and are built around them. Of course, this requires occasionally a different approach to the whole topic. It starts with how KPIs and other business metrics are defined and organized, respectively with how data repositories are built, and it ends with how data are visualized and reported.

As practice often revealed, it’s possible to build castles in the air, without a solid foundation, though the expectation for such edifices to sustain the weight of businesses is unrealistic. Such edifices break with the first strong storm, and unfortunately it's easier to blame a set of tools, some people or a whole department instead of looking critically at the whole organization!


References:
[1] Microsoft Learn (2024) Power BI: Glossary [link]
[2] Microsoft Learn (2024) Power BI: Dashboards for business users of the Power BI service [link]
[3] Stephen Few (2006) Information Dashboard Design

23 January 2025

💎SQL Reloaded: Number of Records VI (via sp_MSForEachTable Undocumented Stored Procedure)

Starting with SQL Server 2000 it's possible to execute a command via the undocumented stored procedure sp_MSForEachTable for each table available in a database, respectively for subsets of the tables. In a previous post I showed how the stored procedure can be used in several scenarios, including how to get the total number of records in each set of tables. However, the code used generates a result set for each table, which makes it difficult to aggregate the information for further processing. In many scenarios, it would be useful to store the result as a temporary or even persisted table.

-- dropping the tables
DROP TABLE IF EXISTS #Tables
DROP TABLE IF EXISTS #TablesRecordCount

-- create a temporary table to store the input list
SELECT TableName
INTO #Tables 
FROM (VALUES ('Person.Address')
, ('Person.AddressType')
, ('Person.BusinessEntity')) DAT(TableName)


-- create a temporary table to store the results
CREATE TABLE dbo.#TablesRecordCount (
  table_name nvarchar(150) NOT NULL
, number_records bigint
, run_date datetime2(0)
, comment nvarchar(255)
)

-- getting the number of records for the list of tables into the result table
INSERT INTO #TablesRecordCount
EXEC sp_MSForEachTable @command1='SELECT ''?'' [Table], COUNT(*) number_records, GetDate() run_date, ''testing round 1'' comment FROM ?'
, @whereand = ' And Object_id In (Select Object_id(TableName) FROM #Tables)'

-- reviewing the result
SELECT *
FROM #TablesRecordCount
ORDER BY number_records DESC

The above solution uses two temporary tables, though it can be easily adapted to persist the result in a standard table: just replace the "#" with the schema part (e.g. "dbo."). This can be useful in troubleshooting scenarios, when the code is run at different points in time, eventually for different sets of tables. 
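For example, a variant that persists the results could look as follows (the dbo.TablesRecordCount name is only for illustration; the table is created once and reused for subsequent runs, while the #Tables list from above provides the input):

-- create a persisted table to store the results (one-time step)
CREATE TABLE dbo.TablesRecordCount (
  table_name nvarchar(150) NOT NULL
, number_records bigint
, run_date datetime2(0)
, comment nvarchar(255)
);

-- getting the number of records for the list of tables into the persisted table
INSERT INTO dbo.TablesRecordCount
EXEC sp_MSForEachTable @command1='SELECT ''?'' [Table], COUNT(*) number_records, GetDate() run_date, ''testing round 2'' comment FROM ?'
, @whereand = ' And Object_id In (Select Object_id(TableName) FROM #Tables)'

-- reviewing the results across runs
SELECT *
FROM dbo.TablesRecordCount
ORDER BY run_date DESC
, number_records DESC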

The code is pretty simple and can be extended as needed. Unfortunately, there's no guarantee that the sp_MSForEachTable stored procedure will be supported in future versions of SQL Server. For example, the stored procedure is not available in SQL databases, respectively in Fabric warehouses. In SQL databases the following error is thrown:

"Msg 2812, Level 16, State 62, Line 1, Could not find stored procedure 'sys.sp_MSForEachTable'."

To test whether the feature works in your environment, it's enough to run a call to the respective stored procedure:

-- retrieve the record count for all tables
EXEC sp_MSForEachTable @command1='SELECT ''?'' [Table], COUNT(*) number_records FROM ?'

Or, you can check whether it works for one table (replace the Person.AddressType table with one from your environment):

-- getting the number of records for the list of tables into another table
EXEC sp_MSForEachTable @command1='SELECT ''?'' [Table], COUNT(*) number_records FROM ?'
, @whereand = ' And Object_id = Object_id(''Person.AddressType'')'

The solution could prove to be useful in multiple scenarios, though one should consider also the risk of being forced to rewrite the code when the used stored procedure becomes unavailable. Even if it takes more time to write, a solution based on cursors can be more feasible (see previous post).

Update 29-Jan-2025: Probably, despite their usefulness, the undocumented features will not be brought to SQL databases (see [1], 47:30). So, be careful about using the respective features as standard solutions in production environments!

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Reactor (2025) Ask The Expert - Fabric Edition - Fabric Databases [link]

💎🏭SQL Reloaded: Number of Records V (via Cursors, a Solution for Warehouses in Microsoft Fabric)

After deploying the sample warehouse available in Microsoft Fabric, I tried to check the number of records available in the deployed tables under the dbo schema. Surprisingly, the sys.partitions.rows column has 0 values for all the tables associated with the respective schema (see post). 

There are only a few tables available, and taking a record count for each table should be enough, which is relatively simple with the undocumented sp_MSForEachTable. Unfortunately, this approach doesn't work either, so one needs to revert to the use of old-fashioned cursors (as I used to do in SQL Server 2000):

-- number of records via cursor
DECLARE @table_name nvarchar(150)
DECLARE @sql nvarchar(250)
DECLARE @number_records bigint 
DECLARE @number_tables int, @iterator int

DROP TABLE IF EXISTS dbo.#tables;

CREATE TABLE dbo.#tables (
  ranking int NOT NULL
, table_name nvarchar(150) NOT NULL
, number_records bigint
)

INSERT INTO #tables
SELECT row_number() OVER(ORDER BY object_id) ranking
, concat(schema_name(schema_id),'.', name) table_name
, NULL number_records
FROM sys.tables obj
WHERE obj.schema_id = schema_id('dbo')
ORDER BY table_name

SET @iterator = 1
SET @number_tables = IsNull((SELECT count(*) FROM #tables), 0)

WHILE (@iterator <= @number_tables)
BEGIN 
    SET @table_name = (SELECT table_name FROM #tables WHERE ranking = @iterator)
    SET @sql = CONCAT(N'SELECT @NumberRecords = count(*) FROM ', @table_name)

    BEGIN TRY
        -- get the number of records
        EXEC sp_executesql @Query = @sql
        , @params = N'@NumberRecords bigint OUTPUT'
        , @NumberRecords = @number_records OUTPUT

        IF IsNull(@number_records, 0) > 0  
        BEGIN
            SET @sql = 'UPDATE #tables' 
                + ' SET number_records = ' + Str(@number_records)
                + ' WHERE table_name = ''' + @table_name + '''';

            EXEC(@sql)
        END 
    END TRY
    BEGIN CATCH  
        -- no action needed in case of error
    END CATCH;

    SET @iterator = @iterator + 1
END

SELECT *
FROM dbo.#tables;

--DROP TABLE IF EXISTS dbo.#tables;
Results:
ranking table_name number_records
1 dbo.Date 5844
2 dbo.Geography 305179
3 dbo.HackneyLicense 42958
4 dbo.Time 86400
5 dbo.Weather 526330
6 dbo.Trip 2838927
7 dbo.Medallion 13668

Comments:
1) It's a lot of code for a simple task, though the code can be easily duplicated and adapted for similar requirements. Unfortunately, it can lead in time also to many instances of the same code. When possible, one should consider encapsulating the logic in a stored procedure (see the sketch after these comments). 
2) It's usually a good idea to check how many records are available in the tables used for testing, as this can impact queries' performance and tables' appropriateness for the tests performed. Moreover, it's a good idea to understand the volume of data when taking over or working with a database. 
3) If one removes the row_number function, the code should run also in SQL Server 2000. Similar solutions were used then for retrieving the record count.
4) Microsoft recommends not dropping the temporary tables explicitly, but letting SQL Server handle this cleanup automatically and thus take advantage of the Optimistic Latching Algorithm, which helps prevent contention on TempDB [2].
5) There are others who stumbled over this issue (see [1]).
6) The solution has been tested successfully also in SQL databases.
7) The whole code must be run together because the temporary table seems to have only a transitory scope! An attempt to rerun the last SELECT from #tables raises the error: "Invalid object name '#tables'"
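As mentioned in the first comment, the logic can be encapsulated in a stored procedure. Below is a minimal sketch for retrieving the record count of a single table (the dbo.pGetRecordCount name is hypothetical, and the sketch wasn't retested in a Fabric warehouse):

-- sketch: encapsulating the record count retrieval for a single table
CREATE OR ALTER PROCEDURE dbo.pGetRecordCount (
  @table_name nvarchar(150)
, @number_records bigint OUTPUT
)
AS
BEGIN
    DECLARE @sql nvarchar(250);

    SET @sql = CONCAT(N'SELECT @NumberRecords = count(*) FROM ', @table_name);

    -- get the number of records via dynamic SQL
    EXEC sp_executesql @sql
    , N'@NumberRecords bigint OUTPUT'
    , @NumberRecords = @number_records OUTPUT;
END
GO

-- test call (replace the table name with one from your environment)
DECLARE @number_records bigint;

EXEC dbo.pGetRecordCount 'dbo.Trip', @number_records OUTPUT;

SELECT @number_records number_records;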

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Koen Verbeeck (2024) Get row counts of all tables in a Microsoft Fabric warehouse [link]
[2] Haripriya SB (2024) Do NOT drop #temp tables [link]

22 January 2025

🏭🗒️Microsoft Fabric: Clone Tables in Warehouses [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 22-Jan-2025

[Microsoft Fabric] Zero-copy Clone

  • {def} a replica of an existing OneLake table created by copying existing table's metadata and referencing its data files [1]
    • the metadata is copied while the underlying data of the table stored as parquet files is not copied [1]
    • its creation is like creating a delta table [1]
    • DML/DDL changes are independent
      • changes on the source are not reflected in the clone table [1]
      • changes on the clone table are not reflected on the source [1]
    • can be created within or across schemas in a warehouse [1]
    • created based on either:
      • current point-in-time
        • based on the present state of the table [1]
      • previous point-in-time
        • based on a point-in-time up to seven days in the past
          • the table clone contains the data as it appeared at a desired past point in time
          • all CRUD operations are retained for seven calendar days
        • created with a timestamp based on UTC
  • {characteristic} autonomous existence
    • the original source and the clones can be deleted without any constraints [1]
    • once a clone is created, it remains in existence until deleted by the user [1]
  • {characteristic} inherits 
    • object-level SQL security from the source table of the clone [1]
      • DENY permission can be set on the table clone if desired [1]
        • the workspace roles provide read access by default [1]
    • all attributes that exist at the source table, whether the clone was created within the same schema or across different schemas in a warehouse [1]
    • the primary and unique key constraints defined in the source table [1]
  • a read-only delta log is created for every table clone that is created within the Warehouse [1]
  • {benefit} facilitates development and testing processes 
    • by creating copies of tables in lower environments [1]
  • {benefit} provides consistent reporting and zero-copy duplication of data for analytical workloads and ML modeling and testing [1]
  • {benefit} provides the capability of data recovery in the event of a failed release or data corruption by retaining the previous state of data [1]
  • {benefit} helps create historical reports that reflect the state of data as it existed as of a specific point-in-time in the past [1]
  • {limitation} table clones across warehouses in a workspace are not currently supported [1]
  • {limitation} table clones across workspaces are not currently supported [1]
  • {limitation} clone table is not supported on the SQL analytics endpoint of the Lakehouse [1]
  • {limitation} clone of a warehouse or schema is currently not supported [1]
  • {limitation} table clones based on a point in time earlier than the seven-day retention period cannot be created [1]
  • {limitation} cloned tables do not currently inherit row-level security or dynamic data masking [1]
  • {limitation} changes to the table schema prevent a clone from being created prior to the table schema change [1]
  • {best practice} create the clone tables in dedicated schema(s)
  • [syntax] CREATE TABLE <schema.clone_table_name> AS CLONE OF <schema.table_name>
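For example, a few calls based on the above syntax (the table and schema names are hypothetical; see [4] for the full syntax, including the point-in-time clause):

-- clone based on the current point-in-time
CREATE TABLE [dbo].[DimCustomer_clone] AS CLONE OF [dbo].[DimCustomer];

-- clone across schemas within the same warehouse
CREATE TABLE [backup].[DimCustomer_clone] AS CLONE OF [dbo].[DimCustomer];

-- clone based on a previous point-in-time (UTC, within the last seven days)
CREATE TABLE [dbo].[DimCustomer_20250120] AS CLONE OF [dbo].[DimCustomer]
AT '2025-01-20T10:00:00.000';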

Previous Post  <<||>> Next Post

References:
[1] Microsoft Learn (2023) Clone table in Microsoft Fabric [link]
[2] Microsoft Learn (2024) Tutorial: Clone tables in the Fabric portal [link]
[3] Microsoft Learn (2024) Tutorial: Clone a table with T-SQL in a Warehouse [link]
[4] Microsoft Learn (2024) SQL: CREATE TABLE AS CLONE OF [link]

🏭🗒️Microsoft Fabric: Folders [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 22-Jan-2025

[Microsoft Fabric] Folders

  • {def} organizational units inside a workspace that enable users to efficiently organize and manage artifacts in the workspace [1]
  • identifiable by its name
    • {constraint} must be unique in a folder or at the root level of the workspace
    • {constraint} can’t include certain special characters [1]
      • C0 and C1 control codes [1]
      • leading or trailing spaces [1]
      • characters: ~"#.&*:<>?/{|} [1]
    • {constraint} can’t have system-reserved names
      • e.g. $recycle.bin, recycled, recycler.
    • {constraint} its length can't exceed 255 characters
  • {operation} create folder
    • can be created in
      • an existing folder (aka nested subfolder) [1]
        • {restriction} a maximum of 10 levels of nested subfolders can be created [1]
        • up to 10 folders can be created in the root folder [1]
        • {benefit} provide a hierarchical structure for organizing and managing items [1]
      • the root
  • {operation} move folder
  • {operation} rename folder
    • the same rules apply as for folders’ creation [1]
  • {operation} delete folder
    • {restriction} currently only empty folders can be deleted [1]
      • {recommendation} make sure the folder is empty [1]
  • {operation} create item in folder
    • {restriction} certain items can’t be created in a folder
      • dataflows gen2
      • streaming semantic models
      • streaming dataflows
    • ⇐ items created from the home page or the Create hub, are created at the root level of the workspace [1]
  • {operation} move file(s) between folders [1]
  • {operation} publish to folder [1]
    •   Power BI reports can be published to specific folders
      • {restriction} folders' name must be unique throughout an entire workspace, regardless of their location [1]
        • when publishing a report to a workspace that has another report with the same name in a different folder, the report will publish to the location of the already existing report [1]
  • {limitation} may not be supported by certain features
    •   e.g. Git
  • {recommendation} use folders to organize workspaces [1]
  • {permissions}
    • inherit the permissions of the workspace where they're located [1] [2]
    • workspace admins, members, and contributors can create, modify, and delete folders in the workspace [1]
    • viewers can only view folder hierarchy and navigate in the workspace [1]
  • [deployment pipelines] when deploying items in folders to a different stage, the folder hierarchy is automatically applied [2]

Previous Post  <<||>>  Next Post

References:
[1] Microsoft Fabric (2024) Create folders in workspaces [link]
[2] Microsoft Fabric (2024) The deployment pipelines process [link]
[3] Microsoft Fabric Updates Blog (2025) Define security on folders within a shortcut using OneLake data access roles [link]
[4] Microsoft Fabric Updates Blog (2025) Announcing the General Availability of Folder in Workspace [link]
[5] Microsoft Fabric Updates Blog (2025) Announcing Folder in Workspace in Public Preview [link]
[6] Microsoft Fabric Updates Blog (2025) Getting the size of OneLake data items or folders [link]

21 January 2025

🧊🗒️Data Warehousing: Extract, Transform, Load (ETL) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 21-Jan-2025

[Data Warehousing] Extract, Transform, Load (ETL)  

  • {def} automated process which takes raw data, extracts the data required for further processing, transforms it into a format that addresses business' needs, and loads it into the destination repository (e.g. data warehouse)
    • includes 
      • the transportation of data
      • overlaps between stages
      • changes in flow 
        • due to 
          • new technologies
          • changing requirements
      • changes in scope
      • troubleshooting
        • due to data mismatches
  • {step} extraction
    • data is extracted directly from the source systems or intermediate repositories
      • data may be made available in intermediate repositories, when the direct access to the source system is not possible
        •  this approach can add a complexity layer
    •  {substep} data validation
      • an automated process that validates whether data pulled from sources has the expected values
      • relies on a validation engine
        • rejects data if it falls outside the validation rules
        • analyzes rejected records on an ongoing basis to
          • identify what went wrong
          • correct the source data
          • modify the extraction to resolve the problem in the next batches
  • {step} transform
    • transforms the data, removing extraneous or erroneous data
    • applies business rules 
    • checks data integrity
      • ensures that the data is not corrupted in the source or corrupted by ETL
      • may ensure no data was dropped in previous stages
    • aggregates the data if necessary
  • {step} load (see the T-SQL sketch at the end of this list)
    • {substep} store the data into a staging layer
      • transformed data are not loaded directly into the target but staged into an intermediate layer (e.g. database)
      • {advantage} makes it easier to roll back, if something went wrong
      • {advantage} allows to develop the logic iteratively and publish the data only when needed 
      • {advantage} can be used to generate audit reports for 
        • regulatory compliance 
        • diagnose and repair of data problems
      • modern ETL processes perform transformations in place, instead of in staging areas
    • {substep} publish the data to the target
      • loads the data into the target table(s)
      • {scenario} the existing data are overwritten every time the ETL pipeline loads a new batch
        • this might happen daily, weekly, or monthly
      • {scenario} add new data without overwriting the existing data
        • the timestamp can indicate the data is new
      • {recommendation} prevent the loading process from erroring out due to disk space and performance limitations
  • {approach} building an ETL infrastructure
    • involves integrating data from one or more data sources and testing the overall processes to ensure the data is processed correctly
      • recurring process
        • e.g. data used for reporting
      • one-time process
        • e.g. data migration
    • may involve 
      • multiple source or destination systems 
      • different types of data
        • e.g. reference, master and transactional data
        • ⇐ may have complex dependencies
      • different level of structuredness
        • e.g. structured, semistructured, nonstructured 
      • different data formats
      • data of different quality
      • different ownership 
  • {recommendation} consider ETL best practices
    • {best practice} define requirements upfront in a consolidated and consistent manner
      • allows to set clear expectations, consolidate the requirements, estimate the effort and costs, respectively get the sign-off
      • the requirements may involve all the aspects of the process
        • e.g. data extraction, data transformation, standard formatting, etc.
    • {best practice} define a high level strategy
      • allows to define the road ahead, risks and other aspects 
      • allows to provide transparency
      • this may be part of a broader strategy that can be referenced 
    • {best practice} align the requirements and various aspects to the strategies existing in the organization
      • allows to consolidate the various efforts and make sure that the objectives, goals and requirements are aligned
      • e.g. IT, business, Information Security, Data Management strategies
    • {best practice} define the scope upfront
      • allows to better estimate the effort and validate the outcomes
      • even if the scope may change in time, this allows to provide transparency and can be used as a basis for the time and costs estimations
    • {best practice} manage the effort as a project and use a suitable Project Management methodology
      • allows to apply structured well-established PM practices 
      • it might be suitable to adapt the methodology to the project's particularities
    • {best practice} convert data to standard formats to standardize data processing
      • allows to reduce the volume of issues resulting from data type mismatches
      • applies mainly to dates, numeric or other values for which can be defined standard formats
    • {best practice} clean the data in the source systems, when cost-effective
      • allows to reduce the overall effort, especially when this is done in advance
      • this should be based ideally on the scope
    • {best practice} define and enforce data ownership
      • allows to enforce clear responsibilities across the various processes
      • allows to reduce the overall effort
    • {best practice} document data dependencies
      • document the dependencies existing in the data at the various levels
    • {best practice} log (protocol) the data movement from source(s) to destination(s) in terms of data volume
      • allows to provide transparency into the data movement process
      • allows to identify gaps in the data or broader issues
      • can be used for troubleshooting and understanding the overall data growth
  • {recommendation} consider proven systems, architectures and methodologies
    • allows to minimize the overall effort and costs associated with the process
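As referenced in the load step above, a minimal T-SQL sketch of the staging/publish pattern could look as follows (the schema and table names, src, stg, dwh and Customers, are hypothetical and only meant to illustrate the flow):

-- sketch: load a hypothetical Customers feed via a staging layer

-- stage the extracted data
TRUNCATE TABLE stg.Customers;

INSERT INTO stg.Customers (CustomerId, CustomerName, Country, ModifiedDate)
SELECT CustomerId
, CustomerName
, Country
, ModifiedDate
FROM src.Customers; -- extraction result (e.g. external table, flat-file import)

-- basic validation: reject the records without a business key
DELETE FROM stg.Customers
WHERE CustomerId IS NULL;

-- transformation: standardize the formats
UPDATE stg.Customers
SET CustomerName = Trim(CustomerName)
, Country = COALESCE(Country, 'Unknown');

-- publish: overwrite the target with the new batch
TRUNCATE TABLE dwh.Customers;

INSERT INTO dwh.Customers (CustomerId, CustomerName, Country, ModifiedDate)
SELECT CustomerId
, CustomerName
, Country
, ModifiedDate
FROM stg.Customers;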

20 January 2025

🏭🗒️Microsoft Fabric: [Azure] Service Principals (SPN) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 20-Jan-2025

[Azure] Service Principal (SPN)  

  • {def} a non-human, application-based security identity used by applications or automation tools to access specific Azure resources [1]
    • can be assigned precise permissions, making them perfect for automated processes or background services
      • allows to minimize the risks of human error and identity-based vulnerabilities
      • supported in datasets, Gen1/Gen2 dataflows, datamarts [2]
      • authentication type 
        • supported only by [2]
          • Azure Data Lake Storage
          • Azure Data Lake Storage Gen2
          • Azure Blob Storage
          • Azure Synapse Analytics
          • Azure SQL Database
          • Dataverse
          • SharePoint online
        • doesn’t support
          • SQL data source with Direct Query in datasets [2]
  • when registering a new application in Microsoft Entra ID, a SPN is automatically created for the app registration [4]
    • the access to resources is restricted by the roles assigned to the SPN
      • ⇒ gives control over which resources can be accessed and at which level [4]
    • {recommendation} use SPN with automated tools [4]
      • rather than allowing them to sign in with a user identity  [4]
    • {prerequisite} an active Microsoft Entra user account with sufficient permissions to 
      • register an application with the tenant [4]
      • assign to the application a role in the Azure subscription [4]
      •  requires Application.ReadWrite.All permission [4]
  • extended to support Fabric Data Warehouses [1]
    • {benefit} automation-friendly API Access
      • allows to create, update, read, and delete Warehouse items via Fabric REST APIs using service principals [1]
      • enables to automate repetitive tasks without relying on user credentials [1]
        • e.g. provisioning or managing warehouses
        • increases security by limiting human error
      • the warehouses thus created, will be displayed in the Workspace list view in Fabric UI, with the Owner name of the SPN [1]
      • applicable to users with administrator, member, or contributor workspace role [3]
      • minimizes risk
        • the warehouses created with delegated account or fixed identity (owner’s identity) will stop working when the owner leaves the organization [1]
          • Fabric requires the user to login every 30 days to ensure a valid token is provided for security reasons [1]
    • {benefit} seamless integration with Client Tools: 
      • tools like SSMS can connect to the Fabric DWH using SPN [1]
      • SPN provides secure access for developers to 
        • run COPY INTO
          • with and without firewall enabled storage [1]
        • run any T-SQL query programmatically on a schedule with ADF pipelines [1]
    • {benefit} granular access control
      • Warehouses can be shared with an SPN through the Fabric portal [1]
        • once shared, administrators can use T-SQL commands to assign specific permissions to SPN [1]
          • allows to control precisely which data and operations an SPN has access to  [1]
            • GRANT SELECT ON <table name> TO <Service principal name>  
      • warehouses' ownership can be changed from an SPN to user, and vice-versa [3]
    • {benefit} improved DevOps and CI/CD Integration
      • SPN can be used to automate the deployment and management of DWH resources [1]
        •  ensures faster, more reliable deployment processes while maintaining strong security postures [1]
    • {limitation} default semantic models are not supported for SPN created warehouses [3]
      • ⇒ features such as listing tables in dataset view, creating report from the default dataset don’t work [3]
    • {limitation} SPN for SQL analytics endpoints is not currently supported
    • {limitation} SPNs are currently not supported for COPY INTO error files [3]
      • ⇐ Entra ID credentials are not supported as well [3]
    • {limitation} SPNs are not supported for GIT APIs. SPN support exists only for Deployment pipeline APIs [3]
    • monitoring tools (see the query sketch after this list)
      • [DMV] sys.dm_exec_sessions.login_name column [3] 
      • [Query Insights] queryinsights.exec_requests_history.login_name [3]
      • Query activity
        • submitter column in Fabric query activity [3]
      • Capacity metrics app: 
        • compute usage for warehouse operations performed by SPN appears as the Client ID under the User column in Background operations drill through table [3]
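For example, a minimal query sketch based on the above-mentioned DMV for identifying the sessions opened by an SPN (replace the placeholder with the SPN's display name):

-- sessions opened by a service principal
SELECT session_id
, login_name
, login_time
, status
FROM sys.dm_exec_sessions
WHERE login_name = '<service principal name>';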

References:
[1] Microsoft Fabric Updates Blog (2024) Service principal support for Fabric Data Warehouse [link]
[2] Microsoft Fabric Learn (2024) Service principal support in Data Factory [link]
[3] Microsoft Fabric Learn (2024) Service principal in Fabric Data Warehouse [link]
[4] Microsoft Fabric Learn (2024) Register a Microsoft Entra app and create a service principal [link]
[5] Microsoft Fabric Updates Blog (2024) Announcing Service Principal support for Fabric APIs [link]
 
Acronyms:
ADF - Azure Data Factory
API - Application Programming Interface
CI/CD - Continuous Integration/Continuous Deployment
DMV - Dynamic Management View
DWH - Data Warehouse
SPN - service principal
SSMS - SQL Server Management Studio

17 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VIII: Permissions) [new feature]

Data-based solutions usually target a set of users who (ideally) have restricted permissions to the functionality. Therefore, as part of the process several personas are defined that target different use cases, for which the permissions must be restricted accordingly. 

In the simplest scenario the user must have access to the underlying objects for querying the data. Supposing that an Entra User was created already, the respective user must be given access also in the Fabric database (see [1], [2]). From the database's main menu follow the path to assign read permissions:
Security >> Manage SQL Security >> (select role: db_datareader)

Manage SQL Security

Manage access >> Add >> (search for User)

Manage access

(select user) >> Share database >> (select additional permissions) >> Save

Manage additional permissions

The easiest way to test whether the permissions work before building the functionality is to login over SQL Server Management Studio (SSMS) and check the access using the Microsoft Entra MFA. Ideally, one should have a User's credentials that can be used only for testing purposes. After the above setup was done, the new User was able to access the data. 

A second User can be created for testing with the maximum of permissions allowed on the SQL database side, which is useful for troubleshooting. Alternatively, one can use only one User for testing and assign or remove the permissions as needed by the test scenario. 

It's a good idea to try to understand what's happening in the background. For example, the expectation was that for the Entra User created above also a SQL user is created, which doesn't seem to be the case, at least per current functionality available. 

Before diving deeper, it's useful to retrieve the User's details: 

-- retrieve current user
SELECT SUser_Name() sys_user_name
, User_Id() user_id 
, USER_NAME() user_name
, current_user [current_user]
, user [user]; 
Output:
sys_user_name user_id user_name current_user user
JamesClavell@[domain].onmicrosoft.com 0 JamesClavell@[domain].onmicrosoft.com JamesClavell@[domain].onmicrosoft.com JamesClavell@[domain].onmicrosoft.com

Retrieving the current User is useful especially when testing functionality in parallel with different Users. Strangely, the User's ID is 0 when only read permissions were assigned. However, a valid User identifier is added, for example, when the db_datawriter role is also assigned to the User. Removing the db_datawriter role afterwards keeps, as expected, the User's ID. For troubleshooting purposes, at least per the current functionality, it might be a good idea to create the Users with a valid User ID (e.g. by assigning temporarily the db_datawriter role to the User). 
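If the temporary role assignment should be done over T-SQL rather than over the UI, a minimal sketch follows (under the assumption that altering database roles via T-SQL is allowed for the current user in the Fabric SQL database; replace the placeholder with the Entra User's principal name):

-- assign the db_datawriter role temporarily to the User
ALTER ROLE db_datawriter ADD MEMBER [<user principal name>];

-- remove the role afterwards; the User keeps its ID
ALTER ROLE db_datawriter DROP MEMBER [<user principal name>];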

The next step is to look at the Users with access to the database:

-- database access 
SELECT USR.uid
, USR.name
--, USR.sid 
, USR.hasdbaccess 
, USR.islogin
, USR.issqluser
--, USR.createdate 
--, USR.updatedate 
FROM sys.sysusers USR
WHERE USR.hasdbaccess = 1
  AND USR.islogin = 1
ORDER BY uid
Output:
uid name hasdbaccess islogin issqluser
1 dbo 1 1 1
6 CharlesDickens@[...].onmicrosoft.com 1 1 0
7 TestUser 1 1 1
9 JamesClavell@[...].onmicrosoft.com 1 1 0

For testing purposes, besides the standard dbo user and the two Entra-based users, a SQL User was also created, to which access to the SalesLT schema was granted (see initial post):

-- create the user
CREATE USER TestUser WITHOUT LOGIN;

-- assign access to SalesLT schema 
GRANT SELECT ON SCHEMA::SalesLT TO TestUser;
  
-- test impersonation (run together)
EXECUTE AS USER = 'TestUser';

SELECT * FROM SalesLT.Customer;

REVERT; 

Notes:
1) Strangely, even if access was given explicitly only to the SalesLT schema, the TestUser User has access also to sys.sysusers and other DMVs. That's valid also for the access over SSMS.
2) For the above created User there are no records in the sys.user_token and sys.login_token DMVs, in contrast with the user(s) created for administering the SQL database. 

Let's look at the permissions granted explicitly:

-- permissions granted explicitly
SELECT DPR.principal_id
, DPR.name
, DPR.type_desc
, DPR.authentication_type_desc
, DPE.state_desc
, DPE.permission_name
FROM sys.database_principals DPR
     JOIN sys.database_permissions DPE
	   ON DPR.principal_id = DPE.grantee_principal_id
WHERE DPR.principal_id != 0 -- removing the public user
ORDER BY DPR.principal_id
, DPE.permission_name;
Result:
principal_id name type_desc authentication_type_desc state_desc permission_name
1 dbo SQL_USER INSTANCE GRANT CONNECT
6 CharlesDickens@[...].onmicrosoft.com EXTERNAL_USER EXTERNAL GRANT AUTHENTICATE
6 CharlesDickens@[...].onmicrosoft.com EXTERNAL_USER EXTERNAL GRANT CONNECT
7 TestUser SQL_USER NONE GRANT CONNECT
7 TestUser SQL_USER NONE GRANT SELECT
9 JamesClavell@[...].onmicrosoft.com EXTERNAL_USER EXTERNAL GRANT CONNECT

During troubleshooting it might be useful to check current user's permissions at the various levels via sys.fn_my_permissions:

-- retrieve database-scoped permissions for current user
SELECT *
FROM sys.fn_my_permissions(NULL, 'Database');

-- retrieve schema-scoped permissions for current user
SELECT *
FROM sys.fn_my_permissions('SalesLT', 'Schema');

-- retrieve object-scoped permissions for current user
SELECT *
FROM sys.fn_my_permissions('SalesLT.Customer', 'Object')
WHERE permission_name = 'SELECT';

Notes:
1) See also [1] and [4] in what concerns the limitations that apply to managing permissions in SQL databases.

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Microsoft Fabric: Share your SQL database and manage permissions [link]
[2] Microsoft Learn (2024) Microsoft Fabric: Share data and manage access to your SQL database in Microsoft Fabric  [link]
[3] Microsoft Learn (2024) Authorization in SQL database in Microsoft Fabric [link]
[4] Microsoft Learn (2024) Authentication in SQL database in Microsoft Fabric [link]

[5] Microsoft Fabric Learn (2025) Manage access for SQL databases in Microsoft Fabric with workspace roles and item permissions [link]
