22 January 2025

🏭🗒️Microsoft Fabric: Clone Tables in Warehouses [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 22-Jan-2025

[Microsoft Fabric] Zero-copy Clone

  • {def} a replica of an existing OneLake table created by copying existing table's metadata and referencing its data files [1]
    • the metadata is copied while the underlying data of the table stored as parquet files is not copied [1]
    • its creation is like creating a delta table [1]
    • DML/DDL changes
      • made on the source table are not reflected in the clone table [1]
      • made on the clone table are not reflected in the source table [1]
    • can be created within or across schemas in a warehouse [1]
    • created based on either:
      • current point-in-time
        • based on the present state of the table [1]
      • previous point-in-time
        • based on a point-in-time up to seven days in the past
          • the table clone contains the data as it appeared at a desired past point in time
          • all CRUD operations are retained for seven calendar days
        • created with a timestamp based on UTC
  • {characteristic} autonomous existence
    • the original source and the clones can be deleted without any constraints [1]
    • once a clone is created, it remains in existence until deleted by the user [1]
  • {characteristic} inherits 
    • object-level SQL security from the source table of the clone [1]
      • DENY permission can be set on the table clone if desired [1]
        • the workspace roles provide read access by default [1]
    • all attributes that exist at the source table, whether the clone was created within the same schema or across different schemas in a warehouse [1]
    • the primary and unique key constraints defined in the source table [1]
  • a read-only delta log is created for every table clone that is created within the Warehouse [1]
  • {benefit} facilitates development and testing processes 
    • by creating copies of tables in lower environments [1]
  • {benefit} provides consistent reporting and zero-copy duplication of data for analytical workloads and ML modeling and testing [1]
  • {benefit} provides the capability of data recovery in the event of a failed release or data corruption by retaining the previous state of data [1]
  • {benefit} helps create historical reports that reflect the state of data as it existed as of a specific point-in-time in the past [1]
  • {limitation} table clones across warehouses in a workspace are not currently supported [1]
  • {limitation} table clones across workspaces are not currently supported [1]
  • {limitation} clone table is not supported on the SQL analytics endpoint of the Lakehouse [1]
  • {limitation} clone of a warehouse or schema is currently not supported [1]
  • {limitation} table clones for a point in time prior to the seven-day retention period cannot be created [1]
  • {limitation} cloned tables do not currently inherit row-level security or dynamic data masking [1]
  • {limitation} changes to the table schema prevent a clone from being created prior to the table schema change [1]
  • {best practice} create the clone tables in dedicated schema(s)
  • [syntax] CREATE TABLE <schema.clone_table_name> AS CLONE OF <schema.table_name>
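
A minimal sketch based on this syntax (the table names are illustrative; the optional AT clause for a past point in time, per [4], takes a UTC timestamp):

-- clone based on the current point-in-time
CREATE TABLE dbo.DimCustomer_Clone AS CLONE OF dbo.DimCustomer;

-- clone based on a past point-in-time (UTC, within the seven-day retention period)
CREATE TABLE dbo.DimCustomer_Clone_PIT AS CLONE OF dbo.DimCustomer
AT '2025-01-20T10:00:00.000';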


References:
[1] Microsoft Learn (2023) Clone table in Microsoft Fabric [link]
[2] Microsoft Learn (2024) Tutorial: Clone tables in the Fabric portal [link]
[3] Microsoft Learn (2024) Tutorial: Clone a table with T-SQL in a Warehouse [link]
[4] Microsoft Learn (2024) SQL: CREATE TABLE AS CLONE OF [link]

🏭🗒️Microsoft Fabric: Folders [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 22-Jan-2025

[Microsoft Fabric] Folders

  • {def} organizational units inside a workspace that enable users to efficiently organize and manage artifacts in the workspace [1]
  • identifiable by its name
    • {constraint} must be unique in a folder or at the root level of the workspace
    • {constraint} can’t include certain special characters [1]
      • C0 and C1 control codes [1]
      • leading or trailing spaces [1]
      • characters: ~"#.&*:<>?/{|} [1]
    • {constraint} can’t have system-reserved names
      • e.g. $recycle.bin, recycled, recycler.
    • {constraint} its length can't exceed 255 characters
  • {operation} create folder
    • can be created in
      • an existing folder (aka nested subfolder) [1]
        • {restriction} a maximum of 10 levels of nested subfolders can be created [1]
        • up to 10 folders can be created in the root folder [1]
        • {benefit} provide a hierarchical structure for organizing and managing items [1]
      • the root
  • {operation} move folder
  • {operation} rename folder
    • the same rules apply as for folders' creation [1]
  • {operation} delete folder
    • {restriction} currently only empty folders can be deleted [1]
      • {recommendation} make sure the folder is empty before deleting it [1]
  • {operation} create item in folder
    • {restriction} certain items can’t be created in a folder
      • dataflows gen2
      • streaming semantic models
      • streaming dataflows
    • ⇐ items created from the home page or the Create hub are created at the root level of the workspace [1]
  • {operation} move file(s) between folders [1]
  • {operation} publish to folder [1]
    •   Power BI reports can be published to specific folders
      • {restriction} report names must be unique throughout an entire workspace, regardless of their location [1]
        • when publishing a report to a workspace that has another report with the same name in a different folder, the report will publish to the location of the already existing report [1]
  • {limitation} may not be supported by certain features
    •   e.g. Git
  • {recommendation} use folders to organize the items in a workspace [1]
  • {permissions}
    • inherit the permissions of the workspace where they're located [1] [2]
    • workspace admins, members, and contributors can create, modify, and delete folders in the workspace [1]
    • viewers can only view folder hierarchy and navigate in the workspace [1]
  • [deployment pipelines] when deploying items in folders to a different stage, the folder hierarchy is automatically applied [2]


References:
[1] Microsoft Fabric (2024) Create folders in workspaces [link]
[2] Microsoft Fabric (2024) The deployment pipelines process [link]
[3] Microsoft Fabric Updates Blog (2025) Define security on folders within a shortcut using OneLake data access roles [link]
[4] Microsoft Fabric Updates Blog (2025) Announcing the General Availability of Folder in Workspace [link]
[5] Microsoft Fabric Updates Blog (2025) Announcing Folder in Workspace in Public Preview [link]
[6] Microsoft Fabric Updates Blog (2025) Getting the size of OneLake data items or folders [link]

21 January 2025

🧊🗒️Data Warehousing: Extract, Transform, Load (ETL) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 21-Jan-2025

[Data Warehousing] Extract, Transform, Load (ETL)  

  • {def} automated process which takes raw data, extracts the data required for further processing, transforms it into a format that addresses business' needs, and loads it into the destination repository (e.g. data warehouse)
    • includes 
      • the transportation of data
      • overlaps between stages
      • changes in flow 
        • due to 
          • new technologies
          • changing requirements
      • changes in scope
      • troubleshooting
        • due to data mismatches
  • {step} extraction
    • data is extracted directly from the source systems or intermediate repositories
      • data may be made available in intermediate repositories, when the direct access to the source system is not possible
        • this approach can add a layer of complexity
    •  {substep} data validation
      • an automated process that validates whether data pulled from sources has the expected values
      • relies on a validation engine
        • rejects data if it falls outside the validation rules
        • analyzes rejected records on an ongoing basis to
          • identify what went wrong
          • correct the source data
          • modify the extraction to resolve the problem in the next batches
  • {step} transform
    • transforms the data, removing extraneous or erroneous data
    • applies business rules 
    • checks data integrity
      • ensures that the data is not corrupted in the source or corrupted by ETL
      • may ensure no data was dropped in previous stages
    • aggregates the data if necessary
  • {step} load
    • {substep} store the data into a staging layer (see the T-SQL sketch after this list)
      • transformed data are not loaded directly into the target but staged into an intermediate layer (e.g. database)
      • {advantage} makes it easier to roll back, if something went wrong
      • {advantage} allows to develop the logic iteratively and publish the data only when needed 
      • {advantage} can be used to generate audit reports for 
        • regulatory compliance 
        • diagnose and repair of data problems
      • modern ETL processes perform transformations in place, instead of in staging areas
    • {substep} publish the data to the target
      • loads the data into the target table(s)
      • {scenario} the existing data are overwritten every time the ETL pipeline loads a new batch
        • this might happen daily, weekly, or monthly
      • {scenario} add new data without overriding
        • the timestamp can indicate the data is new
      • {recommendation} prevent the loading process from erroring out due to disk space and performance limitations
  • {approach} building an ETL infrastructure
    • involves integrating data from one or more data sources and testing the overall processes to ensure the data is processed correctly
      • recurring process
        • e.g. data used for reporting
      • one-time process
        • e.g. data migration
    • may involve 
      • multiple source or destination systems 
      • different types of data
        • e.g. reference, master and transactional data
        • ⇐ may have complex dependencies
      • different degrees of structure
        • e.g. structured, semi-structured, unstructured
      • different data formats
      • data of different quality
      • different ownership 
  • {recommendation} consider ETL best practices
    • {best practice} define requirements upfront in a consolidated and consistent manner
      • allows to set clear expectations, consolidate the requirements, estimate the effort and costs, respectively get the sign-off
      • the requirements may involve all the aspects of the process
        • e.g. data extraction, data transformation, standard formatting, etc.
    • {best practice} define a high level strategy
      • allows to define the road ahead, risks and other aspects 
      • allows to provide transparency
      • this may be part of a broader strategy that can be referenced 
    • {best practice} align the requirements and the various aspects to the strategies existing in the organization
      • allows to consolidate the various efforts and make sure that the objectives, goals and requirements are aligned
      • e.g. IT, business, Information Security, Data Management strategies
    • {best practice} define the scope upfront
      • allows to better estimate the effort and validate the outcomes
      • even if the scope may change in time, this allows to provide transparency and can be used as a basis for the time and cost estimations
    • {best practice} manage the effort as a project and use a suitable Project Management methodology
      • allows to apply structured well-established PM practices 
      • it might be suitable to adapt the methodology to the project's particularities
    • {best practice} convert data to standard formats to standardize data processing
      • allows to reduce the volume of issues resulting from data type mismatches
      • applies mainly to dates, numeric values or other values for which standard formats can be defined
    • {best practice} clean the data in the source systems, when cost-effective
      • allows to reduce the overall effort, especially when this is done in advance
      • this should be based ideally on the scope
    • {best practice} define and enforce data ownership
      • allows to enforce clear responsibilities across the various processes
      • allows to reduce the overall effort
    • {best practice} document data dependencies
      • document the dependencies existing in the data at the various levels
    • {best practice} log the data movement from source(s) to destination(s) in terms of data volume
      • allows to provide transparency into the data movement process
      • allows to identify gaps in the data or broader issues
      • can be used for troubleshooting and understanding the overall data growth
  • {recommendation} consider proven systems, architectures and methodologies
    • allows to minimize the overall effort and costs associated with the process
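
A minimal T-SQL sketch of the staging and publishing substeps mentioned above (the stg.Sales, src.SalesExtract and dbo.Sales objects are illustrative):

-- 1) load the transformed batch into the staging layer
TRUNCATE TABLE stg.Sales;

INSERT INTO stg.Sales (SalesOrderID, OrderDate, CustomerID, Amount)
SELECT SalesOrderID
, OrderDate
, CustomerID
, Amount
FROM src.SalesExtract; -- output of the transform step

-- 2) publish the staged batch into the target without overwriting the existing history
MERGE dbo.Sales TGT
USING stg.Sales SRC
   ON TGT.SalesOrderID = SRC.SalesOrderID
WHEN MATCHED THEN
    UPDATE SET TGT.OrderDate = SRC.OrderDate
    , TGT.CustomerID = SRC.CustomerID
    , TGT.Amount = SRC.Amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (SalesOrderID, OrderDate, CustomerID, Amount)
    VALUES (SRC.SalesOrderID, SRC.OrderDate, SRC.CustomerID, SRC.Amount);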

20 January 2025

🏭🗒️Microsoft Fabric: [Azure] Service Principals (SPN) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 20-Jan-2025

[Azure] Service Principal (SPN)  

  • {def} a non-human, application-based security identity used by applications or automation tools to access specific Azure resources [1]
    • can be assigned precise permissions, making them perfect for automated processes or background services
      • allows to minimize the risks of human error and identity-based vulnerabilities
      • supported in datasets, Gen1/Gen2 dataflows, datamarts [2]
      • authentication type 
        • supported only by [2]
          • Azure Data Lake Storage
          • Azure Data Lake Storage Gen2
          • Azure Blob Storage
          • Azure Synapse Analytics
          • Azure SQL Database
          • Dataverse
          • SharePoint online
        • doesn’t support
          • SQL data source with Direct Query in datasets [2]
  • when registering a new application in Microsoft Entra ID, a SPN is automatically created for the app registration [4]
    • the access to resources is restricted by the roles assigned to the SPN
      • ⇒ gives control over which resources can be accessed and at which level [4]
    • {recommendation} use SPN with automated tools [4]
      • rather than allowing them to sign in with a user identity  [4]
    • {prerequisite} an active Microsoft Entra user account with sufficient permissions to 
      • register an application with the tenant [4]
      • assign to the application a role in the Azure subscription [4]
      •  requires Application.ReadWrite.All permission [4]
  • extended to support Fabric Data Warehouses [1]
    • {benefit} automation-friendly API Access
      • allows to create, update, read, and delete Warehouse items via Fabric REST APIs using service principals [1]
      • enables to automate repetitive tasks without relying on user credentials [1]
        • e.g. provisioning or managing warehouses
        • increases security by limiting human error
      • the warehouses thus created will be displayed in the Workspace list view in the Fabric UI, with the SPN as Owner [1]
      • applicable to users with administrator, member, or contributor workspace role [3]
      • minimizes risk
        • the warehouses created with delegated account or fixed identity (owner’s identity) will stop working when the owner leaves the organization [1]
          • Fabric requires the user to login every 30 days to ensure a valid token is provided for security reasons [1]
    • {benefit} seamless integration with Client Tools: 
      • tools like SSMS can connect to the Fabric DWH using SPN [1]
      • SPN provides secure access for developers to 
        • run COPY INTO
          • with and without firewall enabled storage [1]
        • run any T-SQL query programmatically on a schedule with ADF pipelines [1]
    • {benefit} granular access control
      • Warehouses can be shared with an SPN through the Fabric portal [1]
        • once shared, administrators can use T-SQL commands to assign specific permissions to SPN [1]
          • allows to control precisely which data and operations an SPN has access to [1] (see the sketch after this list)
            • e.g. GRANT SELECT ON <table name> TO <Service principal name>
      • warehouses' ownership can be changed from an SPN to user, and vice-versa [3]
    • {benefit} improved DevOps and CI/CD Integration
      • SPN can be used to automate the deployment and management of DWH resources [1]
        •  ensures faster, more reliable deployment processes while maintaining strong security postures [1]
    • {limitation} default semantic models are not supported for SPN created warehouses [3]
      • ⇒ features such as listing tables in the dataset view or creating a report from the default dataset don't work [3]
    • {limitation} SPN for SQL analytics endpoints is not currently supported
    • {limitation} SPNs are currently not supported for COPY INTO error files [3]
      • ⇐ Entra ID credentials are not supported as well [3]
    • {limitation} SPNs are not supported for Git APIs; SPN support exists only for Deployment pipeline APIs [3]
    • monitoring tools
      • [DMV] sys.dm_exec_sessions.login_name column [3] 
      • [Query Insights] queryinsights.exec_requests_history.login_name [3]
      • Query activity
        • submitter column in Fabric query activity [3]
      • Capacity metrics app: 
        • compute usage for warehouse operations performed by SPN appears as the Client ID under the User column in Background operations drill through table [3]
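
A hedged sketch of the granular access control mentioned above, after the warehouse was shared with a service principal (the DataOpsApp name and the tables are illustrative):

-- grant read access on a single table to the service principal
GRANT SELECT ON OBJECT::SalesLT.Customer TO [DataOpsApp];

-- explicitly deny access to a sensitive table
DENY SELECT ON OBJECT::SalesLT.CustomerAddress TO [DataOpsApp];

-- review the permissions granted to the service principal
SELECT DPR.name
, DPE.state_desc
, DPE.permission_name
FROM sys.database_principals DPR
     JOIN sys.database_permissions DPE
       ON DPR.principal_id = DPE.grantee_principal_id
WHERE DPR.name = 'DataOpsApp';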

References:
[1] Microsoft Fabric Updates Blog (2024) Service principal support for Fabric Data Warehouse [link]
[2] Microsoft Fabric Learn (2024) Service principal support in Data Factory [link]
[3] Microsoft Fabric Learn (2024) Service principal in Fabric Data Warehouse [link]
[4] Microsoft Fabric Learn (2024) Register a Microsoft Entra app and create a service principal [link]
[5] Microsoft Fabric Updates Blog (2024) Announcing Service Principal support for Fabric APIs [link]
 
Acronyms:
ADF - Azure Data Factory
API - Application Programming Interface
CI/CD - Continuous Integration/Continuous Deployment
DMV - Dynamic Management View
DWH - Data Warehouse
SPN - service principal
SSMS - SQL Server Management Studio

17 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VIII: Permissions) [new feature]

Data-based solutions usually target a set of users who (ideally) have restricted permissions to the functionality. Therefore, as part of the process, several personas are defined that target different use cases, for which the permissions must be restricted accordingly. 

In the simplest scenario the user must have access to the underlying objects for querying the data. Supposing that an Entra User was created already, the respective user must also be given access in the Fabric database (see [1], [2]). From the database's main menu, follow the path below to assign read permissions:
Security >> Manage SQL Security >> (select role: db_datareader)

Manage SQL Security

Manage access >> Add >> (search for User)

Manage access

(select user) >> Share database >> (select additional permissions) >> Save

Manage additional permissions
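
Alternatively, once the Entra user appears as a database principal (as in the sys.sysusers output further below), the same read access can be assigned via T-SQL (a sketch, with an illustrative user name):

-- assign read access via role membership
ALTER ROLE db_datareader ADD MEMBER [JamesClavell@contoso.onmicrosoft.com];

-- remove the role membership when no longer needed
ALTER ROLE db_datareader DROP MEMBER [JamesClavell@contoso.onmicrosoft.com];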

The easiest way to test whether the permissions work before building the functionality is to log in over SQL Server Management Studio (SSMS) and check the access using Microsoft Entra MFA. Ideally, one should have a User's credentials that can be used only for testing purposes. After the above setup was done, the new User was able to access the data. 

A second User can be created for testing with the maximum of permissions allowed on the SQL database side, which is useful for troubleshooting. Alternatively, one can use only one User for testing and assign or remove the permissions as needed by the test scenario. 

It's a good idea to try to understand what's happening in the background. For example, the expectation was that a SQL user would also be created for the Entra User created above, which doesn't seem to be the case, at least per the current functionality available. 

 Before diving deeper, it's useful to retrieve User's details: 

-- retrieve current user
SELECT SUser_Name() sys_user_name
, User_Id() user_id 
, USER_NAME() user_name
, current_user [current_user]
, user [user]; 
Output:
sys_user_name user_id user_name current_user user
JamesClavell@[domain].onmicrosoft.com 0 JamesClavell@[domain].onmicrosoft.com JamesClavell@[domain].onmicrosoft.com JamesClavell@[domain].onmicrosoft.com

Retrieving the current User is useful especially when testing functionality in parallel with different Users. Strangely, the User's ID is 0 when only read permissions were assigned. However, a valid User identifier is added, for example, when the db_datawriter role is also assigned to the User. Removing the db_datawriter role afterwards keeps, as expected, the User's ID. For troubleshooting purposes, at least per the current functionality, it might be a good idea to create the Users with a valid User ID (e.g. by temporarily assigning the db_datawriter role to the User). 

The next step is to look at the Users with access to the database:

-- database access 
SELECT USR.uid
, USR.name
--, USR.sid 
, USR.hasdbaccess 
, USR.islogin
, USR.issqluser
--, USR.createdate 
--, USR.updatedate 
FROM sys.sysusers USR
WHERE USR.hasdbaccess = 1
  AND USR.islogin = 1
ORDER BY uid
Output:
uid name hasdbaccess islogin issqluser
1 dbo 1 1 1
6 CharlesDickens@[...].onmicrosoft.com 1 1 0
7 TestUser 1 1 1
9 JamesClavell@[...].onmicrosoft.com 1 1 0

For testing purposes, besides the standard dbo user and the two Entra-based users, a SQL user was also created and granted access to the SalesLT schema (see the initial post):

-- create the user
CREATE USER TestUser WITHOUT LOGIN;

-- assign access to SalesLT schema 
GRANT SELECT ON SCHEMA::SalesLT TO TestUser;
  
-- test impersonation (run together)
EXECUTE AS USER = 'TestUser';

SELECT * FROM SalesLT.Customer;

REVERT; 

Notes:
1) Strangely, even if access was given explicitly only to the SalesLT schema, the TestUser User has access also to sys.sysusers and other DMVs. That's valid also for the access over SSMS.
2) For the above created User there are no records in the sys.user_token and sys.login_token DMVs, in contrast with the user(s) created for administering the SQL database. 
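
For reference, a minimal sketch of this check, done by impersonating the user and querying the two DMVs directly:

-- check the security tokens for the impersonated user (run together)
EXECUTE AS USER = 'TestUser';

SELECT * FROM sys.user_token;
SELECT * FROM sys.login_token;

REVERT;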

Let's look at the permissions granted explicitly:

-- permissions granted explicitly
SELECT DPR.principal_id
, DPR.name
, DPR.type_desc
, DPR.authentication_type_desc
, DPE.state_desc
, DPE.permission_name
FROM sys.database_principals DPR
     JOIN sys.database_permissions DPE
	   ON DPR.principal_id = DPE.grantee_principal_id
WHERE DPR.principal_id != 0 -- removing the public user
ORDER BY DPR.principal_id
, DPE.permission_name;
Result:
principal_id name type_desc authentication_type_desc state_desc permission_name
1 dbo SQL_USER INSTANCE GRANT CONNECT
6 CharlesDickens@[...].onmicrosoft.com EXTERNAL_USER EXTERNAL GRANT AUTHENTICATE
6 CharlesDickens@[...].onmicrosoft.com EXTERNAL_USER EXTERNAL GRANT CONNECT
7 TestUser SQL_USER NONE GRANT CONNECT
7 TestUser SQL_USER NONE GRANT SELECT
9 JamesClavell@[...].onmicrosoft.com EXTERNAL_USER EXTERNAL GRANT CONNECT

During troubleshooting it might be useful to check current user's permissions at the various levels via sys.fn_my_permissions:

-- retrieve database-scoped permissions for current user
SELECT *
FROM sys.fn_my_permissions(NULL, 'Database');

-- retrieve schema-scoped permissions for current user
SELECT *
FROM sys.fn_my_permissions('SalesLT', 'Schema');

-- retrieve object-scoped permissions for current user
SELECT *
FROM sys.fn_my_permissions('SalesLT.Customer', 'Object')
WHERE permission_name = 'SELECT';

Notes:
1) See also [1] and [4] regarding the limitations that apply to managing permissions in SQL databases.

Happy coding!


References:
[1] Microsoft Learn (2024) Microsoft Fabric: Share your SQL database and manage permissions [link]
[2] Microsoft Learn (2024) Microsoft Fabric: Share data and manage access to your SQL database in Microsoft Fabric  [link]
[3] Microsoft Learn (2024) Authorization in SQL database in Microsoft Fabric [link]
[4] Microsoft Learn (2024) Authentication in SQL database in Microsoft Fabric [link]

[5] Microsoft Fabric Learn (2025) Manage access for SQL databases in Microsoft Fabric with workspace roles and item permissions [link]

15 January 2025

🧭Business Intelligence: Perspectives (Part XXIII: In between the Many Destinations)

Business Intelligence Series

In too many cases the development of queries, respectively reports or data visualizations (aka artifacts) becomes a succession of drag & drops, formatting, (re)ordering things around, a bit of makeup, configuring a set of parameters, and the desired product is good to go! There seems nothing wrong with this approach as long as the outcomes meet users’ requirements, though it also gives the impression that this is all the process is about. 

Given a set of data entities, usually there are at least as many perspectives into the data as there are entities. Further perspectives can be found in exceptions and gaps in data, process variations, and the further aspects that can influence an artifact’s logic. All these aspects increase the overall complexity of the artifact, respectively of the development process. One guideline in handling all this is to keep the process in focus, and this starts with requirements’ elicitation and ends with the quality assurance and actual use.

Sometimes, the two worlds, the processes and their projection into the data and (data) models, don’t reflect the reality adequately and one needs to compromise, at least until the gaps are better addressed. Process redesign, data harmonization and further steps need to be considered, upon case, in multiple iterations that should converge to optimal solutions, at least in theory. 

Therefore, in the development process there should be a continuous exploration of the various aspects until an optimum solution is reached. Often, there can be a couple of competing forces that can pull the solution in two or more directions, and then compromising is necessary. Especially as part of continuous improvement initiatives there’s the tendency of optimizing processes locally to the detriment of the overall process, with all the consequences resulting from this. 

Unfortunately, many of the problems existing in organizations are ill-posed and misunderstood to the degree that, in extremis, more effort is wasted than the actual benefits justify. Optimization is a process of putting in balance all the important aspects, respectively of answering with agility to the changing nature of the business and environment. Ignoring the changing nature of the problems and their contexts is a recipe for failure on the long term. 

This implies that people in particular and organizations in general need to become and remain aware of the micro and macro changes occurring in organizations. Continuous learning is the key to cope with change. Organizations must learn to compromise and focus on what’s important, achievable and/or probable. Identifying, defining and following the value should be in an organization’s DNA. It also requires pragmatism (as opposed to idealism). Upon case, it may even require to say “no”, at least until the changes in the landscape allow a reevaluation of the various aspects.

One requires a lot from organizations when addressing optimization topics, especially when misalignment or important constraints or challenges may exist. Unfortunately, process related problems don’t always admit linear solutions. The nonlinear aspects are reflected especially when changing the scale or perspective, or when translating the issues or solutions from one area to another.

There are probably answers available in the afferent literature or in the approaches followed by other organizations. Reinventing the wheel is part of the game, though invention may require explorations outside of the optimal paths. Conversely, an organization that knows itself has more chances to cope with the challenges and opportunities altogether. 

A lot of what organizations do in a consistent manner looks occasionally like inertia, self-occupation, suboptimal or random behavior, in opposition to being self-driven, self-aware, or in self-control. It’s also true that these are ideal qualities or aspects of what organizations should become in time. 

14 January 2025

🧭Business Intelligence: Perspectives (Part XXII: Queries' Complexity)

Business Intelligence Series

Independently of whether they are standalone or encapsulated in database objects, the queries written can become complex in time, respectively difficult to comprehend and maintain. One can reduce the cognitive load by identifying the aspects that enable one’s intuition - order, similarity and origin being probably the most important aspects that help in coping with the inherent complexity. 

One should start with the table having the lowest level of detail, usually a transaction table that forms the backbone of a certain feature. For example, for Purchase Orders this could be upon case the distribution or line level. If Invoices are added to the logic, and there could be multiple invoice lines for a record from the former logic, then this becomes the new level of detail. Further on, if General Ledger lines are added, more likely this becomes the lowest level of detail, and so on.

This approach allows to keep a consistent way of building the queries while enabling to validate the record count, making sure that no duplicates are added to the logic. Conversely, one can also start from the table with the least detail (e.g. a header or master table) and add tables successively based on their level of detail, though the database engine may generate upon case a suboptimal plan. In addition, checking the cardinality may involve writing a second query. 

One should try to keep the tables belonging to the same entity together, when possible (e.g. Purchase Order vs. Vendor information). This approach allows to reduce the volume of work required to manage, review, test and understand the query later. If the blocks are too big, then occasionally it makes sense to bring pieces of logic into CTEs (Common Table Expressions), or much better into views that allow to better encapsulate and partition the logic.

CTEs allow to split the logic into logical steps, allowing occasionally to troubleshoot the logic in pieces, though one should keep a balance between maintainability and detail. In extremis, people may create unnecessarily an additional CTE for each join. The longer and the more fragmented a query, the more difficult it becomes to troubleshoot and even understand. Readability can be better achieved through indentation, by keeping things together that belong together, respectively partitioning the logic in logical blocks that derive from the model. 

Keeping things together should be applied also to the other elements. For example, the join constraints should be limited only to the fields participating in the join (and, if possible, all the other constraints should be brought in the WHERE clause). Bringing the join constraints in the WHERE clause, an approach used in the past, decreases query readability no matter how well the WHERE clause is structured, and occasionally can lead to constraints’ duplication or worse, to missing pieces of logic. 
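
A short illustrative sketch of these guidelines, based on AdventureWorksLT-style tables (the logic itself is secondary): the join constraints stay in the ON clause, the other constraints go to the WHERE clause, and a CTE isolates a logical block:

-- logical block isolated in a CTE: amounts aggregated per order
WITH OrderAmounts AS (
    SELECT SOD.SalesOrderID
    , SUM(SOD.LineTotal) TotalAmount
    FROM SalesLT.SalesOrderDetail SOD
    GROUP BY SOD.SalesOrderID
)
SELECT SOH.SalesOrderNumber
, CST.CompanyName
, ORD.TotalAmount
FROM SalesLT.SalesOrderHeader SOH
     JOIN SalesLT.Customer CST -- tables belonging to the same entity kept together
       ON SOH.CustomerID = CST.CustomerID -- only the join constraints in the ON clause
     JOIN OrderAmounts ORD
       ON SOH.SalesOrderID = ORD.SalesOrderID
WHERE SOH.OrderDate >= '20080601' -- the other constraints in the WHERE clause
ORDER BY SOH.SalesOrderNumber;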

The order of the attributes should follow a logic that makes sense. One approach is to start from the tables with lowest cardinality that identify entities uniquely and move to the attributes that have a higher level of detail. However, not all attributes are equally important, and thus one might have to compromise and create multiple groups of data coming from different levels. 

One should keep in mind that the more random the order of the attributes is, the more difficult it becomes to validate the data, as one needs to jump multiple times either in the query or in the screen/report. Ideally one should find a balance between the two perspectives. Having an intuitive logic of how the attributes are listed increases queries’ readability, maintainability and troubleshooting. The more random the attributes’ listing, the higher the effort required for the same. 


06 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VII: Things That Don't Work) [new feature]

Microsoft does a relatively good job in documenting what doesn't work in Microsoft Fabric's SQL databases. There's a good overview available already in the documentation, though beyond this the current post lists my findings while testing the code previously written on this blog.

USE Database

The standard syntax allows to change via USE the database context to the specified database or database snapshot. Unfortunately, this syntax doesn't seem to be supported currently, and unfortunately many scripts seem to abuse it. Thus, the following lines of code throw an error:

-- changing the context
USE master;
GO
USE tempdb;

"Msg 40508, Level 16, State 1, Line 1, USE statement is not supported to switch between databases. Use a new connection to connect to a different database"

However, one can use the 3-part naming convention to reference the various objects:

-- sys metadata - retrieving the database files

SELECT *
FROM tempdb.sys.database_files dbf
ORDER BY name;

Even if the tempdb is not listed in the sys.databases table, it's still available for querying, which can prove helpful for troubleshooting. 

DBCC commands 

The documentation warns that some DBCC commands won't work, though in some cases there are also alternatives. For example:

-- clearing the procedure cache via DBCC
DBCC FREEPROCCACHE;
Output:
"Msg 2571, Level 14, State 9, Line 1, User '<user>' does not have permission to run DBCC freeproccache."

Alternatively, one can use the following command, which seems to work:

-- clearing the procedure cache via ALTER
ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;

CHECKDB, which checks the logical and physical integrity of all the objects in the specified database, can't be used either:
 
-- Checking the logical and physical integrity of a database
DBCC CHECKDB();
Output:
"Msg 916, Level 14, State 2, Line 1, The server principal "..." is not able to access the database "..." under the current security context."
The same error message is received for CHECKTABLE, a utility which checks the integrity of all the pages and structures that make up the table (or indexed view):

-- checking a table's integrity
DBCC CHECKTABLE ('SalesLT.Address');
Output:
"Msg 916, Level 14, State 2, Line 2, The server principal "..." is not able to access the database "..." under the current security context."
A similar error message is received for SQLPERF, which provides transaction log space usage statistics for all databases:

-- retrieving the LOGSPACE information for all databases
DBCC SQLPERF (LOGSPACE);
Output: 
"Msg 297, Level 16, State 10, Line 1, The user does not have permission to perform this action."
There are however DBCC commands like SHOW_STATISTICS or SHRINKDATABASE which do work. 
 
-- current query optimization statistics
DBCC SHOW_STATISTICS('SalesLT.Address','PK_Address_AddressID');
Output:
Name Updated Rows Rows Sampled Steps Density Average key length String Index Filter Expression Unfiltered Rows Persisted Sample Percent
PK_Address_AddressID Dec 21 2024 3:02AM 450 450 197 1 4 NO 450 0

SHRINKDATABASE shrinks the size of the data and log files in the specified database:

-- shrinking database
DBCC SHRINKDATABASE([AdventureWorks01-...]) WITH NO_INFOMSGS;

Update 29-Jan-2025: According to an answer from Ask the Expert session on Fabric Database [3], Microsoft seems to be working in bringing more DBCC features to SQL databases.

Happy coding!


References:
[1] Microsoft Learn (2024) SQL Server: USE <database> [link]
[2] Microsoft Learn (2024) Database console commands [link]
[3] Microsoft Reactor (2025) Ask The Expert - Fabric Edition - Fabric Databases [link]

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VI: Index Usage Analysis) [new feature]

There are several system dynamic management views (DMVs) available in SQL Server, Azure SQL Database and now in SQL databases that allow to gather more information about indexes' fragmentation and usage. Let's look at the most important information available based on the indexes created in the previous posts. As the data were probably purged from the views, the SELECT queries based on SalesLT.Product from the previous post need to be run first. This step is important, otherwise the DMVs might return no records!

One starting point is to use the sys.dm_db_index_physical_stats DMV to look at the indexes' size and fragmentation information for a given table (or view). The view is usually used as a starting point for analyzing the indexes' fragmentation and then defragmenting the indexes with high fragmentation.

-- sys metadata - index & data size and fragmentation information for the data and indexes of the specified table or view
SELECT --db_name() db_name
--, object_name(IND.object_id) table_name
 IND.name index_name
, IND.type_desc
, IPS.page_count
, IPS.record_count
, IPS.index_level
, Cast(IPS.avg_fragmentation_in_percent as decimal(10,2)) avg_fragmentation_perc
, Cast(IPS.avg_page_space_used_in_percent as decimal(10,2)) space_used_perc
--, IPS.*
FROM sys.indexes IND
     CROSS APPLY sys.dm_db_index_physical_stats(DB_ID(), IND.object_id, IND.index_id, NULL, 'DETAILED') IPS
WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product');
Output:
index_name type_desc page_count record_count index_level avg_fragmentation_perc space_used_perc
PK_Product_ProductID CLUSTERED 101 295 0 0.99 87.90
PK_Product_ProductID CLUSTERED 1 101 1 0.00 16.20
AK_Product_rowguid NONCLUSTERED 2 295 0 50.00 74.69
AK_Product_rowguid NONCLUSTERED 1 2 1 0.00 0.59
AK_Product_ProductNumber NONCLUSTERED 2 295 0 50.00 85.79
AK_Product_ProductNumber NONCLUSTERED 1 2 1 0.00 0.49
AK_Product_Name NONCLUSTERED 3 295 0 33.33 87.32
AK_Product_Name NONCLUSTERED 1 3 1 0.00 1.67
IX_SalesLT_Product_Color NONCLUSTERED 1 295 0 0.00 79.24
IX_SalesLT_Product_Color_Size NONCLUSTERED 1 295 0 0.00 94.12
IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 4 295 0 0.00 86.60
IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 1 4 1 0.00 1.01
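
Based on the avg_fragmentation_perc values above, the indexes with high fragmentation could then be defragmented. A minimal sketch, assuming the standard ALTER INDEX syntax is available in SQL databases (in line with the manual CREATE/DROP INDEX tests from the previous post):

-- reorganize an index with light to moderate fragmentation (e.g. 5-30%)
ALTER INDEX AK_Product_Name ON SalesLT.Product REORGANIZE;

-- rebuild an index with heavy fragmentation (e.g. above 30%)
ALTER INDEX AK_Product_Name ON SalesLT.Product REBUILD;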

In a second step one can look at the sys.dm_db_index_usage_stats DMV which provides the counts of the different types of index operations and the time each type of operation was last performed:

-- sys metadata - counts of different types of index operations and the time each type of operation was last performed.
SELECT -- db_name() db_name
--, object_name(IND.object_id) table_name
 IND.name
, IND.type_desc
, IUS.user_seeks 
, IUS.user_scans
, IUS.user_lookups 
, IUS.user_updates
, IUS.last_user_seek
, IUS.last_user_scan 
, IUS.last_user_lookup
, IUS.last_user_update
FROM sys.dm_db_index_usage_stats IUS
     JOIN sys.indexes IND
       ON IUS.object_id = IND.object_id
      AND IUS.index_id = IND.index_id
WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product');
Output:
name type_desc user_seeks user_scans user_lookups user_updates last_user_seek last_user_scan last_user_lookup
PK_Product_ProductID CLUSTERED 0 10 15 0 2025-01-06T14:23:54 2025-01-06T14:23:54
IX_SalesLT_Product_Color_Size NONCLUSTERED 11 0 0 0 2025-01-06T14:23:54
IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 8 0 0 0 2025-01-06T13:38:03

Finally, it might be useful to look also at the sys.dm_db_index_operational_stats DMV which returns the current lower-level I/O, locking, latching, and access method activity for each partition of a table or index in the database (see the documentation for the full list of attributes):

-- sys metadata - index operations stats
SELECT -- db_name() db_name
--, object_name(IND.object_id) table_name
 IND.name index_name
, IND.type_desc
, IOS.range_scan_count
, IOS.singleton_lookup_count
, IOS.leaf_insert_count
, IOS.leaf_delete_count
, IOS.leaf_update_count
, IOS.nonleaf_insert_count
, IOS.nonleaf_delete_count
, IOS.nonleaf_update_count
FROM sys.indexes IND
     CROSS APPLY sys.dm_db_index_operational_stats(DB_ID(), IND.object_id, IND.index_id, NULL) IOS
WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product')
 AND IOS.range_scan_count<>0
ORDER BY IND.name;
Output:
index_name type_desc range_scan_count singleton_lookup_count leaf_insert_count leaf_delete_count leaf_update_count nonleaf_insert_count nonleaf_delete_count nonleaf_update_count
IX_SalesLT_Product_Color_Size NONCLUSTERED 11 0 0 0 0 0 0 0
IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 8 0 0 0 0 0 0 0
PK_Product_ProductID CLUSTERED 10 64 0 0 0 0 0 0

For more information on these DMVs check the documentation.

Happy coding!


References:
[1] Microsoft Learn (2024) SQL Server: sys.dm_db_index_physical_stats [link]
[2] Microsoft Learn (2024) SQL Server: sys.dm_db_index_usage_stats [link]
[3] Microsoft Learn (2024) SQL Server: sys.dm_db_index_operational_stats [link]

04 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part V: Manual Index Maintenance) [new feature]

Indexes' maintenance in Microsoft Fabric's SQL databases is supposed to happen automatically in the background via the automatic tuning options feature, though the whole functionality is still in its early phases, and therefore many questions regarding the whole process may arise. Probably the most important question is whether indexes can still be created, respectively maintained, manually. That's useful for temporary or even periodic workloads, for which organizations might still want to maintain the indexes themselves. 

The tests made below are based on the SalesLT.Product table from the AdventureWorks database available in Microsoft Fabric. The target was to create several indexes that could be used for the various testing purposes. Each set of the below scripts was run 5-10 times until records appeared in the sys.dm_db_missing_index_details view for each test case (see further below):

-- batch 1: filter on single column (to be run 5-10 times)
SELECT *
FROM SalesLT.Product 
WHERE Color = 'Red'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'Black'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'White'

-- batch 2: filter on two columns (to be run 5-10 times)
SELECT *
FROM SalesLT.Product 
WHERE Color = 'Red'
  AND Size = '58'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'Black'
  AND Size = '58'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'White'
     AND Size = '58'

-- batch 3: filter with column selection (to be run 5-10 times)
SELECT ProductNumber, Name, Color, ListPrice
FROM SalesLT.Product 
WHERE ListPrice BETWEEN 50 AND 55

SELECT ProductNumber, Name, Color, ListPrice
FROM SalesLT.Product 
WHERE ListPrice BETWEEN 100 and 105

Once the scripts run, one can look at the records created in the above considered dynamic management view:

-- sys metadata -  missing indexes
SELECT MID.statement AS table_name
, MID.equality_columns
, MID.inequality_columns
, MID.included_columns
--, MIG.index_group_handle
--, MIG.index_handle
FROM sys.dm_db_missing_index_details MID 
    JOIN sys.dm_db_missing_index_groups MIG 
     ON MID.index_handle =  MIG.index_handle
ORDER BY MIG.index_group_handle
, MIG.index_handle
Output:
table_name equality_columns inequality_columns included_columns
[AdventureWorks01-...].[SalesLT].[Product] [Color]
[AdventureWorks01-...].[SalesLT].[Product] [Color], [Size]
[AdventureWorks01-...].[SalesLT].[Product] [ListPrice] [Name], [ProductNumber], [Color]

The next step is to create one of the indexes (please note that the database's name must be replaced accordingly, or only the 2-part naming convention - schema & table name - used):

-- create index on Color
CREATE INDEX IX_SalesLT_Product_Color 
ON [AdventureWorks01-...].[SalesLT].[Product] (Color);

Once the script was run, all the records related to SalesLT.Product disappeared from the dynamic management view. Therefore, it might be a good idea to take a snapshot of the view's data before creating any indexes manually. Probably the same behavior should be expected when the indexes are created by the system.
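
A minimal sketch for such a snapshot (the dbo.MissingIndexSnapshot table name is illustrative):

-- snapshot the missing index suggestions before creating the indexes manually
SELECT GetDate() capture_date
, MID.*
INTO dbo.MissingIndexSnapshot
FROM sys.dm_db_missing_index_details MID;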

-- create index on Color & Size
CREATE INDEX IX_SalesLT_Product_Color_Size
ON [SalesLT].[Product] (Color, Size);

-- create index on ListPrice with included columns
CREATE INDEX IX_SalesLT_Product_ListPrice_IC
ON [SalesLT].[Product] (ListPrice) INCLUDE(ProductNumber, Name, Color);

One can use the following query based on the meta.vIndexes (created in a previous post) to look at the indexes created:

-- sys metadata - index columns
SELECT IND.db_name
, IND.schema_name
, IND.table_name
, IND.index_name
, IND.index_type
, IND.principal_type
, IND.auto_created
FROM meta.vIndexes IND
WHERE IND.schema_name = 'SalesLT'
  AND IND.table_name = 'Product'
  AND IND.index_name IN ('IX_SalesLT_Product_Color','IX_SalesLT_Product_Color_Size'
,'IX_SalesLT_Product_ListPrice_IC')
ORDER BY IND.table_name
, IND.index_name
Output:
db_name schema_name table_name index_name index_type principal_type auto_created
AdventureWorks01-... SalesLT Product IX_SalesLT_Product_Color NONCLUSTERED S False
AdventureWorks01-... SalesLT Product IX_SalesLT_Product_Color_Size NONCLUSTERED S False
AdventureWorks01-... SalesLT Product IX_SalesLT_Product_ListPrice_IC NONCLUSTERED S False

Following this model, further indexes can be created as needed. It's always a good idea to take a "copy" of the indexes created (or keep a history of the scripts run for indexes' maintenance). This best practice is now more important, when the system can drop indexes as it sees fit. 

Don't forget to clean up the changes made if the indexes aren't needed anymore:

-- cleaning after
DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_Color;
DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_Color_Size;
DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_ListPrice_IC;

So, after these tests, the standard syntax for index maintenance seems to work also on SQL databases, with all the implications deriving from this (e.g. porting of scripts, database objects, etc.).

Happy coding!

