Showing posts with label Microsoft Fabric. Show all posts
Showing posts with label Microsoft Fabric. Show all posts

21 January 2025

🧊🗒️Data Warehousing: Extract, Transform, Load (ETL) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 21-Jan-2025

[Data Warehousing] Extract, Transform, Load (ETL)  

  • {def} automated process which takes raw data, extracts the data required for further processing, transforms it into a format that addresses business' needs, and loads it into the destination repository (e.g. data warehouse)
    • includes 
      • the transportation of data
      • overlaps between stages
      • changes in flow 
        • due to 
          • new technologies
          • changing requirements
      • changes in scope
      • troubleshooting
        • due to data mismatches
  • {step} extraction
    • data is extracted directly from the source systems or intermediate repositories
      • data may be made available in intermediate repositories, when the direct access to the source system is not possible
        •  this approach can add a complexity layer
    •  {substep} data validation
      • an automated process that validates whether data pulled from sources has the expected values
      • relies on a validation engine
        • rejects data if it falls outside the validation rules
        • analyzes rejected records on an ongoing basis to
          • identifies what went wrong
          • corrects the source data
          • modifies extraction to resolve the problem in the next batches
  • {step} transform
    • transforms the data, removing extraneous or erroneous data
    • applies business rules 
    • checks data integrity
      • ensures that the data is not corrupted in the source or corrupted by ETL
      • may ensure no data was dropped in previous stages
    • aggregates the data if necessary
  • {step} load
    • {substep} store the data into a staging layer
      • transformed data are not loaded directly into the target but staged into an intermediate layer (e.g. database)
      • {advantage} makes it easier to roll back, if something went wrong
      • {advantage} allows to develop the logic iteratively and publish the data only when needed 
      • {advantage} can be used to generate audit reports for 
        • regulatory compliance 
        • diagnose and repair of data problems
      • modern ETL process perform transformations in place, instead of in staging areas
    • {substep} publish the data to the target
      • loads the data into the target table(s)
      • {scenario} the existing data are overridden every time the ETL pipeline loads a new batch
        • this might happen daily, weekly, or monthly
      • {scenario} add new data without overriding
        • the timestamp can indicate the data is new
      • {recommendation} prevent the loading process to error out due to disk space and performance limitations
  • {approach} building an ETL infrastructure
    • involves integrating data from one or more data sources and testing the overall processes to ensure the data is processed correctly
      • recurring process
        • e.g. data used for reporting
      • one-time process
        • e.g. data migration
    • may involve 
      • multiple source or destination systems 
      • different types of data
        • e.g. reference, master and transactional data
        • ⇐ may have complex dependencies
      • different level of structuredness
        • e.g. structured, semistructured, nonstructured 
      • different data formats
      • data of different quality
      • different ownership 
  • {recommendation} consider ETL best practices
    • {best practice} define requirements upfront in a consolidated and consistent manner
      • allows to set clear expectations, consolidate the requirements, estimate the effort and costs, respectively get the sign-off
      • the requirements may involve all the aspects of the process
        • e.g. data extraction, data transformation, standard formatting, etc.
    • {best practice} define a high level strategy
      • allows to define the road ahead, risks and other aspects 
      • allows to provide transparency
      • this may be part of a broader strategy that can be referenced 
    • {best practice} align the requirements and various aspects to the existing strategies existing in the organization
      • allows to consolidate the various efforts and make sure that the objectives, goals and requirements are aligned
      • e.g. IT, business, Information Security, Data Management strategies
    • {best practice} define the scope upfront
      • allows to better estimate the effort and validate the outcomes
      • even if the scope may change in time, this allows to provide transparence and used as basis for the time and costs estimations
    • {best practice} manage the effort as a project and use a suitable Project Management methodology
      • allows to apply structured well-established PM practices 
      • it might be suited to adapt the methodology to project's particularities 
    • {best practice} convert data to standard formats to standardize data processing
      • allows to reduce the volume of issues resulted from data type mismatches
      • applies mainly to dates, numeric or other values for which can be defined standard formats
    • {best practice} clean the data in the source systems, when  cost-effective
      • allows to reduces the overall effort, especially when this is done in advance
      • this should be based ideally on the scope
    • {best practice} define and enforce data ownership
      • allows to enforce clear responsibilities across the various processes
      • allows to reduce the overall effort
    • {best practice} document data dependencies
      • document the dependencies existing in the data at the various levels
    • {best practice} protocol data movement from source(s) to destination(s) in term of data volume
      • allows to provide transparence into the data movement process
      • allows to identify gaps in the data or broader issues
      • can be used for troubleshooting and understanding the overall data growth
  • {recommendation} consider proven systems, architectures and methodologies
    • allows to minimize the overall effort and costs associated with the process
Previous Post <<||>> Next Post

20 January 2025

🏭🗒️Microsoft Fabric: [Azure] Service Principals (SPN) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 20-Jan-2025

[Azure] Service Principal (SPN)  

  • {def} a non-human, application-based security identity used by applications or automation tools to access specific Azure resources [1]
    • can be assigned precise permissions, making them perfect for automated processes or background services
      • allows to minimize the risks of human error and identity-based vulnerabilities
      • supported in datasets, Gen1/Gen2 dataflows, datamarts [2]
      • authentication type 
        • supported only by [2]
          • Azure Data Lake Storage
          • Azure Data Lake Storage Gen2
          • Azure Blob Storage
          • Azure Synapse Analytics
          • Azure SQL Database
          • Dataverse
          • SharePoint online
        • doesn’t support
          • SQL data source with Direct Query in datasets [2]
  • when registering a new application in Microsoft Entra ID, a SPN is automatically created for the app registration [4]
    • the access to resources is restricted by the roles assigned to the SPN
      • ⇒ gives control over which resources can be accessed and at which level [4]
    • {recommendation} use SPN with automated tools [4]
      • rather than allowing them to sign in with a user identity  [4]
    • {prerequisite} an active Microsoft Entra user account with sufficient permissions to 
      • register an application with the tenant [4]
      • assign to the application a role in the Azure subscription [4]
      •  requires Application.ReadWrite.All permission [4]
  • extended to support Fabric Data Warehouses [1]
    • {benefit} automation-friendly API Access
      • allows to create, update, read, and delete Warehouse items via Fabric REST APIs using service principals [1]
      • enables to automate repetitive tasks without relying on user credentials [1]
        • e.g. provisioning or managing warehouses
        • increases security by limiting human error
      • the warehouses thus created, will be displayed in the Workspace list view in Fabric UI, with the Owner name of the SPN [1]
      • applicable to users with administrator, member, or contributor workspace role [3]
      • minimizes risk
        • the warehouses created with delegated account or fixed identity (owner’s identity) will stop working when the owner leaves the organization [1]
          • Fabric requires the user to login every 30 days to ensure a valid token is provided for security reasons [1]
    • {benefit} seamless integration with Client Tools: 
      • tools like SSMS can connect to the Fabric DWH using SPN [1]
      • SPN provides secure access for developers to 
        • run COPY INTO
          • with and without firewall enabled storage [1]
        • run any T-SQL query programmatically on a schedule with ADF pipelines [1]
    • {benefit} granular access control
      • Warehouses can be shared with an SPN through the Fabric portal [1]
        • once shared, administrators can use T-SQL commands to assign specific permissions to SPN [1]
          • allows to control precisely which data and operations an SPN has access to  [1]
            • GRANT SELECT ON <table name> TO <Service principal name>  
      • warehouses' ownership can be changed from an SPN to user, and vice-versa [3]
    • {benefit} improved DevOps and CI/CD Integration
      • SPN can be used to automate the deployment and management of DWH resources [1]
        •  ensures faster, more reliable deployment processes while maintaining strong security postures [1]
    • {limitation} default semantic models are not supported for SPN created warehouses [3]
      • ⇒ features such as listing tables in dataset view, creating report from the default dataset don’t work [3]
    • {limitation} SPN for SQL analytics endpoints is not currently supported
    • {limitation} SPNs are currently not supported for COPY INTO error files [3]
      • ⇐ Entra ID credentials are not supported as well [3]
    • {limitation} SPNs are not supported for GIT APIs. SPN support exists only for Deployment pipeline APIs [3]
    • monitoring tools
      • [DMV] sys.dm_exec_sessions.login_name column [3] 
      • [Query Insights] queryinsights.exec_requests_history.login_name [3]
      • Query activity
        • submitter column in Fabric query activity [3]
      • Capacity metrics app: 
        • compute usage for warehouse operations performed by SPN appears as the Client ID under the User column in Background operations drill through table [3]

References:
[1] Microsoft Fabric Updates Blog (2024) Service principal support for Fabric Data Warehouse [link]
[2] Microsoft Fabric Learn (2024) Service principal support in Data Factory [link]
[3] Microsoft Fabric Learn (2024) Service principal in Fabric Data Warehouse [link
[4] Microsoft Fabric Learn (2024) Register a Microsoft Entra app and create a service principal [link]
[5] Microsoft Fabric Updates Blog (2024) Announcing Service Principal support for Fabric APIs [link
 
Acronyms:
ADF - Azure Data Factory
API - Application Programming Interface
CI/CD - Continuous Integration/Continuous Deployment
DMV - Dynamic Management View
DWH - Data Warehouse
SPN - service principal
SSMS - SQL Server Management Studio

06 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VII: Things That Don't Work) [new feature]

Microsoft does relatively a good job in documenting what doesn't work in Microsoft Fabric's SQL Databases. There's a good overview available already in the documentation, though beyond this the current post lists my finding while testing the previously written code on this blog,

USE Database

The standard syntax allows to change via USE the database context to the specified database or database snapshot. Unfortunately, this syntax doesn't seem to be supported currently and unfortunately many scripts seem to abuse of it. Thus, the following line of code throws an error:

-- changing the context
USE master;
GO
USE tempdb;

"Msg 40508, Level 16, State 1, Line 1, USE statement is not supported to switch between databases. Use a new connection to connect to a different database"

However, one can use the 3-part naming convention to reference the various objects:

-- sys metadata - retrieving the database files

SELECT *
FROM tempdb.sys.database_files dbf
ORDER BY name;

Even if the tempdb is not listed in the sys.databases table, it's still available for querying, which can prove helpful for troubleshooting. 

DBCC commands 

The documentation warns that some DBCC commands won't work, though in some cases there are also alternatives. For example:

-- clearing the procedure cache via DBCC
DBCC FREEPROCCACHE;
Output:
"Msg 2571, Level 14, State 9, Line 1, User '<user>' does not have permission to run DBCC freeproccache."

Alternatively, one can use the following command, which seems to work:

-- clearing the procedure cash via ALTER
ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;

CHECKDB, which checks the logical and physical integrity of all the objects in the specified database, can't be used as well:
 
-- Checking the logical and physical integrity of a database
DBCC CHECKDB();
Output:
"Msg 916, Level 14, State 2, Line 1, The server principal "..." is not able to access the database "..." under the current security context."

The same error message is received for CHECKTABLE, utility which checks the integrity of all the pages and structures that make up the table (or indexed view):

-- checking a table's integrity
DBCC CHECKTABLE ('SalesLT.Address');
Output:
"Msg 916, Level 14, State 2, Line 2, The server principal "..." is not able to access the database "..." under the current security context."

A similar error messages is received for SQLPERF, which provides transaction log space usage statistics for all databases:

-- retrieving the LOGSPACE information for all databases
DBCC SQLPERF (LOGSPACE);
Output: 
"Msg 297, Level 16, State 10, Line 1, The user does not have permission to perform this action."

There are however DBCC commands like SHOW_STATISTICS or SHRINKDATABASE which do work. 
 
-- current query optimization statistics
DBCC SHOW_STATISTICS('SalesLT.Address','PK_Address_AddressID');
Output:
Name Updated Rows Rows Sampled Steps Density Average key length String Index Filter Expression Unfiltered Rows Persisted Sample Percent
PK_Address_AddressID Dec 21 2024 3:02AM 450 450 197 1 4 NO 450 0

SHRINKDATABASE shrinks the size of the data and log files in the specified database:

-- shrinking database
DBCC SHRINKDATABASE([AdventureWorks01-...]) WITH NO_INFOMSGS;

To be updated...

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) SQL Server: USE <database> [link]
[2] Microsoft Learn (2024) Database console commands [link]

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VI: Index Usage Analysis) [new feature]

There are several system dynamic management views (DMV) available in SQL Server, Azure SQL Server and now in SQL databases that allow to gather more information about indexes' fragmentation and usage. Let's look at the most important information available based on the indexes create in the previous posts. As the data were probably purged from the views, it's needed to run first the select queries based on the SalesLT.Product from the previous post. This step is important, otherwise the DMVs might return no records!

One starting point is to use the sys.dm_db_index_physical_stats DMV to look at the indexes' size and fragmentation information for a given table (or view). The table is used usually as starting point for analyzing indexes' fragmentation and then defragment the indexes with high fragmentation.

-- sys metadata - index & data size and fragmentation information for the data and indexes of the specified table or view
SELECT --db_name() db_name
--, object_name(IND.object_id) table_name
 IND.name index_name
, IND.type_desc
, IPS.page_count
, IPS.record_count
, IPS.index_level
, Cast(IPS.avg_fragmentation_in_percent as decimal(10,2)) avg_fragmentation_perc
, Cast(IPS.avg_page_space_used_in_percent as decimal(10,2)) space_used_perc
--, IPS.*
FROM sys.indexes IND
     CROSS APPLY sys.dm_db_index_physical_stats(DB_ID(), IND.object_id, IND.index_id, NULL, 'DETAILED') IPS
WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product');
Output:
index_name type_desc page_count record_count index_level avg_fragmentation_perc space_used_perc
PK_Product_ProductID CLUSTERED 101 295 0 0.99 87.90
PK_Product_ProductID CLUSTERED 1 101 1 0.00 16.20
AK_Product_rowguid NONCLUSTERED 2 295 0 50.00 74.69
AK_Product_rowguid NONCLUSTERED 1 2 1 0.00 0.59
AK_Product_ProductNumber NONCLUSTERED 2 295 0 50.00 85.79
AK_Product_ProductNumber NONCLUSTERED 1 2 1 0.00 0.49
AK_Product_Name NONCLUSTERED 3 295 0 33.33 87.32
AK_Product_Name NONCLUSTERED 1 3 1 0.00 1.67
IX_SalesLT_Product_Color NONCLUSTERED 1 295 0 0.00 79.24
IX_SalesLT_Product_Color_Size NONCLUSTERED 1 295 0 0.00 94.12
IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 4 295 0 0.00 86.60
IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 1 4 1 0.00 1.01

In a second step one can look at the sys.dm_db_index_usage_stats DMV which provides the counts of the different types of index operations and the time each type of operation was last performed:

-- sys metadata - counts of different types of index operations and the time each type of operation was last performed.
SELECT -- db_name() db_name
--, object_name(IND.object_id) table_name
 IND.name
, IND.type_desc
, IUS.user_seeks 
, IUS.user_scans
, IUS.user_lookups 
, IUS.user_updates
, IUS.last_user_seek
, IUS.last_user_scan 
, IUS.last_user_lookup
, IUS.last_user_update
FROM sys.dm_db_index_usage_stats IUS
     JOIN sys.indexes IND
       ON IUS.index_id = IND.index_id
WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product');
Output:
name type_desc user_seeks user_scans user_lookups user_updates last_user_seek last_user_scan last_user_lookup
PK_Product_ProductID CLUSTERED 0 10 15 0 2025-01-06T14:23:54 2025-01-06T14:23:54
IX_SalesLT_Product_Color_Size NONCLUSTERED 11 0 0 0 2025-01-06T14:23:54
IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 8 0 0 0 2025-01-06T13:38:03

Finally, it might be useful to look also at the sys.dm_db_index_operational_stats DMV which returns the current lower-level I/O, locking, latching, and access method activity for each partition of a table or index in the database (see the documentation for the full list of attrbutes):

-- sys metadata - index operations stats
SELECT -- db_name() db_name
--, object_name(IND.object_id) table_name
 IND.name index_name
, IND.type_desc
, IOS.range_scan_count
, IOS.singleton_lookup_count
, IOS.leaf_insert_count
, IOS.leaf_delete_count
, IOS.leaf_update_count
, IOS.nonleaf_insert_count
, IOS.nonleaf_delete_count
, IOS.nonleaf_update_count
FROM sys.indexes IND
     CROSS APPLY sys.dm_db_index_operational_stats(DB_ID(), IND.object_id, IND.index_id, NULL) IOS
WHERE IND.object_id = OBJECT_ID(N'SalesLT.Product')
 AND IOS.range_scan_count<>0
ORDER BY IND.name;
Output:
index_name type_desc range_scan_count singleton_lookup_count leaf_insert_count leaf_delete_count leaf_update_count nonleaf_insert_count nonleaf_delete_count nonleaf_update_count
IX_SalesLT_Product_Color_Size NONCLUSTERED 11 0 0 0 0 0 0 0
IX_SalesLT_Product_ListPrice_IC NONCLUSTERED 8 0 0 0 0 0 0 0
PK_Product_ProductID CLUSTERED 10 64 0 0 0 0 0 0

For more information on these DMVs check the documentation.

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) SQL Server: sys.dm_db_index_physical_stats [link]
[2] Microsoft Learn (2024) SQL Server: sys.dm_db_index_usage_stats [link]
[3] Microsoft Learn (2024) SQL Server: sys.dm_db_index_operational_stats [link]

04 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part V: Manual Index Maintenance) [new feature]

Indexes' maintenance in Microsoft Fabric's SQL databases is supposed to happen automatically in the background via automatic tuning options feature, though the whole functionality is still in its early phases, and therefore many questions regarding the whole process may arise. Probably the most important question is whether indexes can still be created, respectively maintained manually. That's useful for temporary or even periodic workloads, where maybe organizations might still want to maintain indexes manually. 

The tests made below are based on the SalesLT.Product from AdventureWorkds database available in Microsoft Fabric. The target was to create several indexes that could be used for the various testing purposes. Each set of the below scripts was run 5-10 times until records appeared in the sys.dm_db_missing_index_details table for each test case (see further below):

-- batch 1: filter on single column (to be run 5-10 times)
SELECT *
FROM SalesLT.Product 
WHERE Color = 'Red'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'Black'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'White'

-- batch 2: filter on two columns (to be run 5-10 times)
SELECT *
FROM SalesLT.Product 
WHERE Color = 'Red'
  AND Size = '58'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'Black'
  AND Size = '58'

SELECT *
FROM SalesLT.Product 
WHERE Color = 'White'
     AND Size = '58'

-- batch 3: filter with column selection (to be run 5-10 times)
SELECT ProductNumber, Name, Color, ListPrice
FROM SalesLT.Product 
WHERE ListPrice BETWEEN 50 AND 55

SELECT ProductNumber, Name, Color, ListPrice
FROM SalesLT.Product 
WHERE ListPrice BETWEEN 100 and 105

Once the scripts run, one can look at the records created in the above considered dynamic management view:

-- sys metadata -  missing indexes
SELECT MID.statement AS table_name
, MID.equality_columns
, MID.inequality_columns
, MID.included_columns
--, MIG.index_group_handle
--, MIG.index_handle
FROM sys.dm_db_missing_index_details MID 
    JOIN sys.dm_db_missing_index_groups MIG 
     ON MID.index_handle =  MIG.index_handle
ORDER BY MIG.index_group_handle
, MIG.index_handle
Output:
table_name equality_columns inequality_columns included_columns
[AdventureWorks01-...].[SalesLT].[Product] [Color]
[AdventureWorks01-...].[SalesLT].[Product] [Color], [Size]
[AdventureWorks01-...].[SalesLT].[Product] [ListPrice] [Name], [ProductNumber], [Color]

The next step is to create one of the indexes (please note that database's name must be replaced accordingly or used only the 2-part naming convention - schema & table name ):

-- create index on Color
CREATE INDEX IX_SalesLT_Product_Color 
ON [AdventureWorks01-...].[SalesLT].[Product] (Color);

Once the script was run, all the records related to the SalesLT.Product disappeared from the dynamic management view. Therefore, it might be a good idea to take a snapshot with view's data before creating any indexes manually. Probably the same behavior should be expected when the indexes are created by the system.

-- create index on Color & Size
CREATE INDEX IX_SalesLT_Product_Color_Size
ON [SalesLT].[Product] (Color, Size);

-- create index on ListPrice with included columns
CREATE INDEX IX_SalesLT_Product_ListPrice_IC
ON [SalesLT].[Product] (ListPrice) INCLUDE(ProductNumber, Name, Color);

One can use the following query based on the meta.vIndexes (created in a previous post) to look at the indexes created:

-- sys metadata - index columns
SELECt IND.db_name
, IND.schema_name
, IND.table_name
, IND.index_name
, IND.index_type
, IND.principal_type
, IND.auto_created
FROM meta.vIndexes IND
WHERE IND.schema_name = 'SalesLT'
  AND IND.table_name = 'Product'
  AND IND.index_name IN ('IX_SalesLT_Product_Color ','IX_SalesLT_Product_Color_Size'
,'IX_SalesLT_Product_ListPrice_IC')
ORDER BY IND.table_name
, IND.index_name
Output:
db_name schema_name table_name index_name index_type principal_type auto_created
AdventureWorks01-... SalesLT Product IX_SalesLT_Product_Color NONCLUSTERED S False
AdventureWorks01-... SalesLT Product IX_SalesLT_Product_Color_Size NONCLUSTERED S False
AdventureWorks01-... SalesLT Product IX_SalesLT_Product_ListPrice_IC NONCLUSTERED S False

After this model can be created further indexes as needed. It's always a good idea to take a "copy" of the indexes created (or keep a history of the scripts run for indexes' maintenance). This best practice is now more important, when the system can drop indexes as it considers fit. 

Don't forget to clean up the changes made if the indexes aren't needed anymore:

-- cleaning after
DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_Color;
DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_Color_Size;
DROP INDEX IF EXISTS SalesLT.IX_SalesLT_Product_ListPrice_IC;

So, after these tests, the standard syntax for index's maintenance seems to work also on SQL databases, with all the implications deriving from this (e.g. porting of scripts, database objects, etc.)

Happy coding!

Previous Post <<||>> Next Post

01 January 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part IV: Automatic Tuning Options) [new feature]

Automatic tuning in SQL databases, respectively in SQL Azure datsbases, is a fully managed performance service that uses built-in intelligence to continuously monitor queries executed and automatically improve their performance [1]. At least in SQL databases the target is to fully automate indexes' maintenance by providing automated tuning, verification of performance gains, rollback and self-correction, respectively tuning history. The future sounds promising, though the question is what's already available. 

The documentation references several objects made available already for this feature. sys.database_automatic_tuning_options returns the tuning options available for the database:

-- SQL databases - automatic tuning options
SELECT ATO.name
, ATO.desired_state
, ATO.desired_state_desc
, ATO.actual_state
, ATO.actual_state_desc
, ATO.reason
, ATO.reason_desc
FROM sys.database_automatic_tuning_options ATO
ORDER BY ATO.name
Output:
name desired_state desired_state_desc actual_state actual_state_desc reason reason_desc
CREATE_INDEX 2 DEFAULT 1 ON 3 INHERITED_FROM_SERVER
DROP_INDEX 2 DEFAULT 1 ON 3 INHERITED_FROM_SERVER
FORCE_LAST_GOOD_PLAN 2 DEFAULT 1 ON 3 INHERITED_FROM_SERVER
MAINTAIN_INDEX 2 DEFAULT 0 OFF 3 INHERITED_FROM_SERVER

For further information see the automatic tuning options in the documentation [2], respectively [1] for an overview of the feature. 

There's a sys.database_automatic_tuning_mode and a sys.database_automatic_tuning_configurations, though on the SQL database instance only the first has records.

-- SQL databases - automatic tuning mode
SELECT ATM.desired_state
, ATM.desired_state_desc
, ATM.actual_state
, ATM.actual_state_desc
FROM sys.database_automatic_tuning_mode ATM
Output:
desired_state desired_state_desc actual_state actual_state_desc
1 INHERIT 3 AUTO

Attempting to modify the above settings at database level via the ALTER DATABASE leads to the following error message: 

-- disabling database properties
ALTER DATABASE [AdventureWorks01]
SET AUTOMATIC_TUNING ( FORCE_LAST_GOOD_PLAN = OFF);
Output:
"Msg 16202, Level 16, State 162, Line 1, Keyword or statement option 'SET' is not supported on the 'Microsoft Fabric' platform."

At least for the moment there seems to be no features available to change these settings. 

The automatic index tab report trom the Performance Dashboard for SQL database shows the history and status of automatically created indexes [5]. See Home toolbar in the Query Editor window >> Performance summary >> Automatic indexes. 

What it's not clear is how the database engine balances between index coverage and performance. How long does it take until the engine identifies the first missing index scenario and index's creation, respectively between the drop of an unused index and reoccurence of the same scenario that lead to index's creation. Moreover, what happens with index fragmentation? The documentation doesn't seem to provide the answers that probably only the hand-on experience can provide. Even if AI-based features are used in indexes' maintenance, it's still hard to grasp what lies beyond the various features. 

Are DBAs comfortable enough to relinquish control over index maintenence? It will be interesting to see their feedback. Probably, more control over what the engine does is needed, as sometimes it's enough to have 2-3 major exceptions for a solution to become not feasible. Usually the devil lies in details. 

Migrating from SQL Server or Azure SQL to SQL databases requires some degree of reengineering, probably with more effort and redesign for the first scenario, given the functionality gap. 

Taking a look at the indexes already available in AdventureWorks database, there seem no new indexes created since database's creation (see the definition of the referenced object):

-- sys metadata - index columns
SELECt IND.db_name
, IND.schema_name
, IND.table_name
, IND.index_name
, IND.index_type
, IND.principal_type
, IND.auto_created
FROM meta.vIndexes IND
WHERE IND.schema_name = 'SalesLT'
  AND IND.auto_created = 1
ORDER BY IND.table_name
, IND.index_name

Stay tuned! More tests on the way... 

Happy coding! 

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Automatic tuning [link]
[2] Microsoft Learn (2024) Performance Dashboard for SQL database in Microsoft Fabric [link]
[3] Microsoft Learn (2024) ALTER DATABASE SET options (Transact-SQL) [link]
[4] Microsoft Learn (2024) Enable automatic tuning in the Azure portal to monitor queries and improve workload performance [link]

25 December 2024

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part II: Under the Hood) [new feature]

The first tests performed with Microsoft Fabric's SQL databases (see previous post) revealed that with small exceptions many of the objects and administration scripts I built for SQL Server 2005 up to 2022 worked until now with minimal changes, though the tests made focused only on the standard basic functionality - the use of user-defined functions, stored procedures, views, respectively DMVs. As usual, the devil probably hides in details.

First of all, one can easily check current database's configuration settings via the sys.configurations:

-- database configuration values
SELECT cnf.configuration_id
, cnf.name
, cnf.description
, cnf.value 
, cnf.minimum 
, cnf.maximum 
, cnf.value_in_use 
, cnf.is_dynamic 
, cnf.is_advanced 
FROM sys.configurations cnf
WHERE value <> 0 
--WHERE name LIKE 'data%'
ORDER BY cnf.name 

One can look also at the database-scoped configuration via the sys.database_scoped_configurations, especially at the records whose value deviates from the default:

-- SQL database - database scoped configuration
SELECT name
, value
, value_for_secondary
, is_value_default 
FROM sys.database_scoped_configurations
WHERE is_value_default <>1 --search for deviations from default
--WHERE name=N'MAXDOP' -- search for individual values
ORDER BY name
Output:
name value value_for_secondary is_value_default
MAXDOP 8 False

Querying the sys.databases revealed that there seems to be only one system database available, namely the master database used to retrieve the definition of the system objects. In what concerns the various properties, it's probably important to check the compatibility level, the collation name, and other settings that apply:

-- database information
SELECT db.database_id
, db.[name] AS database_name
, db.recovery_model_desc 
, db.log_reuse_wait_desc 
, db.[compatibility_level] 
, db.page_verify_option_desc 
, db.collation_name
, db.user_access_desc
, db.is_change_feed_enabled
, db.is_data_lake_replication_enabled
FROM sys.databases AS db
Output:
database_id database_name recovery_model_desc log_reuse_wait_desc compatibility_level page_verify_option_desc collation_name user_access_desc is_change_feed_enabled is_data_lake_replication_enabled
1 master FULL NOTHING 160 CHECKSUM SQL_Latin1_General_CP1_CI_AS MULTI_USER False False
26 AdventureWorks01-... FULL NOTHING 160 CHECKSUM SQL_Latin1_General_CP1_CI_AS MULTI_USER True True

Unfortunately, it's not possible to change a database's collation, though other collations on individual columns are supported [2]. Compared to the standard SQL Server, it's not possible to change the compatibility level to a previous value. It will be interesting to see whether an SQL database will use always the latest version of SQL Server or whether future compatibility levels are supported in parallel.

There is no tempdb listed, though querying directly the objects from tempdb by using 3-part references allows to retrieve their content. For example, the following query retrieves the various database files available currently:

-- SQL database: tempdb files 
SELECT dbf.file_id
, dbf.name file_name
--, dbf.physical_name
, dsp.name file_group
--, type 
, dbf.type_desc file_type
--, dbf.growth growth_kb
, Cast(dbf.growth/128.0  as decimal(18,2)) growth_mb
--, dbf.is_percent_growth
--, dbf.max_size max_size_kb
, Cast(NullIf(dbf.max_size, -1)/128.0  as decimal(18,2)) max_size_mb
--, dbf.size file_size_kb
, Cast(dbf.size/128.0 as decimal(18,2)) file_size_mb
, dbf.state_desc 
, dbf.is_read_only 
FROM tempdb.sys.database_files dbf
     LEFT JOIN tempdb.sys.data_spaces dsp
       ON dbf.data_space_id = dsp.data_space_id
ORDER BY dbf.Name
Output:
file_id file_name file_group file_type growth_mb max_size_mb file_size_mb state_desc is_read_only
1 tempdev PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
11 tempdev10 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
12 tempdev11 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
13 tempdev12 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
14 tempdev13 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
15 tempdev14 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
16 tempdev15 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
17 tempdev16 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
3 tempdev2 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
4 tempdev3 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
5 tempdev4 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
6 tempdev5 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
7 tempdev6 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
8 tempdev7 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
9 tempdev8 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
10 tempdev9 PRIMARY ROWS 256.00 32768.00 16.00 ONLINE False
2 templog LOG 64.00 216256.00 16.00 ONLINE False

By removing the reference to tempdb from the query, one can get the same information for the current database. For the SalesLT were created only two database and one log file. 

There seems to be no msdb database, used primarily by the SQL Server Agent for scheduling alerts and jobs. This implies that the functionality based on it won't work and one needs to find alternatives!

Looking at the feature comparison from [2], the most important are the lack of support for always encrypted, auditing, bulk inserts, change data capture (CDC), elastic queries, full-text search, in-memory optimization, ledger, OPENROWSET. In addition it supports cross-database three-part name queries via the SQL analytics endpoint [2].

Until now, the most annoying fact is that in the web UI results are returned in different "pages", and thus makes it a bit more challenging to navigate the output. However, using a cursor to iterate through the results and saving the content to a table solves the problem (see link). The last query from the post was used to retrieve the system objects together with the number of records returned. 

Happy coding!

Previous Post <<||>> Next Post

References:
[1] Microsoft Learn (2024) Frequently asked questions for SQL database in Microsoft Fabric (preview) [link]
[2] Microsoft Learn (2024) Features comparison: Azure SQL Database and SQL database in Microsoft Fabric (preview) [link]

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.