Showing posts with label Metadata. Show all posts

13 February 2025

🏭💠🗒️Microsoft Fabric: SQL Analytics Endpoint [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 12-Feb-2025

[Microsoft Fabric] SQL Analytics Endpoint

  • {def} a service that actively listens for requests, providing a SQL-based experience for lakehouse Delta tables [1]
    • enables querying the data in the lakehouse using the T-SQL language and the TDS protocol [1] (see the query sketch after these notes)
    • created by default for each lakehouse and mirrored database provisioned in a workspace
      • every lakehouse has one SQL analytics endpoint [1]
    • manages the automatically generated tables so the workspace users can't modify them [1]
  • {feature} a background process is responsible for 
    • scanning lakehouse for changes [1]
      • automatically generates a table in the appropriate schema for every Delta table in the Lakehouse
        • tables are created with a minor delay [1]
        • data is refreshed automatically
          • the amount of time it takes to refresh the table is related to how optimized the Delta tables are [1]
        • {alternative} one can manually force a refresh of the automatic metadata scanning in the Fabric portal [1]
    • keeping SQL analytics endpoint up-to-date for all the changes committed to lakehouses in a workspace [1]
      • {characteristic} transparently managed
        • ⇐ the sync process is transparently managed by Microsoft Fabric platform [1]
        • when a change is detected in the lakehouse
          • a background process updates metadata [1]
          • the SQL analytics endpoint reflects the changes committed to lakehouse tables [1]
      • {characteristic} low latency
        • under normal operating conditions, the lag between a lakehouse and SQL analytics endpoint is less than one minute [1]
        • the actual length of time can vary from a few seconds to minutes [1]
      • the metadata synchronization is automatically triggered when the SQL Endpoint is opened [3]
        • ensures the SQL Analytics Endpoint remains up to date without the need for manual syncs [3]
          •  ⇐ provides a seamless experience [3]
  • {feature} can be enriched by adding database objects
    • schemas
    • views
    • procedures
    • other database objects
  • {feature} automatic metadata discovery
    • tracks changes committed to lakehouses [1]
      • is a single instance per Fabric workspace [1]
      • {issue} increased latency when syncing changes between lakehouses and the SQL analytics endpoint; it can be caused by a large number of lakehouses in one workspace [1]
        • {resolution} migrate each lakehouse to a separate workspace [1]
          • allows automatic metadata discovery to scale [1]
      • {issue} changes committed to a lakehouse are not visible in the associated SQL analytics endpoint
        • e.g. a table newly created in the lakehouse [1]
        • e.g. many committed rows that are not yet visible in the endpoint [1]
        • ⇐ may be impacted by ETL processing that generates large volumes of changes
        • {recommendation} initiate an on-demand metadata sync, triggered from the SQL query editor Refresh ribbon option [1]
          • forces an on-demand metadata sync, rather than waiting on the background metadata sync to finish [1]
      • {issue} if no maintenance is scheduled for the parquet files, this can result in read overhead, which impacts the time it takes to sync changes to the SQL analytics endpoint [1]
        • {recommendation} schedule regular lakehouse table maintenance operations [1]
      • {limitation} not all Delta features are understood by the automatic sync process [1]
  • {feature} endpoint reprovisioning
    • if the initial provisioning attempt fails, users have the option to try again [2]
      • ⇐ without the need to create an entirely new lakehouse [2]
      • {benefit} empowers users to self-mitigate provisioning issues in a convenient way in the UI, avoiding the need for a complete lakehouse re-creation [2]
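
Since the endpoint speaks TDS and the generated tables are read-only, they can be queried with plain T-SQL from any SQL client. A minimal sketch, assuming a hypothetical dbo.Assets table generated from a lakehouse Delta table:

-- query a Delta table exposed by the SQL analytics endpoint (read-only)
SELECT TOP 10 *
FROM dbo.Assets;

-- list the tables currently exposed by the endpoint
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE';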

References:
[1] Microsoft Learn (2024) Microsoft Fabric: SQL analytics endpoint performance considerations [link]
[2] Microsoft Learn (2024) Microsoft Fabric: What is the SQL analytics endpoint for a lakehouse? [link]
[3] Microsoft Learn (2024) Microsoft Fabric: What’s new in the Fabric SQL Analytics Endpoint? [link]

09 February 2025

🏭🗒️Microsoft Fabric: Data Pipelines [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 9-Feb-2025

[Microsoft Fabric] Data pipeline

  • {def} a logical sequence of activities that orchestrate a process and perform together a task [1]
    • usually by extracting data from one or more sources and loading it into a destination; 
      • ⇐ often transforming it along the way [1]
      • ⇐ allows managing the activities as a set instead of each one individually [2]
      • ⇐ used to automate ETL processes that ingest transactional data from operational data stores into an analytical data store [1]
      • e.g. lakehouse or data warehouse 
  • {concept} activity
    • {def} an executable task in a pipeline
      • a flow of activities can be defined by connecting them in a sequence [1]
      • its outcome (success, failure, or completion) can be used to direct the flow to the next activity in the sequence [1]
    • {type} data movement activities
      • copies data from a source data store to a sink data store [2]
    • {type} data transformation activities
      • encapsulate data transfer operations
        • incl. simple Copy Data activities that extract data from a source and load it to a destination
        • incl. complex Data Flow activities that encapsulate dataflows (Gen2) that apply transformations to the data as it is transferred
        • incl. notebook activities to run a Spark notebook
        • incl. stored procedure activities to run SQL code
        • incl. delete data activities to delete existing data
    • {type} control flow activities
        • used to 
          • implement loops
          • implement conditional branching
          • manage variables
          • manage parameter values
          • enable implementing complex pipeline logic to orchestrate the data ingestion and transformation flow [1]
        • can be parameterized
          • ⇐ enabling specific values to be provided each time a pipeline is run [1]
        • when executed, a run is initiated (aka data pipeline run)
          • runs can be initiated on-demand or scheduled to start at a specific frequency
          • use the unique run ID to review run details to confirm they completed successfully and investigate the specific settings used for each execution [1]
      • {benefit} increases pipelines’ reusability
    • {concept} pipeline template
      • a predefined pipeline that can be used and customized as required
    • {concept} data pipeline run
      • occurs when a data pipeline is executed
      • the activities in the data pipeline are executed to completion [3] 
      • can be triggered one of two ways
        • on-demand
        • on a schedule
          • the scheduled pipeline will be able to run based on the time and frequency set [3]

    Previous Post <<||>> Next Post

    References:
    [1] Microsoft Learn (2023) Use Data Factory pipelines in Microsoft Fabric [link]
    [2] Microsoft Learn (2024) Microsoft Fabric: Activity overview [link]
    [3] Microsoft Learn (2024) Microsoft Fabric Concept: Data pipeline Runs [link]

    Resources
    [R1] Metadata Driven Pipelines for Microsoft Fabric (link)
    [R2] 

    06 February 2025

    🌌🏭KQL Reloaded: First Steps (Part V: Database Metadata)

    When working with a new data repository, one of the first things to do is to look at the database's metadata, when available, and try to get a bird's-eye view of what's available: how big the database is in terms of size, tables and user-defined objects, how the schema was defined, how the data are stored, eventually how often backups are taken, which users have access and to what, etc. 

    So, after creating some queries in KQL and figuring out how things work, I tried to check what metadata are available, how they can be accessed, etc. The target is not to provide a full list of the available metadata, but to understand what information is available, in what format, and how easy it is to extract the important pieces. 

    So, the first set of metadata is related to database:

    // get database metadata
    .show databases (ContosoSales)
    
    // get database metadata (multiple databases)
    .show databases (ContosoSales, Samples)
    
    // get database schema metadata
    .show databases (ContosoSales) schema
    
    // get database schema metadata (multiple databases) 
    .show databases (ContosoSales, Samples) schema
    
    // get database schema violations metadata
    .show database ContosoSales schema violations
    
    // get database entities metadata
    .show databases entities with (showObfuscatedStrings=true)
    | where DatabaseName == "ContosoSales"
    
    // get database tables metadata
    .show databases entities with (resolveFunctionsSchema=true)
    | where DatabaseName == "ContosoSales" and EntityType == "Table"
    //| summarize count () //get the number of tables
    
    // get a function's details
    .show databases entities with (resolveFunctionsSchema=true)
    | where DatabaseName == "ContosoSales" 
        and EntityType == "Function" 
        and EntityName == "SalesWithParams"
    
    // get external tables metadata
    .show external tables
    
    // get materialized views metadata
    .show materialized-views
    
    // get query results metadata
    .show stored_query_results
    
    // get entities groups metadata
    .show entity_groups
    

    Then, it's useful to look at the database objects. 

    // get all tables 
    .show tables 
    //| count

    // get tables metadata
    .show tables (Customers, NewSales)
    
    // get tables schema
    .show table Customers cslschema
    
    // get schema as json
    .show table Customers schema as json
    
    // get table size: Customers
    Customers
    | extend sizeEstimateOfColumn = estimate_data_size(*)
    | summarize totalSize_MB=round(sum(sizeEstimateOfColumn)/1024.00/1024.00,2)
    

    Unfortunately, the public environment has restrictions concerning the creation of objects, while for some of the features one first needs to create the respective objects in order to query the corresponding metadata.

    Furthermore, it would be interesting to understand who has access to the various repositories, what policies were defined, and so on. 

    // get principal roles
    .show database ContosoSales principal roles
    
    // get principal roles for table
    .show table Customers principal roles
    
    // get principal roles for function:
    .show function SalesWithParams principal roles
    
    // get retention policies
    .show table Customers policy retention
    
    // get sharding policies
    .show table Customers policy sharding
    

    There are many more objects one can explore. It makes sense to document the features, respectively the objects used for the various purposes.
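
    For column-level metadata one can also use the getschema operator directly on a table, respectively .show functions for the user-defined functions. A minimal sketch against the same ContosoSales tables used above:

    // get a table's column-level schema
    Customers
    | getschema
    
    // get the user-defined functions
    .show functions
    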

    In addition, one should also check the best practices available for the data repository (see [2]). 

    Happy coding!

    Previous Post <<||>> Next Post

    References:
    [1] Microsoft Learn (2024) Management commands overview [link]
    [2] Microsoft Learn (2024) Kusto: Best practices for schema management [link]

    22 January 2025

    🏭🗒️Microsoft Fabric: Clone Tables in Warehouses [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 22-Jan-2025

    [Microsoft Fabric] Zero-copy Clone

    • {def} a replica of an existing OneLake table created by copying the existing table's metadata and referencing its data files [1]
      • the metadata is copied while the underlying data of the table stored as parquet files is not copied [1]
      • its creation is like creating a delta table [1]
      • DML/DDL changes on the source 
        • are not reflected in the clone table [1]
      • DML/DDL changes on the clone
        • are not reflected on the source [1]
      • can be created within or across schemas in a warehouse [1]
      • created based on either:
        • current point-in-time
          • based on the present state of the table [1]
        • previous point-in-time
          • based on a point-in-time up to seven days in the past
            • the table clone contains the data as it appeared at a desired past point in time
            • all CRUD operations are retained for seven calendar days
          • created with a timestamp based on UTC
    • {characteristic} autonomous existence
      • the original source and the clones can be deleted without any constraints [1]
      • once a clone is created, it remains in existence until deleted by the user [1]
    • {characteristic} inherits 
      • object-level SQL security from the source table of the clone [1]
        • DENY permission can be set on the table clone if desired [1]
          • the workspace roles provide read access by default [1]
      • all attributes that exist at the source table, whether the clone was created within the same schema or across different schemas in a warehouse [1]
      • the primary and unique key constraints defined in the source table [1]
    • a read-only delta log is created for every table clone that is created within the Warehouse [1]
    • {benefit} facilitates development and testing processes 
      • by creating copies of tables in lower environments [1]
    • {benefit} provides consistent reporting and zero-copy duplication of data for analytical workloads and ML modeling and testing [1]
    • {benefit} provides the capability of data recovery in the event of a failed release or data corruption by retaining the previous state of data [1]
    • {benefit} helps create historical reports that reflect the state of data as it existed as of a specific point-in-time in the past [1]
    • {limitation} table clones across warehouses in a workspace are not currently supported [1]
    • {limitation} table clones across workspaces are not currently supported [1]
    • {limitation} clone table is not supported on the SQL analytics endpoint of the Lakehouse [1]
    • {limitation} clone of a warehouse or schema is currently not supported [1]
    • {limitation} table clones as of a point in time older than the seven-day retention period cannot be created [1]
    • {limitation} cloned tables do not currently inherit row-level security or dynamic data masking [1]
    • {limitation} changes to the table schema prevent a clone from being created prior to the table schema change [1]
    • {best practice} create the clone tables in dedicated schema(s)
    • [syntax] CREATE TABLE <schema.clone_table_name> AS CLONE OF <schema.table_name>
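
    A minimal sketch based on the above syntax, using hypothetical dbo.Sales and clones.Sales_* names; the AT clause creates the clone as of a past point in time (UTC, within the seven-day window; see [4] for the exact syntax):

    -- clone based on the current point-in-time
    CREATE TABLE clones.Sales_Copy
    AS CLONE OF dbo.Sales;
    
    -- clone based on a previous point-in-time (UTC)
    CREATE TABLE clones.Sales_AsOf
    AS CLONE OF dbo.Sales
        AT '2025-01-20T10:00:00.000';
    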

    Previous Post  <<||>> Next Post

    References:
    [1] Microsoft Learn (2023) Clone table in Microsoft Fabric [link]
    [2] Microsoft Learn (2024) Tutorial: Clone tables in the Fabric portal [link]
    [3] Microsoft Learn (2024) Tutorial: Clone a table with T-SQL in a Warehouse [link]
    [4] Microsoft Learn (2024) SQL: CREATE TABLE AS CLONE OF [link]

    17 January 2025

    💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part VIII: Permissions) [new feature]

    Data-based solutions usually target a set of users who (ideally) have restricted permissions to the functionality. Therefore, as part of the process several personas are defined that target different use cases, for which the permissions must be restricted accordingly. 

    In the simplest scenario the user must have access to the underlying objects for querying the data. Supposing that an Entra User was created already, the respective user must also be given access in the Fabric SQL database (see [1], [2]). From the database's main menu follow the path below to assign read permissions:
    Security >> Manage SQL Security >> (select role: db_datareader)

    Manage SQL Security

    Manage access >> Add >> (search for User)

    Manage access

    (select user) >> Share database >> (select additional permissions) >> Save

    Manage additional permissions

    The easiest way to test whether the permissions work before building the functionality is to log in via SQL Server Management Studio (SSMS) and check the access using Microsoft Entra MFA. Ideally, one should have a User's credentials that can be used only for testing purposes. After the above setup was done, the new User was able to access the data. 

    A second User can be created for testing with the maximum permissions allowed on the SQL database side, which is useful for troubleshooting. Alternatively, one can use only one User for testing and assign or remove the permissions as needed by the test scenario. 
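
    Assigning, respectively removing, a database role can also be scripted in T-SQL, which makes switching between test scenarios easier. A minimal sketch, assuming the Entra user from above (placeholder domain):

    -- add the user to the db_datawriter role
    ALTER ROLE db_datawriter ADD MEMBER [JamesClavell@contoso.onmicrosoft.com];
    
    -- remove the user from the role once the scenario was tested
    ALTER ROLE db_datawriter DROP MEMBER [JamesClavell@contoso.onmicrosoft.com];
    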

    It's a good idea to try to understand what's happening in the background. For example, the expectation was that a SQL user would also be created for the Entra User created above, which doesn't seem to be the case, at least per the functionality currently available. 

    Before diving deeper, it's useful to retrieve the current User's details: 

    -- retrieve current user
    SELECT SUser_Name() sys_user_name
    , User_Id() user_id 
    , USER_NAME() user_name
    , current_user [current_user]
    , user [user]; 
    
    Output:
    sys_user_name user_id user_name current_user user
    JamesClavell@[domain].onmicrosoft.com 0 JamesClavell@[domain].onmicrosoft.com JamesClavell@[domain].onmicrosoft.com JamesClavell@[domain].onmicrosoft.com

    Retrieving the current User is especially useful when testing functionality in parallel with different Users. Strangely, the User's ID is 0 when only read permissions were assigned. However, a valid User identifier is added, for example, when the db_datawriter role is also assigned to the User. Removing the db_datawriter role afterwards keeps, as expected, the User's ID. For troubleshooting purposes, at least per the current functionality, it might be a good idea to create the Users with a valid User ID (e.g. by temporarily assigning the db_datawriter role to the User). 

    The next step is to look at the Users with access to the database:

    -- database access 
    SELECT USR.uid
    , USR.name
    --, USR.sid 
    , USR.hasdbaccess 
    , USR.islogin
    , USR.issqluser
    --, USR.createdate 
    --, USR.updatedate 
    FROM sys.sysusers USR
    WHERE USR.hasdbaccess = 1
      AND USR.islogin = 1
    ORDER BY uid
    
    Output:
    uid name hasdbaccess islogin issqluser
    1 dbo 1 1 1
    6 CharlesDickens@[...].onmicrosoft.com 1 1 0
    7 TestUser 1 1 1
    9 JamesClavell@[...].onmicrosoft.com 1 1 0

    For testing purposes, besides the standard dbo user and the two Entra-based users, a SQL user was also created and granted access to the SalesLT schema (see the initial post):

    -- create the user
    CREATE USER TestUser WITHOUT LOGIN;
    
    -- assign access to SalesLT schema 
    GRANT SELECT ON SCHEMA::SalesLT TO TestUser;
      
    -- test impersonation (run together)
    EXECUTE AS USER = 'TestUser';
    
    SELECT * FROM SalesLT.Customer;
    
    REVERT; 

    Notes:
    1) Strangely, even if access was given explicitly only to the SalesLT schema, the TestUser User also has access to sys.sysusers and other DMVs. That's valid also for the access over SSMS.
    2) For the User created above there are no records in the sys.user_token and sys.login_token DMVs, in contrast with the user(s) created for administering the SQL database. 
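
    To verify the latter, one can query the two DMVs while impersonating the User, reusing the impersonation pattern from above (a minimal sketch):

    -- check the security tokens under impersonation (run together)
    EXECUTE AS USER = 'TestUser';
    
    SELECT * FROM sys.user_token;
    
    SELECT * FROM sys.login_token;
    
    REVERT;
    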

    Let's look at the permissions granted explicitly:

    -- permissions granted explicitly
    SELECT DPR.principal_id
    , DPR.name
    , DPR.type_desc
    , DPR.authentication_type_desc
    , DPE.state_desc
    , DPE.permission_name
    FROM sys.database_principals DPR
         JOIN sys.database_permissions DPE
    	   ON DPR.principal_id = DPE.grantee_principal_id
    WHERE DPR.principal_id != 0 -- removing the public user
    ORDER BY DPR.principal_id
    , DPE.permission_name;
    
    Result:
    principal_id name type_desc authentication_type_desc state_desc permission_name
    1 dbo SQL_USER INSTANCE GRANT CONNECT
    6 CharlesDickens@[...].onmicrosoft.com EXTERNAL_USER EXTERNAL GRANT AUTHENTICATE
    6 CharlesDickens@[...].onmicrosoft.com EXTERNAL_USER EXTERNAL GRANT CONNECT
    7 TestUser SQL_USER NONE GRANT CONNECT
    7 TestUser SQL_USER NONE GRANT SELECT
    9 JamesClavell@[...].onmicrosoft.com EXTERNAL_USER EXTERNAL GRANT CONNECT

    During troubleshooting it might be useful to check the current user's permissions at the various levels via sys.fn_my_permissions:

    -- retrieve database-scoped permissions for current user
    SELECT *
    FROM sys.fn_my_permissions(NULL, 'Database');
    
    -- retrieve schema-scoped permissions for current user
    SELECT *
    FROM sys.fn_my_permissions('SalesLT', 'Schema');
    
    -- retrieve object-scoped permissions for current user
    SELECT *
    FROM sys.fn_my_permissions('SalesLT.Customer', 'Object')
    WHERE permission_name = 'SELECT';
    

    Notes:
    1) See also [1] and [4] in what concerns the limitations that apply to managing permissions in SQL databases.

    Happy coding!

    Previous Post <<||>> Next Post

    References:
    [1] Microsoft Learn (2024) Microsoft Fabric: Share your SQL database and manage permissions [link]
    [2] Microsoft Learn (2024) Microsoft Fabric: Share data and manage access to your SQL database in Microsoft Fabric  [link]
    [3] Microsoft Learn (2024) Authorization in SQL database in Microsoft Fabric [link]
    [4] Microsoft Learn (2024) Authentication in SQL database in Microsoft Fabric [link]

    [5] Microsoft Learn (2025) Manage access for SQL databases in Microsoft Fabric with workspace roles and item permissions [link]

    01 January 2025

    💎🏭SQL Reloaded: SQL Server Metadata (Part I: Indexes Overview)

    There are scenarios in which it's useful to gather information about the available indexes and their definition as a primary step for troubleshooting or index maintenance. Moreover, it's useful to take a baseline of the defined indexes and update it accordingly when indexes change. A minimum of information can be gathered in Excel files or similar repositories, though metadata tools can be used for the same purpose, at least when they're easy to use and the associated costs are negligible. 

    Usually, there are two levels at which the information is needed - at index, respectively at column level. Sometimes it's useful to have an independent query for each level of detail, though in data warehouses and similar use cases it's useful to provide a model on top of the metadata, at least to improve the queries' maintainability.

    The first view encapsulates the logic needed to export the data and it's based on the sys.indexes, sys.objects, sys.schemas and sys.database_principals system objects. 

    -- create schema for metadata
    CREATE SCHEMA meta;
    
    -- clean after
    DROP VIEW IF EXISTS meta.vIndexes
    
    -- create views
    CREATE OR ALTER VIEW meta.vIndexes
    AS
    -- sys metadata - indexes
    SELECT DB_ID() db_id
    , DB_NAME() db_name
    , OBJ.schema_id
    , IND.object_id 
    , IND.index_id 
    , SCH.name schema_name
    , OBJ.name table_name
    , IND.name index_name
    , IND.type_desc index_type
    , IND.is_primary_key
    , IND.is_unique_constraint
    , IND.fill_factor
    , IND.has_filter
    , IND.auto_created
    , IND.is_unique is_unique_index
    , DBP.type principal_type
    , DBP.type_desc principal_type_desc
    --, INC.*
    FROM sys.indexes IND WITH (NOLOCK)
         JOIN sys.objects OBJ WITH (NOLOCK)
           ON IND.object_id = OBJ.object_id
              JOIN sys.schemas SCH WITH (NOLOCK)
                ON OBJ.schema_id = SCH.schema_id
                   JOIN sys.database_principals DBP
                     ON SCH.principal_id = DBP.principal_id 
    WHERE DBP.type IN ('S', 'E')
    

    Depending on the case, the needed information can be exported via a query like the one below:

    -- sys metadata - indexes
    SELECT IND.db_name
    , IND.schema_name
    , IND.table_name
    , IND.index_name
    , IND.index_type
    , IND.principal_type
    FROM meta.vIndexes IND
    WHERE IND.schema_name = 'SalesLT'
      AND IND.table_name IN ('Address', 'Customer')
    ORDER BY IND.table_name
    , IND.index_name
    

    Output:

    db_name schema_name table_name index_name index_type principal_type
    AdventureWorks01... SalesLT Address AK_Address_rowguid NONCLUSTERED S
    AdventureWorks01... SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion NONCLUSTERED S
    AdventureWorks01... SalesLT Address IX_Address_StateProvince NONCLUSTERED S
    AdventureWorks01... SalesLT Address PK_Address_AddressID CLUSTERED S
    AdventureWorks01... SalesLT Customer AK_Customer_rowguid NONCLUSTERED S
    AdventureWorks01... SalesLT Customer IX_Customer_EmailAddress NONCLUSTERED S
    AdventureWorks01... SalesLT Customer PK_Customer_CustomerID CLUSTERED S

    Similarly, on top of the above view a similar object can be built that also provides the column-related information, by adding the sys.index_columns and sys.columns system objects to the logic:

    -- clean after
    DROP VIEW IF EXISTS meta.vIndexColumns
    
    -- create 
    CREATE OR ALTER VIEW meta.vIndexColumns
    AS
    -- sys metadata - index columns
    SELECT IND.db_id
    , IND.schema_id
    , INC.object_id 
    , INC.index_id 
    , INC.index_column_id
    , INC.column_id
    , IND.db_name
    , IND.schema_name
    , IND.table_name
    , IND.index_name
    , COL.name column_name
    , INC.key_ordinal
    , INC.partition_ordinal
    , IND.index_type
    , IND.is_primary_key
    , IND.is_unique_constraint
    , IND.fill_factor
    , IND.has_filter
    , IND.auto_created
    , IND.is_unique_index
    , INC.is_descending_key
    , INC.is_included_column
    , IND.principal_type
    , IND.principal_type_desc
    FROM sys.index_columns INC
         JOIN sys.columns COL WITH (NOLOCK)
           ON INC.object_id = COL.object_id 
          AND INC.column_id = COL.column_id 
         JOIN meta.vIndexes IND WITH (NOLOCK)
           ON INC.object_id = IND.object_id
          AND INC.index_id = IND.index_id
    

    And here's an example of a query based on this view:

    -- sys metadata - index columns
    SELECT INC.schema_name
    , INC.table_name
    , INC.index_name
    , INC.column_name
    , INC.key_ordinal
    , INC.index_type
    , INC.principal_type
    FROM meta.vIndexColumns INC
    WHERE INC.schema_name = 'SalesLT'
      AND INC.table_name IN ('Address', 'Customer')
    ORDER BY INC.table_name
    , INC.index_name
    , INC.key_ordinal
    

    Output:
    schema_name table_name index_name column_name key_ordinal index_type principal_type
    SalesLT Address AK_Address_rowguid rowguid 1 NONCLUSTERED S
    SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion AddressLine1 1 NONCLUSTERED S
    SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion AddressLine2 2 NONCLUSTERED S
    SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion City 3 NONCLUSTERED S
    SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion StateProvince 4 NONCLUSTERED S
    SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion PostalCode 5 NONCLUSTERED S
    SalesLT Address IX_Address_AddressLine1_AddressLine2_City_StateProvince_PostalCode_CountryRegion CountryRegion 6 NONCLUSTERED S
    SalesLT Address IX_Address_StateProvince StateProvince 1 NONCLUSTERED S
    SalesLT Address PK_Address_AddressID AddressID 1 CLUSTERED S
    SalesLT Customer AK_Customer_rowguid rowguid 1 NONCLUSTERED S
    SalesLT Customer IX_Customer_EmailAddress EmailAddress 1 NONCLUSTERED S
    SalesLT Customer PK_Customer_CustomerID CustomerID 1 CLUSTERED S

    Notes:
    1) As a DBA it's useful to take a baseline of the indexes defined and reevaluate their usefulness over time (see the usage-statistics sketch after these notes). This is one of the checks that should be done when one becomes responsible for the administration of a database server, independently of the vendor.
    2) The definitions of the views can be extended as needed, though one should try to keep the overall complexity to a minimum. 
    3) There are voices against the use of NOLOCK. Feel free to change the objects accordingly!
    4) It's useful to work in a dedicated schema (e.g. meta) and have a different naming convention that deviates slightly from one defined by Microsoft. This should make sure that no confusion with the system objects exists.
    5) Azure SQL takes over many of the responsibilities for index maintenance. Even if indexes are managed automatically by the system, a baseline is still needed, at least to evaluate the functionality's performance, respectively the changes that occurred in the environment. 
    6) Except for attributes which were added for specific functionality over time, the queries should work starting with SQL Server 2005 onward.
    7) See also the notes on clustered vs nonclustered indexes.
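
    For reevaluating the indexes' usefulness (note 1), the usage statistics exposed by the system can serve as a starting point. A minimal sketch on top of the view defined above; the counters are reset at restart, so interpret them with care:

    -- sys metadata - index usage statistics (current database)
    SELECT IND.schema_name
    , IND.table_name
    , IND.index_name
    , IUS.user_seeks
    , IUS.user_scans
    , IUS.user_lookups
    , IUS.user_updates
    , IUS.last_user_seek
    , IUS.last_user_scan
    FROM meta.vIndexes IND
         LEFT JOIN sys.dm_db_index_usage_stats IUS
           ON IND.object_id = IUS.object_id
          AND IND.index_id = IUS.index_id
          AND IUS.database_id = DB_ID()
    ORDER BY IND.table_name
    , IND.index_name;
    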

    Happy coding!

    Previous Post <<||>> Next Post

    29 March 2024

    🗄️🗒️Data Management: Data [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources. 
    Last updated: 29-Mar-2024

    [Data Management] Data

    • {def} raw, unrelated numbers or entries that represent facts, concepts, events, and/or associations 
    • categorized by
      • domain
        • {type} transactional data
        • {type} master data
        • {type} configuration data
          • {subtype} hierarchical data
          • {subtype} reference data
          • {subtype} setup data
          • {subtype} policy
        • {type} analytical data
          • {subtype} measurements
          • {subtype} metrics
          • {subtype} 
      • structuredness
        • {type} structured data
        • {type} semi-structured data
        • {type} unstructured data
      • statistical usage as variable
        • {type} categorical data (aka qualitative data)
          • {subtype} nominal data
          • {subtype} ordinal data
          • {subtype} binary data
        • {type} numerical data (aka quantitative data)
          • {subtype} discrete data
          • {subtype} continuous data
      • size
        • {type} small data
        • {type} big data
    • {concept} transactional data
      • {def} data that describe business transactions and/or events
      • supports the daily operations of an organization
      • commonly refers to data created and updated within operational systems
      • supports applications that automate key business processes 
      • usually stored in normalized tables
    • {concept} master data
      • {def}"data that provides the context for business activity data in the form of common and abstract concepts that relate to the activity" [2]
        • the key business entities on which transactions are executed
      • the dimensions around which analysis is conducted
        • used to categorize, evaluate and aggregate transactional data
      • can be shared across more than one transactional application
      • there are master data common to most organizations, but also master data specific to certain industries 
      • often appear in more than one area within the business
      • represent one version of the truth
      • can be further divided into specialized subsets
      • {concept} master data entity
        • core business entity used in different applications across the organization, together with their associated metadata, attributes, definitions, roles, connections and taxonomies 
        • may be classified within a hierarchy
          • the way they describe, characterize and classify business concepts may actually cross multiple hierarchies in different ways
            • e.g. a party can be an individual, customer, employee, while a customer might be an individual, party or organization
      • do not change as frequently as transactional data
        • less volatile than transactional data 
        • there are master data that don’t change at all 
          • e.g. geographic locations
      • strategic asset of the business 
      • needs to be managed with the same diligence as other strategic assets
    • {concept} metadata 
      • {definition} "data that defines and describes the characteristics of other data, used to improve both business and technical understanding of data and data-related processes" [2]
        • data about data
      • refers to 
        • database schemas for OLAP & OLTP systems
        • XML document schemas
        • report definitions
        • additional database table and column descriptions stored with extended properties or custom tables provided by SQL Server
        • application configuration data
    • {concept} analytical data
      • {definition} data that supports analytical activities 
        • e.g. decision making, reporting queries and analysis 
      • comprises
        • numerical values
        • metrics
        • measurements
      • stored in OLAP repositories 
        • optimized for decision support 
        • enterprise data warehouses
        • departmental data marts
        • within table structures designed to support aggregation, queries and data mining 
    • {concept} hierarchical data 
      • {definition} data that reflects a hierarchy 
        • relationships between data are represented in hierarchies
      • typically appears in analytical applications
      • {concept} hierarchy
        • "a classification structure arranged in levels of detail from the broadest to the most detailed level" [2]
        • {concept} natural hierarchy
          • stem from domain-based attributes
          • represent an intrinsic structure of the data
            • they are natural for the data
              • e.g. product taxonomy (categories/subcategories)
          • useful for drilling down from a general to a detailed level in order to find reasons, patterns, and problems
            • common way of analyzing data in OLAP applications
            • common way of filtering data in OLTP applications
        • {concept} explicit hierarchy
          • organize data according to business needs
          • entity members can be organized in any way
          • can be ragged
            • the hierarchy can end at different levels
        • {concept} derived hierarchy
          • domain-based attributes form natural hierarchies 
          • relationships between entities must already exist in a model
          • can be recursive
    • {concept} structured data
      • {definition} "data that has a strict metadata defined"
    • {concept} unstructured data 
      • {definition} data that doesn't follow predefined metadata
      • involves all kinds of documents 
      • can appear in a database, in a file, or even in printed material
    • {concept} semi-structured data 
      • {definition} structured data stored within unstructured data
      • data typically in XML form
        • XML is widely used for data exchange
      • can appear in stand-alone files or as part of a database (as a column in a table)
      • useful when metadata (the schema) changes frequently, or there’s no need for a detailed relational schema

    References:
    [1] The Art of Service (2017) Master Data Management Course 
    [2] DAMA International (2011) "The DAMA Dictionary of Data Management", 

    06 February 2024

    💎🏭SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Data Objects Metadata

    There seem to be four main sources for learning about the functionality available in Delta Lake, especially in what concerns the SQL dialect used by Spark SQL: Databricks [1], Delta Lake [2], Azure Databricks [3], respectively the Data Engineering documentation in Microsoft Fabric [4] and the afferent certification material. Unfortunately, the latter focuses more on PySpark. So, until Microsoft addresses the gap, one can consult the other sources, check what's working and thus build the required knowledge for handling the various tasks. 

    First of all, it's important to understand which data objects are available in Microsoft Fabric. Based on [5] I could identify the following hierarchy:

    metastore » catalog » database (schema) » tables, views, functions

    According to the same source [5] the metastore contains all of the metadata that defines data objects in the lakehouse, while the catalog is the highest abstraction in the lakehouse. The database, called also a schema, keeps its standard definition - a collection of data objects, such as tables or views (aka relations) and functions. The tables, views and functions keep their standard definitions. Except the metastore, these are also the (securable) objects on which permissions can be set. 

    One can explore the structure in Spark SQL by using the SHOW command:

    -- explore the data objects from the lakehouse
    SHOW CATALOGS;
    
    SHOW DATABASES;
    
    SHOW SCHEMAS;
    
    SHOW VIEWS;
    
    SHOW TABLES;
    
    SHOW FUNCTIONS;
    

    Moreover, one can list only the objects belonging to a given parent (e.g. the tables existing in a database):
     
    -- all tables from a database
    SHOW TABLES FROM Testing;
    
    -- all tables from a database matching a pattern
    SHOW TABLES FROM Testing LIKE 'Asset*';
    
    -- all tables from a database matching multiple patterns
    SHOW TABLES FROM Testing LIKE 'Asset*|cit*';
    

    Notes:
    1) The same syntax applies for views and functions, respectively for the other objects with respect to their parents (from the hierarchy); see the sketch after these notes. 
    2) Instead of FROM one can use the IN keyword, though I'm not sure what the difference is, given that they seem to return the same results.
    3) For databases and tables one can also use the SQL analytics endpoint to export the related metadata. Unfortunately, that's not the case for views and functions, because when the respective objects are created in Spark SQL they aren't visible over the SQL Endpoint, and vice versa.

    4) Given that there are multiple environments that deal with delta tables, to minimize the confusion that might result from their use, it makes sense to use database as term instead of schema. 
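
    As mentioned in note 1, the same filtering applies also to views and functions. A minimal sketch (the database and the patterns are just examples):

    -- all views from a database matching a pattern
    SHOW VIEWS FROM Testing LIKE 'vAsset*';
    
    -- all functions matching a pattern
    SHOW FUNCTIONS LIKE 'Sales*';
    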

    References:
    [1] Databricks (2024) SQL language reference (link)
    [2] Delta Lake (2023) Delta Lake documentation (link)
    [3] Microsoft Learn (2023) Azure Databricks documentation (link)
    [4] Microsoft Learn (2023) Data Engineering documentation in Microsoft Fabric (link)
    [5] Databricks (2023) Data objects in the Databricks lakehouse (link)

    💎🏭SQL Reloaded: Microsoft Fabric's Delta Tables in Action - Table Metadata II (Updating a table's COMMENT attributes)

    While using the DESCRIBE TABLE metadata I observed that its output also shows a Comment attribute, which was NULL for all the columns. Browsing through the Databricks documentation (see [1]), I found that the Comment can be provided in a delta table's definition as follows (code to be run in a notebook):

    -- drop the table (if already exists)
    DROP TABLE IF EXISTS Assets2;
    
    --create the table
    CREATE TABLE Assets2 (
     Id int NOT NULL COMMENT 'Asset UID',
     CreationDate timestamp NOT NULL COMMENT 'Creation Date',
     Vendor string NOT NULL COMMENT 'Vendors name',
     Asset string NOT NULL COMMENT 'Asset type',
     Model string NOT NULL COMMENT 'Asset model',
     Owner string NOT NULL COMMENT 'Current owner',
     Tag string NOT NULL COMMENT 'Assets tag',
     Quantity decimal(13, 2) NOT NULL COMMENT 'Quantity'
    ) 
    USING DELTA
    COMMENT 'Vendor assets';
    
    -- show table's definition
    DESCRIBE TABLE Assets2;
    
    -- show table's details
    DESCRIBE DETAIL Assets2;
    

    So, I thought there must be a way to update the Assets table's definition as well. And here's the solution, split into two steps, as different syntax is required for modifying the columns, respectively the table:

    -- modify columns' COMMENT for an existing table
    ALTER TABLE Assets ALTER COLUMN ID COMMENT 'Asset UID';
    ALTER TABLE Assets ALTER COLUMN CreationDate COMMENT 'Creation Date';
    ALTER TABLE Assets ALTER COLUMN Vendor COMMENT 'Vendors name';
    ALTER TABLE Assets ALTER COLUMN Asset COMMENT 'Asset type';
    ALTER TABLE Assets ALTER COLUMN Model COMMENT 'Asset model';
    ALTER TABLE Assets ALTER COLUMN Owner COMMENT 'Current owner';
    ALTER TABLE Assets ALTER COLUMN Tag COMMENT 'Assets tag';
    ALTER TABLE Assets ALTER COLUMN Quantity COMMENT 'Quantity';
    
    -- show table's definition
    DESCRIBE TABLE Assets;
    

    The operation generated a log file with the following content:

    {"commitInfo":{"timestamp":1707180621522,"operation":"CHANGE COLUMN","operationParameters":{"column":"{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}"},"readVersion":13,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"42590d18-e04a-4fb2-bbfa-94414b23fb07"}}
    {"metaData":{"id":"83e87b3c-28f4-417f-b4f5-842f6ba6f26d","description":"Vendor assets","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[
    {\"name\":\"Id\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset UID\"}},
    {\"name\":\"CreationDate\",\"type\":\"timestamp\",\"nullable\":false,\"metadata\":{\"comment\":\"Creation Date\"}},
    {\"name\":\"Vendor\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Vendors name\"}},
    {\"name\":\"Asset\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset type\"}},
    {\"name\":\"Model\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset model\"}},
    {\"name\":\"Owner\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Current owner\"}},
    {\"name\":\"Tag\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Assets tag\"}},
    {\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}]}","partitionColumns":[],"configuration":{},"createdTime":1707149934697}}
    

    And the second step (see [2]):

    -- modify a table's COMMENT
    COMMENT ON TABLE Assets IS 'Vendor assets';
    
    -- describe table's details
    DESCRIBE DETAIL Assets;

    The operation generated a log file with the following content:

    {"commitInfo":{"timestamp":1707180808138,"operation":"SET TBLPROPERTIES","operationParameters":{"properties":"{\"comment\":\"Vendor assets\"}"},"readVersion":14,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{},"engineInfo":"Apache-Spark/3.4.1.5.3-110807746 Delta-Lake/2.4.0.8","txnId":"21c315b5-a81e-4107-8a19-6256baee7bd5"}}
    {"metaData":{"id":"83e87b3c-28f4-417f-b4f5-842f6ba6f26d","description":"Vendor assets","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[
    {\"name\":\"Id\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset UID\"}}
    ,{\"name\":\"CreationDate\",\"type\":\"timestamp\",\"nullable\":false,\"metadata\":{\"comment\":\"Creation Date\"}}
    ,{\"name\":\"Vendor\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Vendors name\"}}
    ,{\"name\":\"Asset\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset type\"}}
    ,{\"name\":\"Model\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Asset model\"}}
    ,{\"name\":\"Owner\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Current owner\"}}
    ,{\"name\":\"Tag\",\"type\":\"string\",\"nullable\":false,\"metadata\":{\"comment\":\"Assets tag\"}}
    ,{\"name\":\"Quantity\",\"type\":\"decimal(13,2)\",\"nullable\":false,\"metadata\":{\"comment\":\"Quantity\"}}]}","partitionColumns":[],"configuration":{},"createdTime":1707149934697}}
    

    Notes:
    1) Don't forget to remove the Assets2 table!
    2) Updating the COMMENT for a single delta table is simple, though updating the same for a list of tables might be a task for a PySpark job. It makes sense to provide the respective metadata from the start, when possible. Anyway, the information should be available in the lakehouse's data dictionary.
    3) One can reset the COMMENT to NULL or to an empty string.
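
    A minimal sketch for the latter, reusing the syntax from above (an empty string for a column, respectively NULL for the table):

    -- reset a column's COMMENT to an empty string
    ALTER TABLE Assets ALTER COLUMN Tag COMMENT '';
    
    -- reset the table's COMMENT
    COMMENT ON TABLE Assets IS NULL;
    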

    Happy coding!

    Resources:
    [1] Databricks (2023) ALTER TABLE (link)
    [2] Databricks (2023) COMMENT ON (link)