09 April 2025

💠🛠️🗒️SQL Server: Tempdb Database [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources. It considers only on-premise SQL Server, for other platforms please refer to the documentation.

Last updated: 9-Apr-2025

[SQL Server 2005] Tempdb database

  • {def} system database available as a global resource to all users connected to a Database Engine instance [1]
    • does not persist after SQL Server shuts down 
      • created at restart using the specifications from the model database 
      • doesn’t need crash recovery
      • requires rollback capabilities
        • {scenario} when updating a row in a global temporary table and rolling back the transaction [12]
          • there's no way to undo this change unless was logged the ‘before value’ of the update [12]
          • there is no need to log the ‘after value’ of the update because this value is only needed when the transaction needs to be ‘redone’ which happens during database recovery [12]
        • {scenario} during an Insert operation
          • only redo information is needed
          • when insert a row in say a global temporary table, the actual ‘row value’ is not logged because SQL Server does not need the ‘row’ to undo except it needs to set the offsets within the page appropriately or if this insert had caused a new page to be allocate, de-allocate the page [12]
            • only redo information is needed
          • not all objects in TempDB are subject to logging [18]
    • {characteristic} critical database
      • ⇒ needs to be configured adequately 
      •  has the highest level of create and drop actions [5]
      • under high stress the allocation pages, syscolumns and sysobjects can become bottlenecks [5]
      • workers use tempdb like any other database
        • any worker can issue I/O to and from tempdb as needed [5]
    • if the tempdb database cannot be created, SQL Server will not start [7]
      • {recommendation} start SQL Server by using the (-f) startup parameter
  • files
    • data file:
      • {default} dempdb.mdf
    • log file
      • {default} templog.ldf
  • {operation} changing the tempdb database
    • {step} use ALTER database statement and MODIFY FILE clause to change the physical file names of each file in the tempdb database to refer to the new physical location [7]
    • {step} stop and restart SQL Server
  • {operation} create file group
    • only one file group for data and one file group for logs are allowed.
    • {default} the number of files is set to 1
    • can be a high contention point for internal tracking structures
      •  often it's better to create multiple files for tempdb
    • multiple files can be created for each file group
      • {recommendation} set the number of files to match the number of CPUs that are configured for the instance
        • ⇐ if the number of logical processors is less than or equal to eight [1]
          • otherwise, use eight data files [1]
          • contention is still observed, increase the number of data files by multiples of four until the contention decreases to acceptable levels, or make changes to the workload [1]
        • ⇐ it’s not imperative

      • {scenario}add a file per CPU (aka multiple files)
        • allows the file group manager to 
          • pick the next file 
            • there are some optimizations here for tempdb to avoid contention and skip to the next file when under contention [4]
          •  then use the target location to start locating free space [4]                
        • ⇐ SQL Server creates a logical scheduler for each of the CPUs presented to it (logical or physical)
          • allows each of the logical schedulers to loosely align with a file [4]
            • since can be only 1 active worker per scheduler this allows each worker to have its own tempdb file (at that instant) and avoid allocation contention  [4]        
        • internal latching and other activities achieve better separation (per file) 
          • ⇒ the workers don’t cause a resource contention on a single tempdb file
            • rare type of contention [5]
        • {misconception}
          • leads to new I/O threads per file [5]       
      • {scenario} add more data files 
        • may help to solve potential performance problems that are due to I/O operations. 
        • helps to avoid a latch contention on allocation pages
          •  manifested as a UP-latch)
      • {downside} having too many files 
        • increases the cost of file switching
        • requires more IAM pages
        • increases the manageability overhead. 
      • {recommendation} configure the size of the files to be equal
        • ⇐ to better use the allocation mechanism (proportional fill),
      • {recommendation} create tempdb files striped across fast disks 
    • the size of the tempdb affect the performance of the database 
    • {event} server restarts
      • the tempdb file size is reset to the configured value (the default is 8 MB).
    • {feature} auto grow 
      • temporary 
        • ⇐  unlike other types of databases)
      • when the tempdb grows, all transactional activity may come to an halt 
        • because TempDB is used by most operations and these activities will get blocked until more disk space gets allocated to the TempDB [12]
        • {recommendation} pre-allocate the size of the TempDB that matches the needs of workload [12]
      • {recommendation} should only be used more for exceptions rather than as a strategy [12]
    • transactions
      • lose the durability attribute
      • can be rolled back
        • there is no need to REDO them because the contents of tempdb do not persist across server restarts
          • because the transaction log does not need to be flushed, transactions are committed faster in tempdb than in user databases. In tempdb, transactions 
        • most internal operations on tempdb do not generate log records because there is no need to roll back
    • {restriction} add filegroups [1]
    • {restriction} remove the primary filegroup, primary data file, or log file [1]
    • {restriction} renaming the database or primary filegroup [1]
    • {restriction} back up or restore the database [1]
    • {restriction} change collation [1]
      • the default collation is the server collation [1]
    • {restriction} change the database owner [1]
      • owned by sa
    • {restriction} create a database snapshot [1]
    • {restriction} drop the database [1]
    • {restriction} drop the quest user from the database [1]
    • {restriction} enable CDC
    • {restriction} participate in database mirroring [1]
    • {restriction} can only be configured in the simple recovery model
      • ⇒ the value can’t be changed
    • {restriction} auto shrink is not allowed for tempdb
      • database shrink and file shrink capabilities are limited
        •  because many of the hidden objects stored in tempdb cannot be moved by shrink operations
    • {restriction} the database CHECKSUM option cannot be enabled
    • {restriction} database snapshots cannot be created
    • {restriction} DBCC CHECKALLOC and DBCC CHECKCATALOG are not supported.
      • only offline checking for DBCC CHECKTABLE is performed
        • ⇒ TAB-S lock is needed
        • here are internal consistency checks that occur when tempdb is in use
          • if the checks fail, the user connection is broken and the tempdb space used by the connection is freed. 
    • {restriction} set the database to OFFLINE
    • {restriction} set the database or primary filegroup to READ_ONLY [1]
  • used by
    • {feature} query
    • {feature} temporary tables
    • {feature} table variables
    • {feature} table-valued functions
    • {feature} user-defined functions
    • {feature} online/offline index creation
    • {feature} triggers 
    • {feature} cursors
    • {feature} RCSI
    • {feature} MARS
    • {feature} DBCC CHECK
    • {feature} LOB parameters 
    • {feature} Service Broker and event notification
    • {feature} XML and LOB variable
    • {feature} query notifications
    • {feature} database mail
  • used to store
    • user objects
      • objects explicitly created
        • user objects that can be created in a user database can also be created in tempdb [1]
          • {limitation} there's no durability guarantee [1]
          • {limitation} dropped when the Database Engine instance restarts [1]
      • {type} global and local temporary tables 
        • correspond to ## and # tables and table variables created explicitly by the application [12]
          • REDO information is not logged
      • {type} indexes on global and local temporary tables
      • {type} table variables
      • {type} tables returned in table-valued functions
      • {type} cursors
      • improved caching for temporary objects
    • temp objects
      • only cached when none of the following conditions is violated: [12]
        • named constraints are not created
      • DDL statements that affect the table are not run after the temp table has been created
        • e.g. CREATE INDEX or CREATE STATISTICS statements
      • not created by using dynamic SQ
        • e.g. sp_executesql N'create table #t(a int)'
    • internal objects
      • objects created internally by SQL Server
        • each object uses a minimum of nine pages [1]
          • an IAM page and an eight-page extent [1]
      • {type} work tables
        • store intermediate results for spools, cursors, sorts, and temporary LOB storage [1]
      • {type} work files
        • used for for hash join or hash aggregate operations [1]
      • {type} intermediate sort results
        • used for operations 
          • creating or rebuilding indexes
            • if SORT_IN_TEMPDB is specified
          • certain GROUP BY, ORDER BY
          • UNION queries
      • {restriction} applications cannot directly insert into or delete rows from them
      • {restriction} metadata is stored in memory
      • {restriction} metadata does not appear in system catalog views such as sys.all_objects.
      • {restriction} considered to be hidden objects
      • {restriction} updates to them do not generate log records
      • {restriction} page allocations do not generate log records unless on a sort unit. If the statement fails, these objects are deallocated. 
      • each object occupies at least nine pages (one IAM page and eight data pages) in tempdb
      • used
        • to store intermediate runs for sort
        • to store intermediate results for hash joins and hash aggregates.
        • to store XML variables or other LOB data type variables
          • e.g. text, image, ntext, varchar(max), varbinary(max), and all others
        • by queries that need a spool to store intermediate results
        • by keyset cursors to store the keys
        • by static cursors to store a query result
        • by Service Broker to store messages in transit
        • by INSTEAD OF triggers to store data for internal processing
        • by any feature that uses the above mentioned operations 
    • version stores
      • collections of data pages that hold the data rows that support row versioning [1]
      • {type} common version store
      • {type} online index build version store
  • {} storage
    • I/O characteristics
      • {requirement} read after rights
        • {def} the ability of the subsystem to service read requests with the latest data image when the read is issued after any write is successfully completed [7]
      • {recommendation} writer ordering
        • {def} the ability of the subsystem to maintain the correct order of write operations [7]
      • {recommendation} torn I/O prevention
        • {def} the ability of the system to avoid splitting individual I/O requests [7]
      • {requirement} physical sector alignment and size
        • devices are required to support sector attributes permitting SQL Server to perform writes on physical sector-aligned boundaries and in multiples of the sector size [7]
    • can be put on specialty systems
      • ⇐  they can’t be used for other databases
      • RAM disks
        • non-durable media
        • support double RAM cache
          • one in the buffer pool and one on the RAM disk
            •  directly takes away from the buffer pool’s total possible size and generally decreases the performance [7]
        • give up RAM
          • implementations of RAM disks and RAM-based files caches
      • solid state
        • high speed subsystems
        • {recommendation} confirm with the product vendor to guarantee full compliance with SQL Server I/O needs [7]
    • sort errors were frequently solved by moving tempdb to a non-caching local drive or by disabling the read caching mechanism [8]
  • {feature} logging optimization 
    • avoids logging the "after value" in certain log records in tempdb
      • when an UPDATE operation is performed without this optimization, the before and after values of the data are recorded in the log file 
    • can significantly reduce the size of the tempdb log as well as reduce the amount of I/O traffic on the tempdb log device
  • {feature} instant data file initialization 
    • isn’t zeroing out the NTFS file when the file is created or when the size of the file is increased
    • {benefit} minimizes overhead significantly when tempdb needs to auto grow. 
      • without this, auto grow could take a long time and lead to application timeout.
    • reduces the impact of database creation [5]
      • because zero’s don’t have to be stamped in all bytes of a database file, only the log files [5]
        • reduces the gain from using multiple threads during database creation. [5]
  • {feature} proportional fill optimization
    • reduces UP latch contention in tempdb
    • when there are multiple data files in tempdb, each file is filled in proportion to the free space that is available in the file so that all of the files fill up at about the same time
      •  accomplished by removing a latch that was taken during proportional fill.
  • {feature} deferred drop in tempdb
    • when a large temporary table is dropped by an application, it is handled by a background task and the application does not have to wait
      • ⇒ faster response time to applications.
  • {feature} worktable caching 
    • {improvement} when a query execution plan is cached, the work tables needed by the plan are not dropped across multiple executions of the plan but merely truncated
      • ⇐ in addition, the first nine pages for the work table are kept.
  • {feature} caches temporary objects
    • when a table-valued functions, table variables, or local temporary tables are used in a stored procedure, function, or trigger, the frequent drop and create of these temporary objects can be time consuming
      •  this can cause contentions on tempdb system catalog tables and allocation pages 
Previous Post <<||>> Next Post

References
[1] Microsoft Learn (2024) SQL Server: Tempdb database [link]
[2] Wei Xiao et al (2006) Working with tempdb in SQL Server 2005
[4] CSS SQL Server Engineers (2009) SQL Server TempDB – Number of Files – The Raw Truth [link]
[5] SQL Server Support Blog (2007) SQL Server Urban Legends Discussed [link]
[6] Microsoft Learn (2023) SQL Server: Microsoft SQL Server I/O subsystem requirements for the tempdb database [link]
[7] Microsoft Learn (2023) SQL Server: Microsoft SQL Server I/O subsystem requirements for the tempdb database [link]
[8] Microsoft Learn (2023) SQL Server: SQL Server diagnostics detects unreported I/O problems due to stale reads or lost writes [link]
[12a] SQL Server Blog (2008) Managing TempDB in SQL Server: TempDB Basics (Version Store: Simple Example), by Sunil Agarwal [link]
[12b] SQL Server Blog (2008) Managing TempDB in SQL Server: TempDB Basics (Version Store: logical structure), by Sunil Agarwal [link]
[18] Simple Talk (2020) Temporary Tables in SQL Server, by Phil Factor [link]
[19] Microsoft Learn (2023) SQL Server: Recommendations to reduce allocation contention in SQL Server tempdb database [link]

Acronyms:
DB - database
DDL - Data Definition Language
I/O - Input/Output
LOB - large object
MARS - Multiple Active Result Sets
NTFS - New Technology File System
RAM - Random-Access Memory
RCSI - Read Committed Snapshot Isolation

07 April 2025

🏭🗒️Microsoft Fabric: User Data Functions (UDFs) in Python [Notes] 🆕

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 7-Apr-2025

[Microsoft Fabric] User Data Functions (UDFs)
  • {def} a platform that allows customers to host and run custom logic on Fabric from different types of items and data sources [1]
  • empower data developers to write custom logic and embed it into their Fabric ecosystem [1]
    • the logic can include internal algorithms and libraries [1]
    • {benefit} allows to handle complex transformations, optimizing performance, and integrating diverse data sources beyond the capabilities of low-code tools [5]
      • via public libraries from PyPI
      • via native Fabric integrations to 
        • connect to Fabric data sources
          • e.g. warehouse, lakehouse or SQL Database
        • invoke functions from other Fabric items
          • e.g. notebooks, Power BI reports, data pipelines
      • via edit functions directly in the Fabric portal [1]
        • via in-browser tooling, VS Code extension
      • supports the Python 3.11 runtime [1]
    • {goal} reusability
      • by creating libraries of standardized functionality [1]
        • can be used in many solutions across the organization [1]
    • {goal} customization
      • the applications are tailored to customers’ needs [1]
    • {goal} encapsulation
      • functions can perform various tasks to build sophisticated workflows [1]
    • {goal} external connectivity
      • data functions can be invoked from external client applications using a REST endpoint [1]
      • allows integrations with external systems [1]
  • {feature} programming model
    • SDK that provides the necessary functionality to author and publish runnable functions in Fabric [4]
    • {benefit} allows to seamlessly integrate with other items in the Fabric ecosystem [4]
      • e.g. Fabric data sources
    • {concept} user data functions item 
      • contains one or many functions that can be invoked from the Fabric portal using the provided REST endpoint [4]
        • from another Fabric item, or from an external application 
      • each function is a method in Python script that allows passing parameters and returning an output to the invoker [4]
    • {component} fabric.functions library 
      • provides the code needed to create user data functions in Python [4]
      • imported in the template by default [4]
    • {method} fn.UserDataFunctions() 
      • provides the execution context [4]
      • added at the beginning of the code file in all new user data functions items, before any function definitions [4]
    • {concept} functions
      • every function is identified with a @udf.function() decorator
        • can be invoked individually from the portal or an external invoker [4]
        • functions without the decorator can be invoked directly [4]
  • {feature} invocation logs
    • function invocations are logged
    • {benefit} allows to check the status or perform debugging  [3]
    • {limitation} can take few minutes to appear [3]
      • {recommendation} refresh the page after a few minutes [3]
    • {limitation} daily ingestion limit: 250 MB [3]
      • reset the next day 
        • a new log is made available [3]
    • {limitation} logs are sampled while preserving a statistically correct analysis of application data [3]
      • the sampling is done by User data functions to reduce the volume of logs ingested [3]
      • if some logs are partially missing, it might be because of sampling [3]
    • {limitation} supported log types: information, error, warning, trace [3]
  • {feature} Manage connections
    • {limitation} only supports connecting to Fabric-native data sources [2]
  • {limitation} only available in a subset of Fabric regions [2]
  • {limitation} editable by the owner only [2]
    • can only be modified and published by the user who is the owner of the User Data Functions Fabric item [2]
  • {limitation} adds further keywords to the list of reserved keywords
    • namely: req, context, reqInvocationId
    • can't be used as parameter names or function names [2]

References:
[1] Microsoft Learn (2025) Microsoft Fabric: What is Fabric User data functions (Preview)? [link]
[2] Microsoft Learn (2025) Microsoft Fabric: Service details and limitations of Fabric User Data Functions [link]
[3] Microsoft Learn (2025) Microsoft Fabric: User data functions invocation logs (Preview) [link]
[4] Microsoft Learn (2025) Microsoft Fabric: Fabric User data functions programming model overview (Preview) [link]
[5] Microsoft Fabric Updates Blog (2025) Announcing Fabric User Data Functions (Preview) [link]
[6] Microsoft Learn (2025) Microsoft Fabric: Use the Functions activity to run Fabric user data functions and Azure Functions [link

Resources:
[R1] Microsoft Fabric Updates Blog (2025) Utilize User Data Functions in Data pipelines with the Functions activity (Preview) [link

Acronyms:
PyPI - Python Package Index
REST - Representational State Transfer
SDK - Software Development Kit
UDF - User Data Function
VS - Visual Studio

06 April 2025

🏭🗒️Microsoft Fabric: Query Optimizer in Warehouse [Notes] 🆕

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 6-Apr-2025

[Microsoft Fabric] Hints in Warehouse
  • {def} keywords that users can add to SQL statements to provide additional information or instructions to the query optimizer [2]
    • options or strategies specified for enforcement by the SQL Server query processor [1]
      • applicable to SELECT, INSERT, UPDATE, or DELETE statements [1]
  • {benefit} help improve the performance, scalability, or consistency of queries by overriding the default behavior of the query optimizer [2]
  • {type} join hints
    • specify what join strategy/algorithm used between two tables [3]
      • improves the performance of queries that involve large or complex joins [2]
      • specified in the FROM clause of a query [1]
      • if a join hint is specified for any two tables, the query optimizer automatically enforces the join order for all joined tables in the query [3]
        • based on the position of the ON keywords [3]
          • when a CROSS JOIN is used without the ON clause, parentheses can be used to indicate the join order [3]
    • looping
      • via LOOP JOIN
      • {restriction} can't be specified together with RIGHT or FULL as a join type [3]
    • hashing
      • via HASH JOIN
    • merging
      • via MERGE JOIN
    • REPLICATE
      • causes a broadcast move operation
        • a specific table to be replicated across all distribution nodes [2]
      • with INNER or LEFT` join
        • the broadcast move operation will replicate the right side of the join to all nodes [2]
      • with RIGHT join
        • the broadcast move operation will replicate the left side of the join to all nodes [2]
      • with FULL` join
        • an estimated plan cannot be created [2]
    • REDISTRIBUTE [(colsCount)]
      • ensures two data sources are distributed based on JOIN clause columns [2]
      • handles multiple join conditions, specified by the first n columns in both tables, where n is the column_count argument [2]
      • redistributing data optimizes query performance by evenly spreading data across nodes during intermediate steps of execution [2]
      • the (columns_count) argument is only supported in MF warehouse [2]
  • {type} query hint
    • specify that the indicated hints are used in the scope of a query [3]
      • affect all operators in the statement
        • [UNION only the last query involving a UNION operation can have the OPTION clause [3]
      • specified as part of the OPTION clause [3]
      • Error 8622 occurs if one or more query hints cause the Query Optimizer not to generate a valid plan [3]
    • used via the OPTION clause at the end of a query [2]
      • followed by the name of the query hint and its optional parameters in parentheses [2]
      • multiple hints can be used in the same query, separated by commas
        • e.g. FORCE ORDER and MAX_GRANT_PERCENT
          • instruct the QO to preserve the join order specified in the query and to limit the memory grant to 20 percent of the available memory
    • {hint} HASH GROUP
      • specifies that the QO should use a hash-based algorithm for the GROUP BY operation [2]
      • {benefit} can improve the performance of queries that involve large or complex grouping sets [2]
    • {hint} ORDER GROUP
      • specifies that the QO should use a sort-based algorithm for the GROUP BY operation [2]
      • {benefit} can improve the performance of queries that involve small or simple grouping sets [2]
    • {hint} MERGE UNION
      • specifies that the QO should use a merge-based algorithm for the UNION or UNION ALL operation [2]
      • {benefit} can improve the performance of queries that involve sorted inputs [2]
    • {hint} HASH UNION
      • specifies that the query optimizer should use a hash-based algorithm for the UNION or UNION ALL operation [2]
      • {benefit} can improve the performance of queries that involve unsorted or large inputs [2]
    • {hint} CONCAT UNION
      • specifies that the QO should use a concatenation-based algorithm for the UNION or UNION ALL operation [2]
      • {benefit} can improves the performance of queries that involve distinct or small inputs[2]
    • {hint} FORCE ORDER
      • specifies that the QO should preserve the join order specified in the query [2]
      • {benefit} can improves the performance or consistency of queries that involve complex join conditions or hints [2]
    • {hint} FORCE SINGLE NODE PLAN/FORCE DISTRIBUTED PLAN
      • allows to choose whether to force a single node plan or a distributed plan for query’s execution [2]
    • {hint} USE HINT
      • adds one or more extra hints to the query processor, where the hints are specified with a hint name inside single quotation marks inside OPTION clause [2] OPTION(USE HINT(‘ASSUME_MIN_SELECTIVITY_FOR_FILTER_ESTIMATES’)) 
      • used with several hint names, changing the behavior of CE derivation
        • {hint name} ASSUME_MIN_SELECTIVITY_FOR_FILTER_ESTIMATES
          • applies when calculating cardinality estimates for AND predicates for filters [2]
          • MF assumes full correlation among filters when a high level of underestimation on AND predicates for filters is observed [2]
          • [SQL Server] equivalent TF 4137 [4]
        • {hint name} ASSUME_FULL_INDEPENDENCE_FOR_FILTER_ESTIMATES
          • applies when calculating cardinality estimates for AND predicates for filters  [2]
          • MF assumes full independence among filters [2]
            • if a high level of overestimation on AND predicates for filters is observed, this hint can help produce a better estimate [2]
        • {hint name} ASSUME_PARTIAL_CORRELATION_FOR_FILTER_ESTIMATES
          • applies when calculating cardinality estimates for AND predicates for filters [2]
          • {default} MF assumes partial correlation among filters [2]
            • ⇐ it is unlikely that this hint will help improve the estimates [2]
        • {hint name} ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS
          • applies when calculating cardinality estimate for joins [2]
          • uses Simple Containment assumption instead of the default Base Containment assumption [2]
          • [SQL Server] equivalent TF 9476 [4]
  • {type} table hints
    • none enforced currently

References:
[1] Microsoft Learn (2025) SQL: Hints (T-SQL) [link]
[2] Microsoft Fabric Updates Blog (2025) Hints in Fabric Data Warehouse [link]
[3] Microsoft Learn (2025) SQL: OPTION clause (T-SQL) [link]
[4] Microsoft Support (2016) KB3189813 - Update introduces USE HINT query hint argument in SQL Server 2016 [link]

Resources:
[R1] Microsoft Learn (2025) SQL Server: Query hints (T-SQL) [link]
[R2] Most Useful Query Hints [link]

Acronyms:
MF - Microsoft Fabric
TF - Trace Flag
QO - Query Optimizer

28 March 2025

🏭🗒️Microsoft Fabric: OneLake Role-Based Access Control (RBAC) [Notes] 🆕

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 28-Mar-2025

[Microsoft Fabric] OneLake Role-based access control (RBAC)

  • {def} security framework that allows to manage access to resources by assigning roles to users or groups 
    • applies to Lakehouse Items only [1]
    • restricts data access for users with Workspace Viewer or read access to a lakehouse [1]
    • doesn't apply to Workspace Admins, Members, or Contributors [1]
      • ⇒ supports only Read level of permissions [1]
    • uses role assignments to apply permissions to its members
      • assigned to 
        • individuals
        • security groups
        • Microsoft 365 groups
        • distribution lists
        • ⇐ every member of the user group gets the assigned role [1]
      • users in multiple groups get the highest level of permission that is provided by the roles [1]
    • managed through the lakehouse data access settings [1]
    • when a lakehouse is created, OneLake generates a default RBAC Role named Default Readers [1]
      • allows all users with ReadAll permission to read all folders in the Item [1]
    • permissions always inherit to the entire hierarchy of the folder's files and subfolders [1]
    • provides automatic traversal of parent items to ensure that data is easy to discover [1]
      • ⇐ similar to Windows folder permissions [1]
      • [shortcuts] shortcuts to other OneLake locations have specialized behavior [1]
        • the access to a OneLake shortcut is determined by the target permissions of the shortcut [1]
          • when listing shortcuts, no call is made to check the target access [1]
            • ⇒ when listing a directory all internal shortcuts will be returned regardless of a user's access to the target [1]
              • when a user tries to open the shortcut the access check will evaluate and a user will only see data they have the required permissions to see [1]
    •  enable you to restrict the data access in OneLake only to specific folders [1]
  • {action} share a lakehouse
    • grants other users or a group of users access to a lakehouse without giving access to the workspace and the rest of its items [1]
    • found through 
      • Data Hub 
      • 'Shared with Me' section in Microsoft Fabrics
  • [shortcuts] permissions always inherit to all Internal shortcuts where a folder is defined as target [1]
    • when a user accesses data through a shortcut to another OneLake location, the identity of the calling user is used to authorize access to the data in the target path of the shortcut [1]
      • ⇒ the user must have OneLake RBAC permissions in the target location to read the data [1]
      • defining RBAC permissions for the internal shortcut is not allowed [1]
        • must be defined on the target folder located in the target item [1]
        • OneLake enables RBAC permissions only for shortcuts targeting folders in lakehouse items [1]


References:
[1] Microsoft Learn (2024) Fabric: Role-based access control (RBAC) [link]
[2] Microsoft Learn (2024) Best practices for OneLake security [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
ADLS - Azure Data Lake Storage
RBAC - Role-Based Access Control

🏭🗒️Microsoft Fabric: Hadoop [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 28-Mar-2024

[Microsoft Fabric] Hadoop

  • Apache software library
    • backend technology that make storing data and running large-scale parallel computations possible
    • open-source framework 
    • widely adopted 
    • implements special versions of the HDFS
      • enables applications to scale to petabytes of data employing commodity hardware
    • based on MapReduce API 
      • software framework for writing jobs that process vast amounts of data [2] and enables work parallelization
      • {function} Mapper
        • consumes input data, analyzes it, and emits tuples (aka key-value pairs) [2]
        • ⇐ analysis usually involve filter and sorting operations) [2]
      • {function} Reducer
        • consumes tuples emitted by the Mapper and performs a summary operation that creates a smaller, combined result from the Mapper data [2]
  • {benefit} economical scalable storage mode
    • can run on commodity hardware that in turn utilizes commodity disks
      • the price point per terabyte is lower than that of almost any other technology [1]
  • {benefit} massive scalable IO capability
    • aggregate IO and network capacity is higher than that provided by dedicated storage arrays [1]
    • adding new servers to Hadoop adds storage, IO, CPU, and network capacity all at once [1]
      • ⇐ adding disks to a storage array might simply exacerbate a network or CPU bottleneck within the array [1]
  • {characteristic} reliability
    • enabled by fault-tolerant design
    • ability to replicate by MapReduce execution
      • ⇐ detects task failure on one node on the distributed system and restarts programs on other healthy nodes
    • data in Hadoop is stored redundantly in multiple servers and can be distributed across multiple computer racks [1] 
      • ⇐ failure of a server does not result in a loss of data [1]
        • ⇐ the job continues even if a server fails
          • ⇐ the processing switches to another server [1]
      • every piece of data is usually replicated across three nodes
        • ⇐ can be located on separate server racks to avoid any single point of failure [1]
  • {characteristic} scalable processing model
    • MapReduce represents a widely applicable and scalable distributed processing model
    • capable of brute-forcing acceptable performance for almost all algorithms [1]
      • not the most efficient implementation for all algorithms
  • {characteristic} schema on read
    • the imposition of structure can be delayed until the data is accessed
    • ⇐ as opposed to the schema on write mode 
    • ⇐ used by relational data warehouses
    • data can be loaded into Hadoop without having to be converted to a highly structured normalized format [1]
      • {advantage} data can be quickly ingest from the various forms [1]
        •  this is sometimes referred to as schema on read,  [1]
  • {architecture} Hadoop 1.0
    • mixed nodes
      • the majority of servers in a Hadoop cluster function both as data nodes and as task trackers [1]
        • each server supplies both data storage and processing capacity (CPU and memory) [1]
    • specialized nodes
      • job tracker node 
        • coordinates the scheduling of jobs run on the Hadoop cluster [1]
      • name node 
        • sort of directory that provides the mapping from blocks on data nodes to files on HDFS [1]
      • {disadvantage} architecture limited to MapReduce workloads [1]
      • {disadvantage} it provides limited flexibility with regard to scheduling and resource allocation [1]
  • {architecture} Hadoop 2.0 
    • layers on top of the Hadoop 1.0 architecture [1]
    • {concept} YARN (aka Yet Another Resource Negotiator)
      • improves scalability and flexibility by splitting the roles of the Task Tracker into two processes [1]
        • {process} Resource Manager 
          • controls access to the clusters resources (memory, CPU)
        • {process} Application Manager 
          • (one per job) controls task execution
    • treats traditional MapReduce as just one of the possible frameworks that can run on the cluster [1]
      • allows Hadoop to run tasks based on more complex processing models [1]
  • {concept} Distributed File System 
    • a protocol used for storage and replication of data [1]

Acronyms:
DFS - Distributed File System
DWH - Data Warehouse
HDFS - Hadoop Distributed File System
YARN - Yet Another Resource Negotiator 

References:
[1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data
[2] Microsoft Learn (2024) What is Apache Hadoop in Azure HDInsight? [link

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

27 March 2025

#️⃣Software Engineering: Programming (Part XVII: More Thoughts on AI)

Software Engineering Series
Software Engineering Series

I've been playing with AI-based prompting in Microsoft 365 and Edge Copilot for SQL programming tasks and even for simple requests I got wrong or suboptimal solutions. Some of the solutions weren’t wrong by far, though it was enough for the solution to not work at all or give curious answers. Some solutions were even more complex than needed, which made their troubleshooting more challenging, to the degree that was easier to rewrite the code by myself. Imagine when such wrong solutions and lines of reasoning propagate uncontrolled within broader chains of reasoning! 

Some of the answers we get from AI can be validated step by step, and the logic can be changed accordingly, though this provides no guarantee that the answers won't change as new data, information, knowledge is included in the models, or the model changes, directly or indirectly. In Software Development, there’s a minimum set of tests that can and should be performed to assure that the input generated matches the expectations, however in AI-based solutions there’s no guarantee that what worked before will continue to work.

Moreover, small errors can propagate in a chain-effect creating curious wrong solutions. AI acts and probably will continue to act as a Pandora's box. So, how much can we rely on AI, especially when the complexity of the problems and their ever-changing nature is confronted with a model highly sensitive to the initial or intermediary conditions? 

Some of the answers may make sense, and probably also the answers can be better to some degree than the decisions made by experts, though how far do we want to go? Who is ready to let his own life blindly driven by the answers provided by an AI machine just because it can handle certain facts better than us? Moreover, the human brain is wired to cope with uncertainty, morality and other important aspects that can enhance the quality of the decisions, even if the decisions aren't by far perfect

It’s important to understand the sensitivity of AI models and outputs to the initial and even intermediate conditions on which such models are based, respectively what is used in their reasoning and how slight changes can result in unexpected effects. Networks, independently whether they are or not AI-based, lead to behavior that can be explainable to some degree as long full transparency of the model and outcomes of the many iterations is provided. When AI models behave like black boxes there’s no guarantee of the outcomes, respectively transparence on the jumps made from one state of the network to the other, and surprises can appear more often than we expect or are prepared to accept. 

Some of the successes rooted in AI-based reasoning might happen just because in similar contexts people are not ready to trust their reasoning or take a leap of faith. AI tends to replace all these aspects that are part of human psychology, logic and whatever is part of the overall process. The eventual successes are thus not an immediate effect of the AI capabilities, but just that we took a shortcut. Unfortunately, this can act like a sharp blade with two edges. 

I want to believe that AI is the solution to humanity's problems, and probably there are many areas of applicability, though letting AI control our lives and the over-dependence on AI can on long term cause more problems than AI and out society can solve. The idea of AI acting as a Copilot that can be used to extrapolate beyond our capabilities is probably not wrong, though one should keep the risks and various outcomes in sight!

Previous Post <<||>> Next Post

🧭Business Intelligence: Perspectives (Part 29: Navigating into the Unknown)

Business Intelligence Series
Business Intelligence Series

One of the important challenges in Business Intelligence and the other related knowledge domains is that people try to oversell ideas, overstretching, shifting, mixing and bending the definition of concepts and their use to suit the sales pitch or other related purposes. Even if there are several methodologies built around data that attempt to provide a solid foundation on which organizations can build upon, terms like actionable, value, insight, quality or importance continue to be a matter of perception, interpretation, and quite often be misused. 

It's often challenging to define precisely such businesses concepts especially there are degrees of fuzziness that may apply to the different contexts that are associated with them. What makes a piece of signal, data, information or knowledge valuable, respectively actionable? What is the value, respectively values we associate with a piece or aggregation of information, insight or degree of quality? When do values, changes, variations and other aspects become important, respectively can be ignored? How much can one generalize or particularize certain aspects? And, many more such questions can be added to this line of inquiry. 

Just because an important value changed, no matter in what direction, it might mean nothing as long as the value moves in certain ranges, respectively other direct or indirect conditions are met or not. Sometimes, there are simple rules and models that can be used to identify the various areas that should trigger different responses, respectively actions, though even small variations can increase the overall complexity multifold. There seems to be certain comfort in numbers, even if the same numbers can mean different things to different people, at different points in time, respectively contexts.

In the pursuit to bridge the multitude of gaps and challenges, organization attempt to arrive at common definitions and understanding in what concerns the business terms, goals, objectives, metrics, rules, procedures, processes and other points of focus associated with the respective terms. Unfortunately, many such foundations barely support the edifices built as long as there’s no common mental models established!

Even if the use of shared models is not new, few organizations try to make the knowledge associated with them explicit, respectively agree on and evolve a set of mental models that reflect how the business works, what is important, respectively can be ignored, which are the dependent and independent aspects, etc. This effort can prove to be a challenge for many organizations, especially when several leaps of faith must be made in the process.

Independently on whether organizations use shared mental models, some kind of common ground must be achieved. It starts with open dialog, identifying the gaps, respectively the minimum volume of knowledge required for making progress in the right direction(s). The broader the gaps and the misalignment, the more iterations are needed to make progress! And, of course, one must know which are the destinations, what paths to follow, what to ignore, etc. 

It's important how we look at the business, and people tend to use different filters (aka glasses or hats) for this purpose. Simple relationships between the various facts are ideal, though uncommon. There’s a chain of causality that may trigger a certain change, though more likely one deals with a networked structure of cause-effect relationships. The world is more complex than we (can} imagine. We try to focus on the aspects we are aware of, respectively consider as important. However, in a complex world also small variations in certain areas can shift the overall weight to aspects outside of our focus, influence or area of responsibility. Quite often, what we don’t know is more important than what we know!

26 March 2025

💠🏭🗒️Microsoft Fabric: Polaris SQL Pool [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!

Unfortunately, besides the references papers, there's almost no material that could be used to enhance the understanding of the concepts presented. 

Last updated: 26-Mar-2025

Read and Write Operations in Polaris [2]

[Microsoft Fabric] Polaris SQL Pool

  • {def} distributed SQL query engine that powers Microsoft Fabric's data warehousing capabilities
    • designed to unify data warehousing and big data workloads while separating compute and state for seamless cloud-native operations
    • based on a robust DCP 
      • designed to execute read-only queries in a scalable, dynamic and fault-tolerant way [1]
      • a highly-available micro-service architecture with well-defined responsibilities [2]
        • data and query processing is packaged into units (aka tasks) 
          • can be readily moved across compute nodes and re-started at the task level
        • widely-partitioned data with a flexible distribution model [2]
        • a task-level "workflow-DAG" that is novel in spanning multiple queries [2]
        • a framework for fine-grained monitoring and flexible scheduling of tasks [2]
  • {component} SQL Server Front End (SQL-FE)
    • responsible for 
      • compilation
      • authorization
      • authentication
      • metadata
        • used by the compiler to 
          • {operation} generate the search space (aka MEMO) for incoming queries
          • {operation} bind metadata to data cells
          • leveraged to ensure the durability of the transaction manifests at commit [2]
            • only transactions that successfully commit need to be actively tracked to ensure consistency [2]
            • any manifests and data associated with aborted transactions are systematically garbage-collected from OneLake through specialized system tasks [2]
  • {component} SQL Server Backend (SQL-BE)
    • used to perform write operations on the LST [2]
      • inserting data into a LST creates a set of Parquet files that are then recorded in the transaction manifest [2]
      • a transaction is represented by a single manifest file that is modified concurrently by (one or more) SQL BEs [2]
        • SQL BE leverages the Block Blob API provided by ADLS to coordinate the concurrent writes  [2]
        • each SQL BE instance serializes the information about the actions it performed, either adding a Parquet file or removing it [2]
          • the serialized information is then uploaded as a block to the manifest file
          • uploading the block does not yet make any visible changes to the file [2]
            • each block is identified by a unique ID generated on the writing SQL BE [2]
        • after completion, each SQL BE returns the ID of the block(s) it wrote to the Polaris DCP [2]
          • the block IDs are then aggregated by the Polaris DCP and returned to the SQL FE as the result of the query [2]
      • the SQL FE further aggregates the block IDs and issues a Commit Block operation against storage with the aggregated block IDs [2]
        • at this point, the changes to the file on storage will become effective [2]
      • changes to the manifest file are not visible until the Commit operation on the SQL FE
        • the Polaris DCP can freely restart any part of the operation in case there is a failure in the node topology [2]
      • the IDs of any blocks written by previous attempts are not included in the final list of block IDs and are discarded by storage [2]
    • [read operations] SQL BE is responsible for reconstructing the table snapshot based on the set of manifest files managed in the SQL FE
      • the result is the set of Parquet data files and deletion vectors that represent the snapshot of the table [2]
        • queries over these are processed by the SQL Server query execution engine [2]
        • the reconstructed state is cached in memory and organized in such a way that the table state can be efficiently reconstructed as of any point in time [2]
          • enables the cache to be used by different operations operating on different snapshots of the table [2]
          • enables the cache to be incrementally updated as new transactions commit [2]
  • {feature} supports explicit user transactions
    • can execute multiple statements within the same transaction in a consistent way
      • the manifest file associated with the current transaction captures all the (reconciled) changes performed by the transaction [2]
        • changes performed by prior statements in the current transaction need to be visible to any subsequent statement inside the transaction (but not outside of the transaction) [2]
    • [multi-statement transactions] in addition to the committed set of manifest files, the SQL BE reads the manifest file of the current transaction and then overlays these changes on the committed manifests [1]
    • {write operations} the behavior of the SQL BE depends on the type of the operation.
      • insert operations 
        • only add new data and have no dependency on previous changes [2]
        • the SQL BE can serialize the metadata blocks holding information about the newly created data files just like before [2]
        • the SQL FE, instead of committing only the IDs of the blocks written by the current operation, will instead append them to the list of previously committed blocks
          • ⇐ effectively appends the data to the manifest file [2]
    • {update|delete operations} 
      • handled differently 
        • ⇐ since they can potentially further modify data already modified by a prior statement in the same transaction [2]
          • e.g. an update operation can be followed by another update operation touching the same rows
        • the final transaction manifest should not contain any information about the parts from the first update that were made obsolete by the second update [2]
      • SQL BE leverages the partition assignment from the Polaris DCP to perform a distributed rewrite of the transaction manifest to reconcile the actions of the current operation with the actions recorded by the previous operation [2]
        • the resulting block IDs are sent again to the SQL FE where the manifest file is committed using the (rewritten) block IDs [2]
  • {concept} Distributed Query Processor (DQP)
    • responsible for 
      • distributed query optimization
      • distributed query execution
      • query execution topology management
  • {concept} Workload Management (WLM)
    •  consists of a set of compute servers that are, simply, an abstraction of a host provided by the compute fabric, each with a dedicated set of resources (disk, CPU and memory) [2]
      • each compute server runs two micro-services
        • {service} Execution Service (ES) 
          • responsible for tracking the life span of tasks assigned to a compute container by the DQP [2]
        • {service} SQL Server instance
          • used as the back-bone for execution of the template query for a given task  [2]
            • ⇐ holds a cache on top of local SSDs 
              • in addition to in-memory caching of hot data
            • data can be transferred from one compute server to another
              • via dedicated data channels
              • the data channel is also used by the compute servers to send results to the SQL FE that returns the results to the user [2]
              • the life cycle of a query is tracked via control flow channels from the SQL FE to the DQP, and the DQP to the ES [2]
  • {concept} cell data abstraction
    • the key building block that enables to abstract data stores
      • abstracts DQP from the underlying store [1]
      • any dataset can be mapped to a collection of cells [1]
      • allows distributing query processing over data in diverse formats [1]
      • tailored for vectorized processing when the data is stored in columnar formats [1] 
      • further improves relational query performance
    • 2-dimenstional
      • distributions (data alignment)
      • partitions (data pruning)
    • each cell is self-contained with its own statistics [1]
      • used for both global and local QO [1]
      • cells can be grouped physically in storage [1]
      • queries can selectively reference either cell dimension or even individual cells depending on predicates and type of operations present in the query [1]
    • {concept} distributed query processing (DQP) framework
      • operates at the cell level 
      • agnostic to the details of the data within a cell
        • data extraction from a cell is the responsibility of the (single node) query execution engine, which is primarily SQL Server, and is extensible for new data types [1], [2]
  • {concept} dataset
    • logically abstracted as a collection of cells [1] 
    • can be arbitrarily assigned to compute nodes to achieve parallelism [1]
    • uniformly distributed across a large number of cells 
      • [scale-out processing] each dataset must be distributed across thousands of buckets or subsets of data objects,
      •  such that they can be processed in parallel across nodes
  • {concept} session
    • supports a spectrum of consumption models, ranging from serverless ad-hoc queries to long-standing pools or clusters [1]
    • all data are accessible from any session [1]
      • multiple sessions can access all underlying data concurrently  [1]
  • {concept} Physical Metadata layer
    • new layer introduced in the SQL Server storage engine [2]
See also: Polaris

References:
[1] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment PVLDB 13(12) [link]
[2] Josep Aguilar-Saborit et al (2024), Extending Polaris to Support Transactions [link]
[3] Gjnana P Duvvuri (2024) Microsoft Fabric Warehouse Deep Dive into Polaris Analytic Engine [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
[R2] Patrick Pichler (2023) Data Warehouse (Polaris) vs. Data Lakehouse (Spark) in Microsoft Fabric [link]
[R3] Tiago Balabuch (2023) Microsoft Fabric Data Warehouse - The Polaris engine [link]

Acronyms:
CPU - Central Processing Unit
DAG - Directed Acyclic Graph
DB - Database
DCP - Distributed Computation Platform 
DQP - Distributed Query Processing 
DWH - Data Warehouses 
ES - Execution Service
LST - Log-Structured Table
SQL BE - SQL Backend
SQL FE - SQL Frontend
SSD - Solid State Disk
WAL - Write-Ahead Log
WLM - Workload Management

🏭🗒️Microsoft Fabric: External Data Sharing [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 26-Mar-2025

External data sharing
External data sharing [1]

[Microsoft Fabric] External data sharing 

  • {def} feature that enables Fabric users to share data from their tenant with users in another Fabric tenant (aka cross-tenant sharing) [1]
    • the data is shared in-place from OneLake storage locations in the sharer's tenant [1]
      • ⇒ no data is actually copied to the other tenant [1]
      • creates a OneLake shortcut in the other tenant that points back to the original data in the sharer's tenant [1]
      • data is exposed as read-only [1]
      • data can be consumed by any OneLake compatible Fabric workload in that tenant [1]
    • {benefit} allows for efficient and secure data sharing without duplicating data
      • the shared data remains read-only for the consumer, ensuring data integrity and consistency [2]
      • multiple tables and folders can be shared at once [2]
  • {prerequisite} Fabric admins must turn on external data sharing both in the sharer's tenant and in the external tenant
    • by specifying who can create and accept external data shares [1]
    • users can share data residing in tables or files within supported Fabric items [1]
      • require standard Fabric read and reshare permissions for the item being shared [1]
      • the user creating the share invites a user from another tenant with a link to accept the external data share [1]
        • upon accepting the share, the recipient chooses a lakehouse where a shortcut to the shared data will be created [1]
        • the links work only for users in external tenants
          • for sharing data within the same OneLake storage accounts with users in the same tenant, use OneLake shortcuts [1]
        • {limitation} shortcuts contained in folders that are shared via external data sharing won't resolve in the consumer tenant [1]
    • access is enabled via a dedicated Fabric-to-Fabric authentication mechanism 
    • ⇐ doesn’t require Entra B2B guest user access [1]
  • {operation} create an external data share in the provider tenant)
    • external data shares can be created for tables or files in lakehouses and warehouses, and in KQL, SQL, mirrored databases [1]
    • {limitation} the sharer can't control who has access to the data in the consumer's tenant [1]
  • {operation} accept an external data share in consuming tenant)
    • only lakehouses can be chosen for the operation
    • the consumer can grant access to the data to anyone [1]
      • incl. guest users from outside the consumer's organization [1]
    • data can be transferred across geographic boundaries when it's accessed within the consumer's tenant [1]
  • {operation} revoke extern data shares
    • any user in the sharing tenant with read and reshare permissions on an externally shared item can revoke the external data share at any time [1]
      • via Manage permissions >> External data shares tab
      • can be performed of any item the user has read and reshare permissions on [3]
      • {warning}a revoked external data share can't be restored [3] 
        • irreversibly severs all access from the receiving tenant to the shared data [3]
        • a new external data share can be created instead [3]
  • applies to
    • lakehouse
      • an entire lakehouse schema can be shared [2]
        • shares all the tables in the schema [2]
        • any changes to the schema are immediately reflected in the consumer’s lakehouse [2]
    • mirrored database
    • KQL database
    • OneLake catalog
  • can be consumed via 
    • Spark workloads
      • notebooks or Spark
    • lakehouse SQL Analytics Endpoint
    • semantic models
    • ⇐ data can be shared from a provider and consumed in-place via SQL queries or in a Power BI report [2]
  • {feature} external data sharing APIs
    • support service principals for admin and user operations
      • can be used to automate the creation or management of shares [2]
        • supports service principals or managed identity [2]
    • {planned} data warehouse support
      • share schemas or tables from a data warehouse [2]
    • {planned} shortcut sharing
      • share OneLake shortcuts using external data sharing [2]
      • ⇒ data residing outside of Fabric will be externally shareable
        • S3, ADLS Gen2, or any other supported shortcut location [2]
    • {planned} consumer APIs
      • consumer activities, including viewing share details and accepting shares, will soon be available via API [2]

References:
[1] Microsoft Learn (2024) External data sharing in Microsoft Fabric [link]
[2] Microsoft Fabric Updates Blog (2025) External data sharing enhancements out now [link]
[3] Microsoft Learn (2024) Fabric: Manage external data shares [link]

Resources:
[R1] Analytics on Azure Blog (2024) External Data Sharing With Microsoft Fabric [link]
[R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
ADLS - Azure Data Lake Storage
API - Application Programming Interface
B2B - business-to-business
KQL - Kusto Query Language

25 March 2025

🏭🗒️Microsoft Fabric: Security in Warehouse [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 25-Mar-2024

[Microsoft Fabric] Security in Warehouse
  • {def} suite of technologies aimed at safeguarding sensitive information in Fabric [1]
    • leverages SQL engine’s security features [1]
      • allows for security mechanism at the warehouse level [1]
      • ⇐ the warehouse and SQL analytics endpoint items also allow for the defining of native SQL security [4]
        • the permissions configured only apply to the queries executed against the respective surfaces [4]
      • the access to OneLake data is controlled separately through OneLake data access roles [4]
        • {recommendation}don’t include users with SQL specific permissions in a OneLake data access role [4] 
          •  allows to ensure users don't see data they don't have SQL access to [5r] 
    • supports a range of data protection features that enable administrators to shield sensitive data from unauthorized access [1]
      • ⇐ across warehouses and SQL analytics endpoints without necessitating changes to applications [1]
    • {type} object-level security (OLS)
      • permissions governing DML operations [1]
        • applies to tables and views
        • ⇐ when denied, the user will be prevented from performing the respective operation
        • SELECT
          • allows users to view the data within the object [1]
        • INSERT
          • allows users to insert data in the object [1]
        • UPDATE
          • allows users to update data within the object [1]
        • DELETE
          • allows users to delete the data within the object [1]
      • permissions can be granted, revoked or denied on database objects [1]
        •  tables and views
        • GRANT
          • permission is granted to user or role [1]
        • DENY
          • permissions is denied to user or role [1]
        • REVOKE
          • permissions is revoked to user or role [1]
        • ALTER
          • grants the user the ability to change the definition of the object [1]
        • CONTROL
          • grants the user all rights to the object [1]
      • {principle} least privilege
        • users and applications should only be given the permissions needed in order for them to complete the task
    • {type} column-level security (CLS)
      • allows to restrict column access to sensitive data [1]
        • provides granular control over who can access specific pieces of data [1]
          •  enhances the overall security of the data warehouse [1]
      • steps
        • identify the sensitive columns [1]
        • define access roles [1]
        • assign roles to users [1]
        • implement access control [1]
          • restrict access to ta column based on user's role [1]
    • {type} row-level security (RLS)
      • provides granular control over access to rows in a table based on group membership or execution context [1]
        • using WHERE clause filters [1]
      • works by associating a function (aka security predicate) with a table [1]
        • defined to return true or false based on certain conditions [1]
          • ⇐ typically involving the values of one or more columns in the table [1]
          • when a user attempts to access data in the table, the security predicate function is invoked [1]
            • if the function returns true, the row is accessible to the user; otherwise, the row doesn't show up in the query results [1]
        • the predicate can be as simple/complex as required [1]
        • the process is transparent to the user and is enforced automatically by SQL Server
          • ⇐ ensures consistent application of security rules [1]
      • implemented in two main steps:
        • filter predicates 
          • an inline table-valued function that filters the results based on the predicate defined [1]
        • security policy
          • invokes an inline table-valued function to protect access to the rows in a table [1]
            • because access control is configured and applied at the warehouse level, application changes are minimal - if any [1]
            • users can directly have access to the tables and can query their own data [1]
      • {recommendation} create a separate schema for predicate functions and security policies [1]
      • {recommendation} avoid type conversions in predicate functions [1]
      • {recommendation}avoid using excessive table joins and recursion in predicate functions [1]
        • allows to maximize performance,  
    • {type} dynamic data masking (DDM) 
      • allows to limits data exposure to nonprivileged users by obscuring sensitive data [1]
        • e.g. email addresses 
      • {benefit} enhance the security and manageability of the data [1]
      • {feature} real-time masking
        • when querying sensitive data, DDM applies dynamic masking to it in real time [1]
          • the actual data is never exposed to unauthorized users, thus enhancing the security of your data [1]
        • straightforward to implement [1]
        • doesn’t require complex coding, making it accessible for users of all skill levels [1]
        • {benefit} the data in the database isn’t changed when DDM is applied
          •   the actual data remains intact and secure, while nonprivileged users only see a masked version of the data [1]
      • {operation} define masking rule
        • set up at column level [1]
        • offers a suite of features [1]
          • comprehensive and partial masking capabilities [1]
          • supports several masking types
            • help prevent unauthorized viewing of sensitive data [1]
              • by enabling administrators to specify how much sensitive data to reveal [1]
                •   minimal effect on the application layer [1]
            • applied to query results, so the data in the database isn't changed 
              •   allows many applications to mask sensitive data without modifying existing queries  [1]
          • random masking function designed for numeric data [1]
        • {risk} unprivileged users with query permissions can infer the actual data since the data isn’t physically obfuscated [1]
      • {recommendation} DDM should be used as part of a comprehensive data security strategy [1]
        • should include
          • the proper management of object-level security with SQL granular permissions [1]
          • adherence to the principle of minimal required permissions [1]
    • {concept} Dynamic SQL 
      • allows T-SQL statements to be generated within a stored procedure or a query itself [1]
        • executed via sp_executesql stored procedure
      • {risk} SQL injection attacks
        • use  QUOTENAME to sanitize inputs [1]
  • write access to a warehouse or SQL analytics endpoint
    • {approach} granted through the Fabric workspace roles
      • the role automatically translates to a corresponding role in SQL that grants equivalent write access [4]
      • {recommendation} if a user needs write access to all warehouses and endpoints, assign the user to a workspace role [4]
        • use the Contributor role unless the user needs to assign other users to workspace roles [4]
      • {recommendation} grant direct access through SQL permissions if the user only needs to write to specific warehouses or endpoints [4]
    • {approach} grant read access to the SQL engine, and grant custom SQL permissions to write to some or all the data [4]
  • write access to a warehouse or SQL analytics endpoint
    • {approach} grant read access through the ReadData permission, granted as part of the Fabric workspace roles [4]
      •  ReadData permission maps the user to a SQL role that gives SELECT permissions on all tables in the warehouse or lakehouse
        • helpful if the user needs to see all or most of the data in the lakehouse or warehouse [4]
        • any SQL DENY permissions set on a particular lakehouse or warehouse still apply and limit access to tables [4]
        • row and column level security can be set on tables to restrict access at a granular level [4]
    • {approach} grant read access to the SQL engine, and grant custom SQL permissions to read to some or all the data [4]
    • if the user needs access only to a specific lakehouse or warehouse, the share feature provides access to only the shared item [4]
      • during the share, users can choose to give only Read permission or Read + ReadData 
        • granting Read permission allows the user to connect to the warehouse or SQL analytics endpoint but gives no table access [4]
        • granting users the ReadData permissions gives them full read access to all tables in the warehouse or SQL analytics endpoint
      • ⇐ additional SQL security can be configured to grant or deny access to specific tables [4]

    References:
    [1] Microsoft Learn (2024) Secure a Microsoft Fabric data warehouse [link]
    [2] Data Mozart (2025) Lock Up! Understanding Data Access Options in Microsoft Fabric, by Nikola Ilic [link]
    [3] Microsoft Learn (2024) Security in Microsoft Fabric [link]
    [4] Microsoft Learn (2024) Microsoft Fabric: How to secure a lakehouse for Data Warehousing teams [link]

    Resources:
    [R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
    [R2] Microsoft Learn (2025) Fabric: Security for data warehousing in Microsoft Fabric [link]
    [R3] Microsoft Learn (2025) Fabric: Share your data and manage permissions [link]

    Acronyms:
    CLS - Column-Level Security
    DDM - Dynamic Data Masking
    DML - Data Manipulation Language 
    MF - Microsoft Fabric
    OLS - Object-Level Security
    RLS - Row-Level Security
    SQL - Structured Query Language
    Related Posts Plugin for WordPress, Blogger...

    About Me

    My photo
    Koeln, NRW, Germany
    IT Professional with more than 25 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.