
28 March 2025

🏭🗒️Microsoft Fabric: Hadoop [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 28-Mar-2025

[Microsoft Fabric] Hadoop

  • Apache software library
    • backend technology that makes storing data and running large-scale parallel computations possible
    • open-source framework 
    • widely adopted 
    • implements its own distributed file system (HDFS)
      • enables applications to scale to petabytes of data employing commodity hardware
    • based on MapReduce API 
      • software framework for writing jobs that process vast amounts of data [2] and enables work parallelization
      • {function} Mapper
        • consumes input data, analyzes it, and emits tuples (aka key-value pairs) [2]
        • ⇐ the analysis usually involves filtering and sorting operations [2]
      • {function} Reducer
        • consumes tuples emitted by the Mapper and performs a summary operation that creates a smaller, combined result from the Mapper data [2]
  • {benefit} economical, scalable storage model
    • can run on commodity hardware that in turn utilizes commodity disks
      • the price point per terabyte is lower than that of almost any other technology [1]
  • {benefit} massively scalable IO capability
    • aggregate IO and network capacity is higher than that provided by dedicated storage arrays [1]
    • adding new servers to Hadoop adds storage, IO, CPU, and network capacity all at once [1]
      • ⇐ adding disks to a storage array might simply exacerbate a network or CPU bottleneck within the array [1]
  • {characteristic} reliability
    • enabled by fault-tolerant design
    • fault-tolerant MapReduce execution
      • ⇐ detects task failure on one node of the distributed system and restarts programs on other healthy nodes
    • data in Hadoop is stored redundantly in multiple servers and can be distributed across multiple computer racks [1] 
      • ⇐ failure of a server does not result in a loss of data [1]
        • ⇐ the job continues even if a server fails
          • ⇐ the processing switches to another server [1]
      • every piece of data is usually replicated across three nodes
        • ⇐ can be located on separate server racks to avoid any single point of failure [1]
  • {characteristic} scalable processing model
    • MapReduce represents a widely applicable and scalable distributed processing model
    • capable of brute-forcing acceptable performance for almost all algorithms [1]
      • not the most efficient implementation for all algorithms
  • {characteristic} schema on read
    • the imposition of structure can be delayed until the data is accessed
    • ⇐ as opposed to the schema-on-write model 
    • ⇐ used by relational data warehouses
    • data can be loaded into Hadoop without having to be converted to a highly structured normalized format [1]
      • {advantage} data can be quickly ingested in its various forms [1]
        • ⇐ this is sometimes referred to as 'schema on read' [1]
  • {architecture} Hadoop 1.0
    • mixed nodes
      • the majority of servers in a Hadoop cluster function both as data nodes and as task trackers [1]
        • each server supplies both data storage and processing capacity (CPU and memory) [1]
    • specialized nodes
      • job tracker node 
        • coordinates the scheduling of jobs run on the Hadoop cluster [1]
      • name node 
        • a sort of directory that provides the mapping from blocks on data nodes to files on HDFS [1]
    • {disadvantage} architecture limited to MapReduce workloads [1]
    • {disadvantage} it provides limited flexibility with regard to scheduling and resource allocation [1]
  • {architecture} Hadoop 2.0 
    • layers on top of the Hadoop 1.0 architecture [1]
    • {concept} YARN (aka Yet Another Resource Negotiator)
      • improves scalability and flexibility by splitting the roles of the Job Tracker into two processes [1]
        • {process} Resource Manager 
          • controls access to the cluster's resources (memory, CPU)
        • {process} Application Master 
          • (one per job) controls task execution
    • treats traditional MapReduce as just one of the possible frameworks that can run on the cluster [1]
      • allows Hadoop to run tasks based on more complex processing models [1]
  • {concept} Distributed File System 
    • a protocol used for storage and replication of data [1]

Acronyms:
DFS - Distributed File System
DWH - Data Warehouse
HDFS - Hadoop Distributed File System
YARN - Yet Another Resource Negotiator 

References:
[1] Guy Harrison (2015) Next Generation Databases: NoSQL, NewSQL, and Big Data
[2] Microsoft Learn (2024) What is Apache Hadoop in Azure HDInsight? [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

26 March 2025

💠🏭🗒️Microsoft Fabric: Polaris SQL Pool [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!

Unfortunately, besides the referenced papers, there's almost no material that could be used to enhance the understanding of the concepts presented. 

Last updated: 26-Mar-2025

Read and Write Operations in Polaris [2]

[Microsoft Fabric] Polaris SQL Pool

  • {def} distributed SQL query engine that powers Microsoft Fabric's data warehousing capabilities
    • designed to unify data warehousing and big data workloads while separating compute and state for seamless cloud-native operations
    • based on a robust DCP 
      • designed to execute read-only queries in a scalable, dynamic and fault-tolerant way [1]
      • a highly-available micro-service architecture with well-defined responsibilities [2]
        • data and query processing is packaged into units (aka tasks) 
          • can be readily moved across compute nodes and re-started at the task level
        • widely-partitioned data with a flexible distribution model [2]
        • a task-level "workflow-DAG" that is novel in spanning multiple queries [2]
        • a framework for fine-grained monitoring and flexible scheduling of tasks [2]
  • {component} SQL Server Front End (SQL-FE)
    • responsible for 
      • compilation
      • authorization
      • authentication
      • metadata
        • used by the compiler to 
          • {operation} generate the search space (aka MEMO) for incoming queries
          • {operation} bind metadata to data cells
          • leveraged to ensure the durability of the transaction manifests at commit [2]
            • only transactions that successfully commit need to be actively tracked to ensure consistency [2]
            • any manifests and data associated with aborted transactions are systematically garbage-collected from OneLake through specialized system tasks [2]
  • {component} SQL Server Backend (SQL-BE)
    • used to perform write operations on the LST [2]
      • inserting data into a LST creates a set of Parquet files that are then recorded in the transaction manifest [2]
      • a transaction is represented by a single manifest file that is modified concurrently by (one or more) SQL BEs [2]
        • SQL BE leverages the Block Blob API provided by ADLS to coordinate the concurrent writes  [2]
        • each SQL BE instance serializes the information about the actions it performed, either adding a Parquet file or removing it [2]
          • the serialized information is then uploaded as a block to the manifest file
          • uploading the block does not yet make any visible changes to the file [2]
            • each block is identified by a unique ID generated on the writing SQL BE [2]
        • after completion, each SQL BE returns the ID of the block(s) it wrote to the Polaris DCP [2]
          • the block IDs are then aggregated by the Polaris DCP and returned to the SQL FE as the result of the query [2]
      • the SQL FE further aggregates the block IDs and issues a Commit Block operation against storage with the aggregated block IDs [2]
        • at this point, the changes to the file on storage will become effective [2]
      • changes to the manifest file are not visible until the Commit operation on the SQL FE
        • the Polaris DCP can freely restart any part of the operation in case there is a failure in the node topology [2]
      • the IDs of any blocks written by previous attempts are not included in the final list of block IDs and are discarded by storage [2]
    • [read operations] SQL BE is responsible for reconstructing the table snapshot based on the set of manifest files managed in the SQL FE
      • the result is the set of Parquet data files and deletion vectors that represent the snapshot of the table [2]
        • queries over these are processed by the SQL Server query execution engine [2]
        • the reconstructed state is cached in memory and organized in such a way that the table state can be efficiently reconstructed as of any point in time [2]
          • enables the cache to be used by different operations operating on different snapshots of the table [2]
          • enables the cache to be incrementally updated as new transactions commit [2]
  • {feature} supports explicit user transactions (see the sketch below)
    • can execute multiple statements within the same transaction in a consistent way
      • the manifest file associated with the current transaction captures all the (reconciled) changes performed by the transaction [2]
        • changes performed by prior statements in the current transaction need to be visible to any subsequent statement inside the transaction (but not outside of the transaction) [2]
    • [multi-statement transactions] in addition to the committed set of manifest files, the SQL BE reads the manifest file of the current transaction and then overlays these changes on the committed manifests [1]
    • {write operations} the behavior of the SQL BE depends on the type of the operation.
      • insert operations 
        • only add new data and have no dependency on previous changes [2]
        • the SQL BE can serialize the metadata blocks holding information about the newly created data files just like before [2]
        • the SQL FE, instead of committing only the IDs of the blocks written by the current operation, will instead append them to the list of previously committed blocks
          • ⇐ effectively appends the data to the manifest file [2]
    • {update|delete operations} 
      • handled differently 
        • ⇐ since they can potentially further modify data already modified by a prior statement in the same transaction [2]
          • e.g. an update operation can be followed by another update operation touching the same rows
        • the final transaction manifest should not contain any information about the parts from the first update that were made obsolete by the second update [2]
      • SQL BE leverages the partition assignment from the Polaris DCP to perform a distributed rewrite of the transaction manifest to reconcile the actions of the current operation with the actions recorded by the previous operation [2]
        • the resulting block IDs are sent again to the SQL FE where the manifest file is committed using the (rewritten) block IDs [2]
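
From the user's perspective, this behavior surfaces as an ordinary multi-statement T-SQL transaction against the warehouse. A minimal sketch (the table dbo.Sales and its columns are hypothetical):

    BEGIN TRANSACTION;
        INSERT INTO dbo.Sales (SaleId, Amount) VALUES (1, 100.00);
        -- a later statement in the same transaction sees the prior insert
        UPDATE dbo.Sales SET Amount = 150.00 WHERE SaleId = 1;
        -- changes are not visible to other sessions until commit
    COMMIT TRANSACTION;  -- the reconciled transaction manifest becomes effective
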
  • {concept} Distributed Query Processor (DQP)
    • responsible for 
      • distributed query optimization
      • distributed query execution
      • query execution topology management
  • {concept} Workload Management (WLM)
    •  consists of a set of compute servers that are, simply, an abstraction of a host provided by the compute fabric, each with a dedicated set of resources (disk, CPU and memory) [2]
      • each compute server runs two micro-services
        • {service} Execution Service (ES) 
          • responsible for tracking the life span of tasks assigned to a compute container by the DQP [2]
        • {service} SQL Server instance
          • used as the back-bone for execution of the template query for a given task  [2]
            • ⇐ holds a cache on top of local SSDs 
              • in addition to in-memory caching of hot data
            • data can be transferred from one compute server to another
              • via dedicated data channels
              • the data channel is also used by the compute servers to send results to the SQL FE that returns the results to the user [2]
              • the life cycle of a query is tracked via control flow channels from the SQL FE to the DQP, and the DQP to the ES [2]
  • {concept} cell data abstraction
    • the key building block that enables to abstract data stores
      • abstracts DQP from the underlying store [1]
      • any dataset can be mapped to a collection of cells [1]
      • allows distributing query processing over data in diverse formats [1]
      • tailored for vectorized processing when the data is stored in columnar formats [1] 
      • further improves relational query performance
    • 2-dimensional
      • distributions (data alignment)
      • partitions (data pruning)
    • each cell is self-contained with its own statistics [1]
      • used for both global and local QO [1]
      • cells can be grouped physically in storage [1]
      • queries can selectively reference either cell dimension or even individual cells depending on predicates and type of operations present in the query [1]
    • {concept} distributed query processing (DQP) framework
      • operates at the cell level 
      • agnostic to the details of the data within a cell
        • data extraction from a cell is the responsibility of the (single node) query execution engine, which is primarily SQL Server, and is extensible for new data types [1], [2]
  • {concept} dataset
    • logically abstracted as a collection of cells [1] 
    • can be arbitrarily assigned to compute nodes to achieve parallelism [1]
    • uniformly distributed across a large number of cells 
      • [scale-out processing] each dataset must be distributed across thousands of buckets or subsets of data objects, such that they can be processed in parallel across nodes
  • {concept} session
    • supports a spectrum of consumption models, ranging from serverless ad-hoc queries to long-standing pools or clusters [1]
    • all data are accessible from any session [1]
      • multiple sessions can access all underlying data concurrently  [1]
  • {concept} Physical Metadata layer
    • new layer introduced in the SQL Server storage engine [2]
See also: Polaris

References:
[1] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment PVLDB 13(12) [link]
[2] Josep Aguilar-Saborit et al (2024), Extending Polaris to Support Transactions [link]
[3] Gjnana P Duvvuri (2024) Microsoft Fabric Warehouse Deep Dive into Polaris Analytic Engine [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
[R2] Patrick Pichler (2023) Data Warehouse (Polaris) vs. Data Lakehouse (Spark) in Microsoft Fabric [link]
[R3] Tiago Balabuch (2023) Microsoft Fabric Data Warehouse - The Polaris engine [link]

Acronyms:
CPU - Central Processing Unit
DAG - Directed Acyclic Graph
DB - Database
DCP - Distributed Computation Platform 
DQP - Distributed Query Processing 
DWH - Data Warehouses 
ES - Execution Service
LST - Log-Structured Table
SQL BE - SQL Backend
SQL FE - SQL Frontend
SSD - Solid State Disk
WAL - Write-Ahead Log
WLM - Workload Management

🏭🗒️Microsoft Fabric: External Data Sharing [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 26-Mar-2025

External data sharing [1]

[Microsoft Fabric] External data sharing 

  • {def} feature that enables Fabric users to share data from their tenant with users in another Fabric tenant (aka cross-tenant sharing) [1]
    • the data is shared in-place from OneLake storage locations in the sharer's tenant [1]
      • ⇒ no data is actually copied to the other tenant [1]
      • creates a OneLake shortcut in the other tenant that points back to the original data in the sharer's tenant [1]
      • data is exposed as read-only [1]
      • data can be consumed by any OneLake compatible Fabric workload in that tenant [1]
    • {benefit} allows for efficient and secure data sharing without duplicating data
      • the shared data remains read-only for the consumer, ensuring data integrity and consistency [2]
      • multiple tables and folders can be shared at once [2]
  • {prerequisite} Fabric admins must turn on external data sharing both in the sharer's tenant and in the external tenant
    • by specifying who can create and accept external data shares [1]
    • users can share data residing in tables or files within supported Fabric items [1]
      • require standard Fabric read and reshare permissions for the item being shared [1]
      • the user creating the share invites a user from another tenant with a link to accept the external data share [1]
        • upon accepting the share, the recipient chooses a lakehouse where a shortcut to the shared data will be created [1]
        • the links work only for users in external tenants
          • for sharing data within the same OneLake storage accounts with users in the same tenant, use OneLake shortcuts [1]
        • {limitation} shortcuts contained in folders that are shared via external data sharing won't resolve in the consumer tenant [1]
    • access is enabled via a dedicated Fabric-to-Fabric authentication mechanism 
    • ⇐ doesn’t require Entra B2B guest user access [1]
  • {operation} create an external data share (in the provider tenant)
    • external data shares can be created for tables or files in lakehouses and warehouses, and in KQL, SQL, mirrored databases [1]
    • {limitation} the sharer can't control who has access to the data in the consumer's tenant [1]
  • {operation} accept an external data share (in the consuming tenant)
    • only lakehouses can be chosen for the operation
    • the consumer can grant access to the data to anyone [1]
      • incl. guest users from outside the consumer's organization [1]
    • data can be transferred across geographic boundaries when it's accessed within the consumer's tenant [1]
  • {operation} revoke external data shares
    • any user in the sharing tenant with read and reshare permissions on an externally shared item can revoke the external data share at any time [1]
      • via Manage permissions >> External data shares tab
      • can be performed of any item the user has read and reshare permissions on [3]
      • {warning} a revoked external data share can't be restored [3] 
        • irreversibly severs all access from the receiving tenant to the shared data [3]
        • a new external data share can be created instead [3]
  • applies to
    • lakehouse
      • an entire lakehouse schema can be shared [2]
        • shares all the tables in the schema [2]
        • any changes to the schema are immediately reflected in the consumer’s lakehouse [2]
    • mirrored database
    • KQL database
    • OneLake catalog
  • can be consumed via 
    • Spark workloads
      • notebooks or Spark job definitions
    • lakehouse SQL Analytics Endpoint
    • semantic models
    • ⇐ data can be shared from a provider and consumed in-place via SQL queries or in a Power BI report [2]
  • {feature} external data sharing APIs
    • support service principals for admin and user operations
      • can be used to automate the creation or management of shares [2]
        • supports service principals or managed identity [2]
    • {planned} data warehouse support
      • share schemas or tables from a data warehouse [2]
    • {planned} shortcut sharing
      • share OneLake shortcuts using external data sharing [2]
      • ⇒ data residing outside of Fabric will be externally shareable
        • S3, ADLS Gen2, or any other supported shortcut location [2]
    • {planned} consumer APIs
      • consumer activities, including viewing share details and accepting shares, will soon be available via API [2]

References:
[1] Microsoft Learn (2024) External data sharing in Microsoft Fabric [link]
[2] Microsoft Fabric Updates Blog (2025) External data sharing enhancements out now [link]
[3] Microsoft Learn (2024) Fabric: Manage external data shares [link]

Resources:
[R1] Analytics on Azure Blog (2024) External Data Sharing With Microsoft Fabric [link]
[R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
ADLS - Azure Data Lake Storage
API - Application Programming Interface
B2B - business-to-business
KQL - Kusto Query Language

25 March 2025

🏭🗒️Microsoft Fabric: Security in Warehouse [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 25-Mar-2025

[Microsoft Fabric] Security in Warehouse
  • {def} suite of technologies aimed at safeguarding sensitive information in Fabric [1]
    • leverages SQL engine’s security features [1]
      • allows for security mechanism at the warehouse level [1]
      • ⇐ the warehouse and SQL analytics endpoint items also allow for the defining of native SQL security [4]
        • the permissions configured only apply to the queries executed against the respective surfaces [4]
      • the access to OneLake data is controlled separately through OneLake data access roles [4]
        • {recommendation} to ensure users with SQL specific permissions don't see data they don't have SQL access to, don't include those users in a OneLake data access role [4]
    • supports a range of data protection features that enable administrators to shield sensitive data from unauthorized access [1]
      • ⇐ across warehouses and SQL analytics endpoints without necessitating changes to applications [1]
    • {type} object-level security (OLS)
      • permissions governing DML operations [1]
        • applies to tables and views
        • ⇐ when denied, the user will be prevented from performing the respective operation
        • SELECT
          • allows users to view the data within the object [1]
        • INSERT
          • allows users to insert data in the object [1]
        • UPDATE
          • allows users to update data within the object [1]
        • DELETE
          • allows users to delete the data within the object [1]
      • permissions can be granted, revoked, or denied on database objects [1] (see the sketch below)
        •  tables and views
        • GRANT
          • permission is granted to user or role [1]
        • DENY
          • permission is denied to a user or role [1]
        • REVOKE
          • permission is revoked from a user or role [1]
        • ALTER
          • grants the user the ability to change the definition of the object [1]
        • CONTROL
          • grants the user all rights to the object [1]
      • {principle} least privilege
        • users and applications should only be given the permissions needed for them to complete their tasks
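
A minimal T-SQL sketch of these object-level permissions (the table dbo.SalesOrders and the principals [SalesAnalyst] and [WarehouseAdmin] are hypothetical):

    -- allow the user to read the table, following the least-privilege principle
    GRANT SELECT ON dbo.SalesOrders TO [SalesAnalyst];
    -- explicitly block data modifications on the same object
    DENY INSERT, UPDATE, DELETE ON dbo.SalesOrders TO [SalesAnalyst];
    -- remove a previously granted or denied permission
    REVOKE DELETE ON dbo.SalesOrders FROM [SalesAnalyst];
    -- CONTROL grants all rights on the object, so it should be used sparingly
    GRANT CONTROL ON dbo.SalesOrders TO [WarehouseAdmin];
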
    • {type} column-level security (CLS)
      • allows restricting access to columns holding sensitive data [1]
        • provides granular control over who can access specific pieces of data [1]
          •  enhances the overall security of the data warehouse [1]
      • steps
        • identify the sensitive columns [1]
        • define access roles [1]
        • assign roles to users [1]
        • implement access control [1]
          • restrict access to a column based on the user's role [1] (see the sketch below)
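
A hedged T-SQL sketch of these steps (the table dbo.Customers, its columns, and the user [SalesAnalyst] are hypothetical; CreditCardNumber stands in for the sensitive column):

    -- grant access only to the non-sensitive columns; CreditCardNumber is omitted
    GRANT SELECT ON dbo.Customers (CustomerID, CustomerName, Region) TO [SalesAnalyst];
    -- SELECT * or SELECT CreditCardNumber now fails for [SalesAnalyst] with a permission error
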
    • {type} row-level security (RLS)
      • provides granular control over access to rows in a table based on group membership or execution context [1]
        • using WHERE clause filters [1]
      • works by associating a function (aka security predicate) with a table [1]
        • defined to return true or false based on certain conditions [1]
          • ⇐ typically involving the values of one or more columns in the table [1]
          • when a user attempts to access data in the table, the security predicate function is invoked [1]
            • if the function returns true, the row is accessible to the user; otherwise, the row doesn't show up in the query results [1]
        • the predicate can be as simple/complex as required [1]
        • the process is transparent to the user and is enforced automatically by SQL Server
          • ⇐ ensures consistent application of security rules [1]
      • implemented in two main steps (sketched below):
        • filter predicates 
          • an inline table-valued function that filters the results based on the predicate defined [1]
        • security policy
          • invokes an inline table-valued function to protect access to the rows in a table [1]
            • because access control is configured and applied at the warehouse level, application changes are minimal - if any [1]
            • users can directly have access to the tables and can query their own data [1]
      • {recommendation} create a separate schema for predicate functions and security policies [1]
      • {recommendation} avoid type conversions in predicate functions [1]
      • {recommendation} to maximize performance, avoid using excessive table joins and recursion in predicate functions [1]
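
A minimal sketch of the two steps (the table dbo.SalesOrders and its SalesRep column, holding the owning user's name, are hypothetical):

    -- 1) filter predicate: an inline table-valued function evaluated per row
    CREATE SCHEMA Security;
    GO
    CREATE FUNCTION Security.tvf_SalesRepPredicate (@SalesRep AS varchar(128))
    RETURNS TABLE
    WITH SCHEMABINDING
    AS
    RETURN SELECT 1 AS predicate_result WHERE @SalesRep = USER_NAME();
    GO
    -- 2) security policy: binds the predicate to the table and enforces it on every query
    CREATE SECURITY POLICY Security.SalesOrdersFilter
    ADD FILTER PREDICATE Security.tvf_SalesRepPredicate(SalesRep) ON dbo.SalesOrders
    WITH (STATE = ON);
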
    • {type} dynamic data masking (DDM) 
      • limits data exposure to nonprivileged users by obscuring sensitive data [1]
        • e.g. email addresses 
      • {benefit} enhance the security and manageability of the data [1]
      • {feature} real-time masking
        • when querying sensitive data, DDM applies dynamic masking to it in real time [1]
          • the actual data is never exposed to unauthorized users, thus enhancing the security of your data [1]
        • straightforward to implement [1]
        • doesn’t require complex coding, making it accessible for users of all skill levels [1]
        • {benefit} the data in the database isn’t changed when DDM is applied
          •   the actual data remains intact and secure, while nonprivileged users only see a masked version of the data [1]
      • {operation} define masking rule (see the sketch below)
        • set up at column level [1]
        • offers a suite of features [1]
          • comprehensive and partial masking capabilities [1]
          • supports several masking types
            • help prevent unauthorized viewing of sensitive data [1]
              • by enabling administrators to specify how much sensitive data to reveal [1]
                •   minimal effect on the application layer [1]
            • applied to query results, so the data in the database isn't changed 
              •   allows many applications to mask sensitive data without modifying existing queries  [1]
          • random masking function designed for numeric data [1]
        • {risk} unprivileged users with query permissions can infer the actual data since the data isn’t physically obfuscated [1]
      • {recommendation} DDM should be used as part of a comprehensive data security strategy [1]
        • should include
          • the proper management of object-level security with SQL granular permissions [1]
          • adherence to the principle of minimal required permissions [1]
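
A hedged sketch of masking rules at the column level (the table dbo.Customers, its columns, and the [FinanceAuditor] principal are hypothetical):

    -- built-in function for email addresses
    ALTER TABLE dbo.Customers
        ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');
    -- partial masking: expose the first two characters, mask the rest
    ALTER TABLE dbo.Customers
        ALTER COLUMN PhoneNumber ADD MASKED WITH (FUNCTION = 'partial(2, "XXXXXXX", 0)');
    -- random masking for numeric data
    ALTER TABLE dbo.Customers
        ALTER COLUMN CreditLimit ADD MASKED WITH (FUNCTION = 'random(1, 9999)');
    -- privileged users can be exempted from masking
    GRANT UNMASK TO [FinanceAuditor];
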
    • {concept} Dynamic SQL 
      • allows T-SQL statements to be generated within a stored procedure or a query itself [1]
        • executed via sp_executesql stored procedure
      • {risk} SQL injection attacks
        • use QUOTENAME to sanitize inputs [1] (see the sketch below)
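
A minimal sketch of parameterized dynamic SQL with a sanitized identifier (the table name and the Amount column are hypothetical):

    DECLARE @TableName sysname = N'SalesOrders';
    DECLARE @MinAmount decimal(10, 2) = 100.00;
    DECLARE @sql nvarchar(max);
    -- QUOTENAME guards the identifier; sp_executesql parameterizes the value
    SET @sql = N'SELECT * FROM dbo.' + QUOTENAME(@TableName)
             + N' WHERE Amount >= @MinAmount;';
    EXEC sp_executesql @sql, N'@MinAmount decimal(10, 2)', @MinAmount = @MinAmount;
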
  • write access to a warehouse or SQL analytics endpoint
    • {approach} granted through the Fabric workspace roles
      • the role automatically translates to a corresponding role in SQL that grants equivalent write access [4]
      • {recommendation} if a user needs write access to all warehouses and endpoints, assign the user to a workspace role [4]
        • use the Contributor role unless the user needs to assign other users to workspace roles [4]
      • {recommendation} grant direct access through SQL permissions if the user only needs to write to specific warehouses or endpoints [4]
    • {approach} grant read access to the SQL engine, and grant custom SQL permissions to write to some or all the data [4]
  • read access to a warehouse or SQL analytics endpoint
    • {approach} grant read access through the ReadData permission, granted as part of the Fabric workspace roles [4]
      •  ReadData permission maps the user to a SQL role that gives SELECT permissions on all tables in the warehouse or lakehouse
        • helpful if the user needs to see all or most of the data in the lakehouse or warehouse [4]
        • any SQL DENY permissions set on a particular lakehouse or warehouse still apply and limit access to tables [4]
        • row and column level security can be set on tables to restrict access at a granular level [4]
    • {approach} grant read access to the SQL engine, and grant custom SQL permissions to read to some or all the data [4]
    • if the user needs access only to a specific lakehouse or warehouse, the share feature provides access to only the shared item [4]
      • during the share, users can choose to give only Read permission or Read + ReadData 
        • granting Read permission allows the user to connect to the warehouse or SQL analytics endpoint but gives no table access [4]
        • granting users the ReadData permissions gives them full read access to all tables in the warehouse or SQL analytics endpoint
      • ⇐ additional SQL security can be configured to grant or deny access to specific tables [4]

    References:
    [1] Microsoft Learn (2024) Secure a Microsoft Fabric data warehouse [link]
    [2] Data Mozart (2025) Lock Up! Understanding Data Access Options in Microsoft Fabric, by Nikola Ilic [link]
    [3] Microsoft Learn (2024) Security in Microsoft Fabric [link]
    [4] Microsoft Learn (2024) Microsoft Fabric: How to secure a lakehouse for Data Warehousing teams [link]

    Resources:
    [R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
    [R2] Microsoft Learn (2025) Fabric: Security for data warehousing in Microsoft Fabric [link]
    [R3] Microsoft Learn (2025) Fabric: Share your data and manage permissions [link]

    Acronyms:
    CLS - Column-Level Security
    DDM - Dynamic Data Masking
    DML - Data Manipulation Language 
    MF - Microsoft Fabric
    OLS - Object-Level Security
    RLS - Row-Level Security
    SQL - Structured Query Language

    🏭🗒️Microsoft Fabric: Security [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 25-Mar-2025

    Microsoft Fabric Security [2]
    [Microsoft Fabric] Security
    • {def} a comprehensive security framework designed for the Microsoft Fabric platform [1]
      • {goal} always on 
        • every interaction with Fabric is encrypted by default and authenticated using Microsoft Entra ID [1]
          • all communication between Fabric experiences travels through the Microsoft backbone internet [1]
          • data at rest is automatically stored encrypted [1]
          • support for extra security features [1]
            • ⇐ allow to regulate access to Fabric [1]
            • Private Links 
              • enable secure connectivity to Fabric by 
                • restricting access to the Fabric tenant from an Azure virtual network (VNet)
                • blocking all public access
              • ensures that only network traffic from that VNet is allowed to access Fabric features [1]
            • Entra Conditional Access 
          • the connection to data is protected by a firewall or a private network using trusted access [1]
            • access firewall enabled ADL Gen2 accounts securely [1]
              • can be limited to specific workspaces [1]
                • workspaces that have a workspace identity can securely access ADL Gen 2 accounts with public network access enabled, from selected virtual networks and IP addresses [1]
              • workspace identities can only be created in workspaces associated with a Fabric F SKU capacity [1]
          • helps users connect to services quickly and easily from any device and any network [1]
            • each request to connect to Fabric is authenticated with Microsoft Entra ID [1]
              • allows users to safely connect to Fabric from their corporate office, when working at home, or from a remote location [1]
          • {feature} Conditional Access
            • allows to secure access to Fabric on every connection by
              • defining a list of IPs for inbound connectivity to Fabric [1]
              • using MFA [1]
              • restricting traffic based on parameters such as country of origin or device type [1]
      • {goal} compliant
        • data sovereignty provided out-of-box with multi geo capacities [1]
        • support for a wide range of compliance standards [1]
        • Fabric services follow the SDL
          • a set of strict security practices that support security assurance and compliance requirements [2]
          • helps developers build more secure software by reducing the number and severity of vulnerabilities in software, while reducing development cost [2]
      • {goal} governable
        • leverages a set of governance tools
          • data lineage
          • information protection labels
          • data loss prevention 
          • Purview integration 
      • configurable
        •  in accordance with organizational policies [1]
      • evolving 
        • new features and controls are added regularly [1]
    • {feature} managed private endpoints 
      • allow secure connections to data sources without exposing them to the public network or requiring complex network configurations [1]
        • e.g. as Azure SQL databases
    • {feature} managed virtual networks
      • virtual networks that are created and managed by Microsoft Fabric for each Fabric workspace [1]
      • provide network isolation for Fabric Spark workloads
        • the compute clusters are deployed in a dedicated network and are no longer part of the shared virtual network [1]
      • enable network security features
        • managed private endpoints
        • private link support
    • {feature} data gateway
      • allows to connect to on-premises data sources or a data source that might be protected by a firewall or a virtual network
      • {option} On-premises data gateway
        • acts as a bridge between on-premises data sources and Fabric [1]
        • installed on a server within the network [1]
        • allows Fabric to connect to data sources through a secure channel without the need to open ports or make changes to the network [1]
      • {option} Virtual network (VNet) data gateway
        • allows to connect from Microsoft Cloud services to Azure data services within a VNet, without the need of an on-premises data gateway [1]
    • {feature} Azure service tags
      • allows to ingest data from data sources deployed in an Azure virtual network without the use of data gateways [1]
        • e.g. VMs, Azure SQL MI and REST APIs
      • can be used to get traffic from a virtual network or an Azure firewall
        • e.g. outbound traffic to Fabric so that a user on a VM can connect to Fabric SQL connection strings from SSMS, while blocked from accessing other public internet resources [1]
    • {feature} IP allow-lists
      • allows enabling an IP allow-list on the organization's network to allow traffic to and from Fabric
      • useful for data sources that don't support service tags [1]
        • e.g. on-premises data sources
    • {feature} Telemetry
      • used to maintain performance and reliability of the Fabric platform [2]
      • the telemetry store is designed to be compliant with data and privacy regulations for customers in all regions where Fabric is available [2]
    • {process} authentication
      • relies on Microsoft Entra ID to authenticate users (or service principals) [2]
      • when authenticated, users receive access tokens from Microsoft Entra ID [2]
        • used to perform operations in the context of the user [2]
      • {feature} conditional access
        • ensures that tenants are secure by enforcing multifactor authentication [2]
          • allows only Microsoft Intune enrolled devices to access specific services [1] 
        • restricts user locations and IP ranges.
    • {process} authorization
      • all Fabric permissions are stored centrally by the metadata platform
        • Fabric services query the metadata platform on demand to retrieve authorization information and to authorize and validate user requests [2]
      • authorization information is sometimes encapsulated into signed tokens [2]
        • only issued by the back-end capacity platform [1]
        • include the access token, authorization information, and other metadata [1]
    • {concept} tenant metadata 
      • information about the tenant 
      • is stored in a metadata platform cluster to which the tenant is assigned
        • located in a single region that meets the data residency requirements of that region's geography [2]
        • include customer data 
        • customers can control where their workspaces are located
          • in the same geography as the metadata platform cluster
            • by explicitly assigning workspaces on capacities in that region [2]
            • by implicitly using Fabric Trial, Power BI Pro, or Power BI Premium Per User license mode [2]
              • all customer data is stored and processed in this single geography [2]
          • in Multi-Geo capacities located in geographies (geos) other than their home region [2]
            • compute and storage is located in the multi-geo region [2]
              • (including OneLake and experience-specific storage) [2]
            • {exception} the tenant metadata remains in the home region
            • customer data will only be stored and processed in these two geographies [2]
    • {concept} data-at-rest
      • all Fabric data stores are encrypted at rest [2]
        • by using Microsoft-managed keys
        • includes customer data as well as system data and metadata [2]
        •  data is never persisted to permanent storage while in an unencrypted state [1]
          • data can be processed in memory in an unencrypted state [2]
      • {default} encrypted using platform managed keys (PMK)
        • Microsoft is responsible for all aspects of key management [2]
        • data-at-rest on OneLake is encrypted using its keys [3]
        • {alternative} Customer-managed keys (CMK) 
          • allow to encrypt data at-rest using customer keys [3]
            •   customer assumes full control of the key [3]
          • {recommendation} use cloud storage services with CMK encryption enabled and access data from Fabric using OneLake shortcuts [3]
            • data continues to reside on a cloud storage service or an external storage solution where encryption at rest using CMK is enabled [3]
            • customers can perform in-place read operations from Fabric whilst staying compliant [3] 
            • shortcuts can be accessed by other Fabric experiences [3]
    • {concept} data-in-transit
      • refers to traffic between Microsoft services routed over the Microsoft global network [2]
      • inbound communication
        • always encrypted with at least TLS 1.2. Fabric negotiates to TLS 1.3 whenever possible [2]
        • inbound protection
          •  concerned with how users sign in and have access to Fabric [3]
      • outbound communication to customer-owned infrastructure 
        • adheres to secure protocols [2]
          • {exception} might fall back to older, insecure protocols when newer protocols aren't supported [2]
            • incl. TLS 1.0
        • outbound protection
          • concerned with securely accessing data behind firewalls or private endpoints [3]


    References:
    [1] Microsoft Learn (2024) Security in Microsoft Fabric [link]
    [2] Microsoft Learn (2024) Microsoft Fabric security fundamentals [link]
    [3] Microsoft Learn (2024) Microsoft Fabric end-to-end security scenario [link]

    Resources:
    [R1] Microsoft Learn (2024) Microsoft Fabric security [link]
    [R2] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    Acronyms:
    ADL - Azure Data Lake
    API - Application Programming Interface
    CMK - Customer-Managed Keys
    MF - Microsoft Fabric
    MFA - Multifactor Authentication 
    MI - Managed Instance 
    PMK - Platform-Managed Keys
    REST - REpresentational State Transfer
    SDL - Security Development Lifecycle
    SKU - Stock Keeping Unit
    TLS  - Transport Layer Security
    VM - Virtual Machine
    VNet - virtual network
    VPN - Virtual Private Network

    23 March 2025

    💫🗒️ERP Systems: Microsoft Dynamics 365's Financial Tags [Notes]

    Disclaimer: This is work in progress intended to consolidate information from the various sources and not to provide a complete overview of all the features. Please refer to the documentation for a complete overview!

    Last updated: 23-Mar-2025

    [Dynamics 365] Financial Tags

    • {def} user-defined metadata elements used to track additional information on accounting entries for analytics or process purposes
      • provide an additional layer of metadata
      • {objective} eliminate the need to use document numbers, descriptions, or financial dimensions [1]
        • stored on the accounting entries that are created for the transactions [1]
      • {benefit} improved accuracy 
        • ensure each transaction is linked with the correct accounting and auditing elements, enhancing the accuracy in financial reporting and compliance [8]
      • {benefit} streamlined processes 
        • by automating the categorization of financial transactions, financial tags enable a more efficient invoicing process [8]
      • {benefit} better financial tracking 
        •  allow for granular tracking of expenses and revenues, enabling more detailed financial analysis [8]
      • shown as separate columns on voucher transactions and similar GL inquiry forms 
      • legal entity specific
      • can be shared by using the Shared data feature [3]
      • designed to support any amount of reuse
      • do not default from master data
        • {feature|planned} defaulting will be enabled through user-defined rules
      • similar to financial dimensions
        • an alternative to creating financial dimensions
        • ⇐ financial dimensions are structured (account structures, account rules, validation) and designed for medium to high reuse 
        • the two are most likely mutually exclusive
        • every transaction that supports dimensions will eventually support financial tags 
      • unstructured 
        • no structure, no rules, no validation
      • require a delimiter between the tag values
        • via General ledger parameters >> Financial tags
        • it can be deactivated but not deleted 
          • ⇐ helps ensure that the tag values remain available for reporting on posted general ledger entries
          • ⇐ tags can easily be activated and deactivated at any time
      • the label of each financial tag can be changed at any time, even after transactions are posted
        • if transactions have been posted for a specific financial tag, the tag values don't change
      • tag values
        • are associated with an accounting entry
        • can be reused 
        • have header to line defaulting
        • are stored as simple text 
        • do not reference other data 
        • are not validated at any time, including during entry and posting
        • can be entered or edited at any time prior to posting 
        • can be changed at any time after posting 
          • by enabling "Allow edits to internal data on general ledger vouchers" feature
      • up to 20 financial tags can be defined
        • e.g. Customers, Vendors, Projects, PO numbers, Payment references
        • each is 100 characters [1]
    • {type} text 
      • free text with no lookup 
    • {type} custom list
      • free text with lookup 
    • {type} list
      • predefined list of many common types of data with lookup 
        • list values are also not validated
    • supported by
      • general journals
      • customer and vendor payment journals, including entities 
    • {operation} editing
      • values can be entered or edited at any time prior to posting 
      • values can be changed at any time after posting 
        • by enabling "Allow edits to internal data on general ledger vouchers" feature
    • can be disabled at any time [1]
      • any values that were entered for financial tags on transactions will be maintained in the database [1]
        • values will no longer be visible on any transactions or in inquiries [1]
    • journals and transactions that support tags
      • [10.0.32] introduced
      • [10.0.37] [1]
        • general journal, including entities 
        • global general journal
        • allocation journal
        • fixed asset journal
        • all asset leasing journals
        • periodic journal
        • reporting currency adjustment journal
        • customer payment journal, including entities 
        • vendor payment journal, including entities 
        • invoice journal (vendor)
        • global invoice journal (vendor)
        • invoice register
        • SO documents 
          • Sales order, packing slip and customer invoice
          • {feature} "Enable financial tags for sales order invoicing"
        • voucher transactions and Transactions for [account] forms 
        • general journal account entry reporting entity 
        • ledger settlement (manual settlement)
      • [10.0.41|PP] PO documents
        • {feature} "Enable financial tags for purchase order invoicing"
    • {feature} [10.0.42] financial tag rules 
      • allow to enter default value or automatically populate values in financial tags [7]
      • {benefit} ensure consistency and efficiency in transaction tagging [7]
        • ⇐ essential for accurate financial tracking and reporting [7]
      • journals support [7]
        • general journal
        • global general journal
        • allocation journal
        • reporting currency adjustment journal
        • invoice journal (vendor)
      • {operation} Create a financial tag rule
        • via General ledger >> Chart of accounts >> Financial tags >> Financial tags >> New >>
      • {operation} Copy a financial tag rule within legal entity
        • copies a rule that is defined for one transaction entry point to another entry point in the same legal entity [7]
      • {operation} Copy a financial tag to other legal entity
        • copies rules to any legal entity where financial tags are defined and active. Select one or more rules to copy to another legal entity [7]
    • {feature} rule-based defaulting engine for financial tags 
      • e.g. default the vendor name to financial tag XX 
    • {feature} financial tag defaulting rules
    • {feature} evaluate storing financial tags directly on subledger data 
      • e.g. store financial tag values in the bank subledger to use with advanced bank reconciliation matching rules

    References:
    [1] Microsoft Learn (2025) Dynamics 365 Finance: Financial tags [link]
    [2] Microsoft Learn (2025) Dynamics 365 Finance: Differences between financial tags and financial dimensions [link]
    [3] Microsoft Learn (2025) Dynamics 365 Finance: Microsoft Learn (2022) Financial dimensions [link]
    [4] Dynamics on Demand (2025) Financial Tags in Microsoft Dynamics 365 Finance | 10.0.32 [link]
    [5] Ramit Paul (2025) Financial Tags in Microsoft Dynamics 365 Finance and Operations [link]
    [6] Microsoft Learn (2025) Dynamics 365 Finance: Financial tag rule reference (preview) [link]
    [7] Microsoft Learn (2025) Dynamics 365 Finance: Financial tag rules (preview) [link]
    [8] Dynamics Global Edge IT Solutions (2024) Financial Tags For Purchase Order Invoicing In MS Dynamics365 F&O [link]

    Resources:
    [R1] Dynamics365lab (2024) Ep. 120:4 Exploring Financial Tags in Dynamics 365 F&O [link]
    [R2] Nextone Consulting (2024) New Feature: Financial Tag Rules in Dynamics 365 SCM 10.0.42 [link]
    [R3] Dynamics on Demand (2024) Financial Tags in Microsoft Dynamics 365 Finance | 10.0.32 [link]
    [R4] Axcademy (2023) Is this the end to Financial dimensions in D365FO as we know them? [link]
    [R5] HItachi Solutions (2024) New Feature in Dynamics 365 Finance - Financial Tags [link]

    Acronyms:
    D365 F&O - Dynamics 365 for Finance and Operations
    GL - General Ledger
    GA - General Availability
    LE - Legal Entity
    PO - Purchase Order
    PP - Public Preview
    SO - Sales Order

    22 March 2025

    💠🛠️🗒️SQL Server: Statistics Issues [Notes]

    Disclaimer: This is work in progress based on notes gathered over the years, intended to consolidate information from the various sources. The content needs yet to be reviewed against the current documentation.

    Last updated: 22-Mar-2025

    [SQL Server 2005] Statistics

    • {def} objects that contain statistical information about the distribution of values in one or more columns of a table or indexed view [2] [see notes]
    • {issue} inaccurate statistics (aka bad statistics)
      • the primary source for cardinality estimation errors [6]
      • guessed selectivity can lead to a less-than-optimal query plan
      • {cause} missing statistics
        • if an error or other event prevents the query optimizer from creating statistics, the query optimizer creates the query plan without using statistics [22] 
          • the query optimizer marks the statistics as missing and attempts to regenerate the statistics the next time the query is executed [22]
        • [SSMS] indicated as warnings when the execution plan of a query is graphically displayed [22]
      • [SQL Server Profiler] monitoring the Missing Column Statistics event class indicates when statistics are missing [22]
      • troubleshooting (see the sketch below)
        • {step} verify that AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS are on [22]
        • verify that the database is not read-only [22]
          • the query optimizer cannot save statistics in read-only databases [22]
        • create the missing statistics by using the CREATE STATISTICS statement [22]
      • {cause} use of local variables
      • {cause} non-constant-foldable expressions
      • {cause} complex expressions
      • {cause} ascending key columns
        • can cause inaccurate statistics in tables with frequent INSERTs because new values all lie outside the histogram [3]
      • it is not necessary for statistics to be highly accurate to be effective [18]
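
    A hedged T-SQL sketch of the troubleshooting steps above: check the automatic statistics settings for the current database, then create the missing statistics explicitly (the table dbo.Orders and its OrderDate column are hypothetical):

        -- verify AUTO_CREATE_STATISTICS / AUTO_UPDATE_STATISTICS and the read-only flag
        SELECT name, is_auto_create_stats_on, is_auto_update_stats_on, is_read_only
        FROM sys.databases
        WHERE name = DB_NAME();
        -- create the missing column statistics manually
        CREATE STATISTICS stat_Orders_OrderDate
        ON dbo.Orders (OrderDate)
        WITH FULLSCAN;
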
    • {issue} stale statistics (aka outdated statistics)
      • statistics are outdated based on the number of modifications of the statistics (and index) columns [6]
      • SQL Server checks to see if the statistics are outdated when it looks up a plan from the cache [6]
        •  it recompiles the query if they are [6]
          • ⇐ triggers a statistics update [6]
      •  [temporary tables] can increase the number of recompilations triggered by outdated statistics [6]
      • [query hint] KEEPFIXED PLAN
        • prevents query recompilation in cases of outdated statistics [6]
          • queries are recompiled only when the schemas of the underlying tables are changed or the recompilation is forced
          • e.g. when a stored procedure is called using the WITH RECOMPILE clause [6]
      • can be identified based on (see the sketch below)
        • STATS_DATE(object_id, stats_id) returns the last update date
        • rowmodctr from sys.sysindexes
      • has negative impact on performance 
        • almost always outweighs the performance benefits of avoiding automatic statistics updates/creation [2]
          • ⇐ applies also to missing statistics [2]
          • auto update/create stats should be disabled only when thorough testing has shown that there's no other way to achieve the performance or scalability you require [2]
          • without correct and up-to-date statistical information, can be challenging to determine the best execution plan for a particular query [8]
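
    A hedged sketch of the identification queries (on newer versions the modification counter is also exposed via the sys.dm_db_stats_properties DMF; the table dbo.Orders is hypothetical):

        -- last update date and number of modifications since, per statistics object
        SELECT s.name AS stats_name,
               STATS_DATE(s.object_id, s.stats_id) AS last_updated,
               sp.modification_counter
        FROM sys.stats AS s
        CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
        WHERE s.object_id = OBJECT_ID(N'dbo.Orders');
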
    • {issue} missing statistics
      • {symptom} excessive number of logical reads 
        • often indicates suboptimal execution plans 
          • due to missing indexes and/or suboptimal join strategies selected 
            • because of incorrect cardinality estimation [6]
        • shouldn't be used as the only criteria during optimization 
          • further factors should be considered
            • resource usage, parallelism, related operators in the execution plan [6]
    • {issue} unused statistics 
      • a table can end up having many statistics that serve no purpose
        • it makes sense to review and clean up the statistics as part of general maintenance
          • based on the thresholds for the automatic update
    • {issue} skewed statistics 
      • may happen when the sample rate doesn’t reflect the characteristics of values’ distribution
        • {solution} using a higher sample rate 
          • update statistics manually with fullscan
    • {issue} statistics accumulate over time 
      • statistics objects are persistent and will remain in the database forever until explicitly dropped
        • this becomes expensive over time 
      • {solution} drop all auto stats (see the sketch below)
        • will place some temporary stress on the system [11]
        • all queries that adhere to a certain pattern requiring statistics to be created will wait [11]
        • typically in a matter of minutes most of the missing statistics will be already back in place and the temporary stress will be over [11]
        • {tip} the cleanup can be performed at off-peak hours; the database can then be 'warmed up' by capturing and replaying a trace of the most common read-only queries [11]
          • creates many of the required statistics objects without impacting the 'real' workload [11]
        • {downside} statements may deadlock on schema locks with concurrent sessions [11]
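
    A hedged sketch of the cleanup: auto-created column statistics are flagged in sys.stats (and carry the _WA_Sys_ name prefix), so DROP STATISTICS statements can be generated for them; test on a non-production copy first:

        -- generate DROP STATISTICS statements for auto-created column statistics
        SELECT 'DROP STATISTICS '
               + QUOTENAME(OBJECT_SCHEMA_NAME(s.object_id)) + '.'
               + QUOTENAME(OBJECT_NAME(s.object_id)) + '.'
               + QUOTENAME(s.name) + ';'
        FROM sys.stats AS s
        WHERE s.auto_created = 1;
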
    • {issue} overlapping statistics 
      • column statistics created by SQL Server based on queries that get executed can overlap with statistics for indexes created by users on the same column [2] 


    References:
    [2] Ken Henderson (2003) Guru's Guide to SQL Server Architecture and Internals
    [3] Kendal Van Dyke (2010) More On Overlapping Statistics [link]
    [6] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.
    [18] Joe Chang’s Blog (2012) Decoding STATS_STREAM, by Joe Chang
    [22] MSDN (2016) Statistics [link]

    Resources:

    Acronyms:
    SSMS - SQL Server Management Studio
      

    💠🛠️🗒️SQL Server: Indexed Views [Notes]

    Disclaimer: This is work in progress based on notes gathered over the years, intended to consolidate information from the various sources. The content needs yet to be reviewed against the current documentation. 

    Last updated: 22-Mar-2025

    [SQL Server 2005] Indexed View

    • {def} a materialized view
      • materializes the data from the view queries, storing it in the database in a way similar to tables [6]
        • ⇒ its definition is computed and the resulting data stored just like a table [3]
        • the view is indexed by creating a unique clustered index on it (see the sketch at the end of these notes)
          • the resulting structure is physically identical to a table with a clustered index
            • ⇐ nonclustered indexes also are supported on this structure
        • can be created on a partitioned table, respectively can be partitioned [1]
      • {benefit} can improve the performance of some types of queries [3]
        • e.g. queries that aggregate many rows
        • ⇐ because the view is stored the same way a table with a clustered index is stored [1]
        • ⇐ not well-suited for underlying data that are frequently updated [3]
        •  more expensive to use and maintain than filtered indexes [5]
      • [query optimizer] 
        • can use it to speed up the query execution [1]
          • the view doesn't have to be referenced in the query for the optimizer to consider it for a substitution [1]
          • {downside} DML query performance can degrade significantly [1]
            • ⇐ in some cases, a query plan can't even be produced [1]
            • when executing UPDATE, DELETE or INSERT on the base table referenced, the indexed views must be updated as well [1]
            • {recommendation} test DML queries before production use [1]
              • analyze the query plan and tune/simplify the DML statement [1]
        • can use the structure to return results more efficiently to the user
          • contains logic to use this index in either of the cases 
            • the original query text referenced the view explicitly [2]
            • the user submits a query that uses the same components as the view (in any equivalent order) [2]
            • ⇐ the query processor expands indexed views early in the query pipeline and always uses the same matching code for both cases [2]
              • the WITH(NOEXPAND) hint tells the query processor not to expand the view definition [2]
              • also instructs the query processor to perform an index scan of the indexed view rather than expand it into its component parts [5]
              • any extra rows in the indexed view are reported as 8907 errors [5]
              • any missing rows are reported as 8908 errors [5]
        • expose some of the benefits of view materialization while retaining the benefits of global reasoning about query operations [2]
        • views are expanded (aka in-lined) before optimization begins
          • gives the Query Optimizer opportunities to optimize queries globally [2]
          • makes it difficult for the (query) optimizer to consider plans that perform the view evaluation first, then process the rest of the query [2]
            • arbitrary tree matching is a computationally complex problem, and the feature set of views is too large to perform this operation efficiently [2]
        • cases in which it does not match the view
          • indexed views are inserted into the Memo and evaluated against other plan choices
            • while they are often the best plan choice, this is not always the case [2]
            • the Query Optimizer can detect logical contradictions between the view definition and the query that references the view [2]
          • there are also some cases where the Query Optimizer does not recognize an indexed view even when it would be a good plan choice [2]
            • often, these cases deal with complex interactions between high-level features within the query processor (e.g. computed column matching, the algorithm to explore join orders) [2]
            • consider the WITH (NOEXPAND) hint to force the query processor to pick that indexed view [2]
              •  this usually is enough to get the plan to include the indexed view [2]
          • indexed view alternatives 
            • are generated and stored in the Memo 
            • are compared using costing equations against other possible plans
            • partial matches cost the residual operations as well
              • an indexed-view plan can be generated but not picked when the Query Optimizer considers other plans to have lower costs [2]
          • maintained as part of the update processing for tables on which the view is based
            • this makes sure that the view provides a consistent result if it is selected by the Query Optimizer for any query plan [2]
            • some query operations are incompatible with this design guarantee
              • restrictions are placed on the set of supported constructs in indexed views to make sure that the view can be created, matched, and updated efficiently [2]
          • {operation} updating indexed views
            • the core question behind the restrictions is “Can the query processor compute the necessary changes to the Indexed View clustered and nonclustered indexes without having to recompute the whole indexed view?” [2]
              • if so, the query processor can perform these changes efficiently as part of the maintenance of the base tables that are referenced in the view[2]
                • this property is relatively easy for filters, projections (compute scalar), and inner joins on keys[2]
                • operators that destroy or create data are more difficult to maintain, so often these are restricted from use in indexed views. [2]
          • matching indexed views is supported in cases beyond exact matches of the query text to the view definition [2]
            • it also supports using an indexed view for inexact matches where the definition of the view is broader than the query submitted by the user [2]
              • then applies residual filters, projections (columns in the select list), and even aggregates to use the view as a partial precomputation of the query result [2]
      • {concept} statistics on indexed views
        • normally statistics aren't needed
          • because the substitution of the indexed views into the query plan is considered only after all the statistics for the underlying tables and indexes are attached to the query plan [3]
          • ⇐ statistics on the view are used only if the view is directly referenced with the NOEXPAND hint in a FROM clause 
            • an error is generated and the plan is not created if the NOEXPAND hint is used on a view that does not also contain an index [3]
        • can’t be created by using sp_createstats or updated by using sp_updatestats. 
        • the auto create and auto update statistics features don't work for indexed views
          • ⇒ statistics must be created manually (see the statistics sketch after this list)
            • via CREATE STATISTICS on the indexed view columns
            • via UPDATE STATISTICS to update column or index statistics on indexed views
      • {operation} creating a view
        • requires that the underlying object’s schema can’t change
        • requires WITH SCHEMABINDING option [5]
        • ⇒ must include the two-part names of all referenced tables [5]
        • ⇐ the tables can't be dropped and their columns can't be altered while they participate in the view unless the view is dropped [5]
        • ⇐ otherwise an error is raised [5]
        • the user must hold 
          • the CREATE VIEW permission in the database [1]
          • ALTER permission on the schema in which the view is being created [1]
          • if the base table resides within a different schema, the REFERENCES permission on the table is required as a minimum [1]
          • if the user creating the index differs from the users who created the view, for the index creation alone the ALTER permission on the view is required [1]
      • {operation} creating an index on the view (see the sketch after this list)
        • indexes can only be created on views that have the same owner as the referenced table or tables (aka intact ownership chain between the view and the tables) [1]
      • {operation} dropping a view
        • causes all indexes on the view to be dropped [1]
          • ⇐ all nonclustered indexes and auto-created statistics on the view are dropped when the clustered index is dropped [1]
            • {exception} user-created statistics on the view are maintained [1]
        • nonclustered indexes can be individually dropped [1]
        • dropping the clustered index on the view 
          • removes the stored result set [1]
          • the optimizer returns to processing the view like a standard view [1]
      • {operation} disable indexes on tables and views
        • when a clustered index on a table is disabled, indexes on views associated with the table are also disabled [1]
      • {option} EXPAND VIEWS
        • can be used to prevent the Database Engine from using indexed views [1]
          • if any of the listed options are incorrectly set, this option prevents the optimizer from using the indexes on the views [1]
          • via the OPTION (EXPAND VIEWS) query hint (see the query hint examples after this list)
      • {recommendation} when using datetime and smalldatetime string literals in indexed views, explicitly convert the literal to the date type by using a deterministic date format style [1]
      • {limitation} AVG is not allowed {workaround} use SUM and COUNT_BIG [5]
      • {limitation} impacted by SET options [1]
        • {restriction} require fixed values for several SET options [1]
        • {recommendation} set the ARITHABORT user option to ON server-wide as soon as the first indexed view or index on a computed column is created in any database on the server [1]
      • {limitation} further requirements apply (see [1])
      • {limitation} aren't supported on top of temporal queries
        • ⇐ queries that use the FOR SYSTEM_TIME clause
      • {scenario} simplifying SQL queries
      • {scenario} abstracting data models from user models
      • {scenario} enforcing user security
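
    A minimal sketch of creating an indexed view, assuming a hypothetical dbo.OrderLines table with non-nullable OrderID and Amount columns (names are illustrative):

      -- the view must be schema-bound and reference its tables by two-part names;
      -- COUNT_BIG(*) is required when GROUP BY is used, and AVG can later be derived
      -- from SUM and COUNT_BIG (the workaround for the AVG limitation above)
      CREATE VIEW dbo.vOrderTotals
      WITH SCHEMABINDING
      AS
      SELECT OrderID,
             SUM(Amount)  AS TotalAmount,
             COUNT_BIG(*) AS LineCount
      FROM dbo.OrderLines
      GROUP BY OrderID;
      GO

      -- the unique clustered index materializes the view; nonclustered indexes can be added afterwards
      CREATE UNIQUE CLUSTERED INDEX IX_vOrderTotals
          ON dbo.vOrderTotals (OrderID);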
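
    The NOEXPAND and EXPAND VIEWS hints discussed above, shown against the same hypothetical view:

      -- force the query processor to use the indexed view as stored, without expanding its definition
      SELECT OrderID, TotalAmount
      FROM dbo.vOrderTotals WITH (NOEXPAND)
      WHERE OrderID = 42;

      -- the opposite: expand the view definition and ignore its indexes for this query
      SELECT OrderID, TotalAmount
      FROM dbo.vOrderTotals
      WHERE OrderID = 42
      OPTION (EXPAND VIEWS);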
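
    Statistics on the indexed view have to be created and updated manually, e.g.:

      -- create column statistics on the indexed view and refresh all its statistics
      CREATE STATISTICS st_vOrderTotals_TotalAmount
          ON dbo.vOrderTotals (TotalAmount);

      UPDATE STATISTICS dbo.vOrderTotals;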


    References:
    [1] Microsoft Learn (2024) SQL Server: Create indexed views [link]
    [2] Kalen Delaney et al. (2009) Microsoft® SQL Server® 2008 Internals
    [3] Microsoft Learn (2024) SQL Server: Views [link]
    [4] Microsoft Learn (2024) SQL Server: CREATE INDEX (Transact-SQL) [link]
    [5] Kalen Delaney et al. (2012) Microsoft® SQL Server® 2012 Internals
    [6] Dmitri Korotkevitch (2016) Pro SQL Server Internals 2nd Ed.

    Resources:
    [R1] Microsoft Learn (2024) SQL Server: Optimize index maintenance to improve query performance and reduce resource consumption [link]

    Acronyms:
    DML - Data Manipulation Language
    QO - Query Optimizer