
21 June 2025

🏭🗒️Microsoft Fabric: Result Set Caching in SQL Analytics Endpoints [Notes] 🆕

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 21-Jun-2025

[Microsoft Fabric] Result Set Caching in SQL Analytics Endpoints

  • {def} built-in performance optimization for Warehouse and Lakehouse that improves read latency [1]
    • fully transparent to the user [3]
    • persists the final result sets for applicable SELECT T-SQL queries
      • caches all the data accessed by a query [3]
      • subsequent runs that "hit" cache will process just the final result set
          • can bypass complex compilation and data processing of the original query [1]
            • ⇐ returns subsequent queries faster [1]
      • cache creation and reuse are applied opportunistically for queries
    • works on
      • warehouse tables
      • shortcuts to OneLake sources
      • shortcuts to non-Azure sources
    • the management of cache is handled automatically [1]
      • regularly evicts cache as needed
    • as data changes, result consistency is ensured by invalidating cache created earlier [1]
  • {operation} enable setting
    • via ALTER DATABASE <database_name> SET RESULT_SET_CACHING ON
  • {operation} validate setting
    • via SELECT name, is_result_set_caching_on FROM sys.databases
  • {operation} configure setting
    • configurable at item level
      • once enabled, it can then be disabled 
        • at the item level
        • for individual queries
          • e.g. debugging or A/B testing a query
        • via OPTION (USE HINT ('DISABLE_RESULT_SET_CACHE')) (see the T-SQL sketch after this list)
    • {default} during the preview, result set caching is off for all items [1]
  • [monitoring] 
    • via Message Output
      • applicable to Fabric Query editor, SSMS
      • the statement "Result set cache was used" is displayed after query execution if the query was able to use an existing result set cache
    • via queryinsights.exec_requests_history system view
      • result_cache_hit indicates result set cache usage for each query execution [1]
        • {value} 2: the query used result set cache (cache hit)
        • {value} 1: the query created result set cache
        • {value} 0: the query wasn't applicable for result set cache creation or usage [1]
          • {reason} the cache no longer exists
          • {reason} the cache was invalidated by a data change, disqualifying it for reuse [1]
          • {reason} query isn't deterministic
            • isn't eligible for cache creation [1]
          • {reason} query isn't a SELECT statement
  • [warehousing] 
    • {scenario} analytical queries that process large amounts of data to produce a relatively small result [1]
    • {scenario} workloads that trigger the same analytical queries repeatedly [1]
      • the same heavy computation can be triggered multiple times, even though the final result remains the same [1]
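
A minimal T-SQL sketch of the operations above; <database_name> is a placeholder for the warehouse or SQL analytics endpoint, dbo.Orders is a hypothetical table, and the queryinsights column names other than result_cache_hit are assumed from the standard view schema:

    -- enable result set caching at the item level
    ALTER DATABASE <database_name> SET RESULT_SET_CACHING ON;

    -- validate the setting
    SELECT name, is_result_set_caching_on
    FROM sys.databases;

    -- bypass the cache for an individual query, e.g. while debugging or A/B testing
    SELECT COUNT(*) AS order_count
    FROM dbo.Orders                                   -- hypothetical table
    OPTION (USE HINT ('DISABLE_RESULT_SET_CACHE'));

    -- monitor cache usage per execution (2 = cache hit, 1 = cache created, 0 = not applicable)
    SELECT start_time, command, result_cache_hit
    FROM queryinsights.exec_requests_history
    ORDER BY start_time DESC;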

References:
[1] Microsoft Learn (2025) Result set caching (preview) [link]
[2] Microsoft Fabric Update Blog (2025) Result Set Caching for Microsoft Fabric Data Warehouse (Preview) [link|aka]
[3] Microsoft Learn (2025) In-memory and disk caching [link]
[4] Microsoft Learn (2025) Performance guidelines in Fabric Data Warehouse [link]

Resources:
[R1] Microsoft Fabric (2025) Fabric Update - June 2025 [link]

Acronyms:
MF - Microsoft Fabric
SSMS - SQL Server Management Studio

18 May 2025

#️⃣Software Engineering: Mea Culpa (Part VII: A Look Forward)

Software Engineering Series

I worked for more than 20 years in various areas related to ERP systems - Data Migrations, Business Intelligence/Analytics, Data Warehousing, Data Management, Project Management, (data) integrations, Quality Assurance, and much more - having experience with IFS IV, Oracle e-Business Suite, MS Dynamics AX 2009 and, during the past 3-7 years, also with MS Dynamics 365 Finance, SCM & HR (in that order). Much earlier, I started to work with SQL Server (2000-2019) and Oracle, and more recently with Azure Synapse and MS Fabric, writing over time more than 800 ad-hoc queries and reports for the various stakeholders, covering all the important areas, as well as many more queries for monitoring the various environments. 

In the areas where I couldn’t acquire experience on the job, I tried to address this by learning in my free time. I did it because I take my profession seriously, and I want to know how (some) things work. I thus put a lot of time into trying to stay current with what’s happening in the MS Fabric world, from Power BI to KQL, Python, dataflows, SQL databases and much more. These technologies are Microsoft’s bet, though, at least from the German market’s perspective, all bets are off! Probably, many companies are circumspect or need more time to react to the political and economic impulses, or perhaps some companies are already in bad shape. 

Unfortunately, the political context has a broad impact on the economy and on what’s happening in the job market right now! However, the two aspects are not the only problem. The distance between candidates and jobs seems to grow, with a dense wall of opinion being built, multiple layers of presumptions filtering out voices that (still) matter! Does my experience matter, or does it become obsolete like the technologies I used to work with? But I continued to learn, to stay current… Or do I need to delete everything that reminds me of the old?

To succeed, or at least be hired today, one must fit a pattern that frankly doesn’t make sense! Yes, soft skills are important, though not all of them can compensate for the lack of technical skills! There seems to be a tendency to exaggerate some of the qualities associated with skills, or, better said, to hide behind big words. Sometimes it feels like an inaccurate Shakespearean adaptation of the stage on which we are merely players.

More likely, this lack of pragmatism will lead to suboptimal constructions that tend to collapse under their own weight. All the inefficiencies need to be corrected, or somebody (or something) must be able to bear their weight. I saw this happening too often in ERP implementations! Big words don’t compensate for the lack of pragmatism, skills, knowledge, effort or management! For many organizations the answer to today’s problems is more management, which occasionally might be the right approach, though it is not a universal solution for everything that crosses our path(s).

One of society’s answers to today’s problems seems to be taking refuge in AI. So, I wonder - where am I going now? Jobless, without an acceptable perspective, with AI penetrating the markets and probably making many jobs obsolete. One must adapt, but adapt to what? AI is brainless even if it can mimic intelligence! It can probably do more in time, to the degree that many more jobs will become obsolete (and I wonder what will happen to all those people). 

Conversely, some trends will probably be met by countertrends, though it’s challenging to depict in clear terms a future still in the making. Society seems to be at a crossroads, one more important than mine.


26 March 2025

💠🏭🗒️Microsoft Fabric: Polaris SQL Pool [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources and may deviate from them. Please consult the sources for the exact content!

Unfortunately, besides the referenced papers, there's almost no material that could be used to enhance the understanding of the concepts presented. 

Last updated: 26-Mar-2025

Read and Write Operations in Polaris [2]

[Microsoft Fabric] Polaris SQL Pool

  • {def} distributed SQL query engine that powers Microsoft Fabric's data warehousing capabilities
    • designed to unify data warehousing and big data workloads while separating compute and state for seamless cloud-native operations
    • based on a robust DCP 
      • designed to execute read-only queries in a scalable, dynamic and fault-tolerant way [1]
      • a highly-available micro-service architecture with well-defined responsibilities [2]
        • data and query processing is packaged into units (aka tasks) 
          • can be readily moved across compute nodes and re-started at the task level
        • widely-partitioned data with a flexible distribution model [2]
        • a task-level "workflow-DAG" that is novel in spanning multiple queries [2]
        • a framework for fine-grained monitoring and flexible scheduling of tasks [2]
  • {component} SQL Server Front End (SQL-FE)
    • responsible for 
      • compilation
      • authorization
      • authentication
      • metadata
        • used by the compiler to 
          • {operation} generate the search space (aka MEMO) for incoming queries
          • {operation} bind metadata to data cells
          • leveraged to ensure the durability of the transaction manifests at commit [2]
            • only transactions that successfully commit need to be actively tracked to ensure consistency [2]
            • any manifests and data associated with aborted transactions are systematically garbage-collected from OneLake through specialized system tasks [2]
  • {component} SQL Server Backend (SQL-BE)
    • used to perform write operations on the LST [2]
      • inserting data into a LST creates a set of Parquet files that are then recorded in the transaction manifest [2]
      • a transaction is represented by a single manifest file that is modified concurrently by (one or more) SQL BEs [2]
        • SQL BE leverages the Block Blob API provided by ADLS to coordinate the concurrent writes  [2]
        • each SQL BE instance serializes the information about the actions it performed, either adding a Parquet file or removing it [2]
          • the serialized information is then uploaded as a block to the manifest file
          • uploading the block does not yet make any visible changes to the file [2]
            • each block is identified by a unique ID generated on the writing SQL BE [2]
        • after completion, each SQL BE returns the ID of the block(s) it wrote to the Polaris DCP [2]
          • the block IDs are then aggregated by the Polaris DCP and returned to the SQL FE as the result of the query [2]
      • the SQL FE further aggregates the block IDs and issues a Commit Block operation against storage with the aggregated block IDs [2]
        • at this point, the changes to the file on storage will become effective [2]
      • changes to the manifest file are not visible until the Commit operation on the SQL FE
        • the Polaris DCP can freely restart any part of the operation in case there is a failure in the node topology [2]
      • the IDs of any blocks written by previous attempts are not included in the final list of block IDs and are discarded by storage [2]
    • [read operations] SQL BE is responsible for reconstructing the table snapshot based on the set of manifest files managed in the SQL FE
      • the result is the set of Parquet data files and deletion vectors that represent the snapshot of the table [2]
        • queries over these are processed by the SQL Server query execution engine [2]
        • the reconstructed state is cached in memory and organized in such a way that the table state can be efficiently reconstructed as of any point in time [2]
          • enables the cache to be used by different operations operating on different snapshots of the table [2]
          • enables the cache to be incrementally updated as new transactions commit [2]
  • {feature} supports explicit user transactions
    • can execute multiple statements within the same transaction in a consistent way (see the T-SQL sketch after this list)
      • the manifest file associated with the current transaction captures all the (reconciled) changes performed by the transaction [2]
        • changes performed by prior statements in the current transaction need to be visible to any subsequent statement inside the transaction (but not outside of the transaction) [2]
    • [multi-statement transactions] in addition to the committed set of manifest files, the SQL BE reads the manifest file of the current transaction and then overlays these changes on the committed manifests [1]
    • {write operations} the behavior of the SQL BE depends on the type of the operation.
      • insert operations 
        • only add new data and have no dependency on previous changes [2]
        • the SQL BE can serialize the metadata blocks holding information about the newly created data files just like before [2]
        • the SQL FE, instead of committing only the IDs of the blocks written by the current operation, will instead append them to the list of previously committed blocks
          • ⇐ effectively appends the data to the manifest file [2]
    • {update|delete operations} 
      • handled differently 
        • ⇐ since they can potentially further modify data already modified by a prior statement in the same transaction [2]
          • e.g. an update operation can be followed by another update operation touching the same rows
        • the final transaction manifest should not contain any information about the parts from the first update that were made obsolete by the second update [2]
      • SQL BE leverages the partition assignment from the Polaris DCP to perform a distributed rewrite of the transaction manifest to reconcile the actions of the current operation with the actions recorded by the previous operation [2]
        • the resulting block IDs are sent again to the SQL FE where the manifest file is committed using the (rewritten) block IDs [2]
  • {concept} Distributed Query Processor (DQP)
    • responsible for 
      • distributed query optimization
      • distributed query execution
      • query execution topology management
  • {concept} Workload Management (WLM)
    •  consists of a set of compute servers that are, simply, an abstraction of a host provided by the compute fabric, each with a dedicated set of resources (disk, CPU and memory) [2]
      • each compute server runs two micro-services
        • {service} Execution Service (ES) 
          • responsible for tracking the life span of tasks assigned to a compute container by the DQP [2]
        • {service} SQL Server instance
          • used as the back-bone for execution of the template query for a given task  [2]
            • ⇐ holds a cache on top of local SSDs 
              • in addition to in-memory caching of hot data
            • data can be transferred from one compute server to another
              • via dedicated data channels
              • the data channel is also used by the compute servers to send results to the SQL FE that returns the results to the user [2]
              • the life cycle of a query is tracked via control flow channels from the SQL FE to the DQP, and the DQP to the ES [2]
  • {concept} cell data abstraction
    • the key building block that enables to abstract data stores
      • abstracts DQP from the underlying store [1]
      • any dataset can be mapped to a collection of cells [1]
      • allows distributing query processing over data in diverse formats [1]
      • tailored for vectorized processing when the data is stored in columnar formats [1] 
      • further improves relational query performance
    • 2-dimensional
      • distributions (data alignment)
      • partitions (data pruning)
    • each cell is self-contained with its own statistics [1]
      • used for both global and local QO [1]
      • cells can be grouped physically in storage [1]
      • queries can selectively reference either cell dimension or even individual cells depending on predicates and type of operations present in the query [1]
    • {concept} distributed query processing (DQP) framework
      • operates at the cell level 
      • agnostic to the details of the data within a cell
        • data extraction from a cell is the responsibility of the (single node) query execution engine, which is primarily SQL Server, and is extensible for new data types [1], [2]
  • {concept} dataset
    • logically abstracted as a collection of cells [1] 
    • can be arbitrarily assigned to compute nodes to achieve parallelism [1]
    • uniformly distributed across a large number of cells 
      • [scale-out processing] each dataset must be distributed across thousands of buckets or subsets of data objects, such that they can be processed in parallel across nodes
  • {concept} session
    • supports a spectrum of consumption models, ranging from serverless ad-hoc queries to long-standing pools or clusters [1]
    • all data are accessible from any session [1]
      • multiple sessions can access all underlying data concurrently  [1]
  • {concept} Physical Metadata layer
    • new layer introduced in the SQL Server storage engine [2]
See also: Polaris
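
A minimal T-SQL illustration of the multi-statement transaction semantics described above, using a hypothetical table dbo.Orders: changes made by earlier statements are visible to later statements in the same transaction, but become visible to other sessions only at commit.

    BEGIN TRANSACTION;

        INSERT INTO dbo.Orders (OrderId, Amount)   -- hypothetical table
        VALUES (1, 100.00);

        -- visible here: the SQL BE overlays the current transaction's manifest
        -- on top of the committed manifests when reconstructing the snapshot
        SELECT COUNT(*) AS rows_in_txn FROM dbo.Orders;

        UPDATE dbo.Orders SET Amount = 120.00 WHERE OrderId = 1;

    COMMIT TRANSACTION;   -- only now do the manifest changes become effective for other sessions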

References:
[1] Josep Aguilar-Saborit et al (2020) POLARIS: The Distributed SQL Engine in Azure Synapse, Proceedings of the VLDB Endowment PVLDB 13(12) [link]
[2] Josep Aguilar-Saborit et al (2024), Extending Polaris to Support Transactions [link]
[3] Gjnana P Duvvuri (2024) Microsoft Fabric Warehouse Deep Dive into Polaris Analytic Engine [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
[R2] Patrick Pichler (2023) Data Warehouse (Polaris) vs. Data Lakehouse (Spark) in Microsoft Fabric [link]
[R3] Tiago Balabuch (2023) Microsoft Fabric Data Warehouse - The Polaris engine [link]

Acronyms:
CPU - Central Processing Unit
DAG - Directed Acyclic Graph
DB - Database
DCP - Distributed Computation Platform 
DQP - Distributed Query Processing 
DWH - Data Warehouses 
ES - Execution Service
LST - Log-Structured Table
SQL BE - SQL Backend
SQL FE - SQL Frontend
SSD - Solid State Disk
WAL - Write-Ahead Log
WLM - Workload Management

25 March 2025

🏭🗒️Microsoft Fabric: Security in Warehouse [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 25-Mar-2025

[Microsoft Fabric] Security in Warehouse
  • {def} suite of technologies aimed at safeguarding sensitive information in Fabric [1]
    • leverages SQL engine’s security features [1]
      • allows for security mechanism at the warehouse level [1]
      • ⇐ the warehouse and SQL analytics endpoint items also allow for the defining of native SQL security [4]
        • the permissions configured only apply to the queries executed against the respective surfaces [4]
      • the access to OneLake data is controlled separately through OneLake data access roles [4]
        • {recommendation} to ensure users with SQL specific permissions don't see data they don't have SQL access to, don't include those users in a OneLake data access role [4]
    • supports a range of data protection features that enable administrators to shield sensitive data from unauthorized access [1] (a minimal T-SQL sketch of these features follows the list below)
      • ⇐ across warehouses and SQL analytics endpoints without necessitating changes to applications [1]
    • {type} object-level security (OLS)
      • permissions governing DML operations [1]
        • applies to tables and views
        • ⇐ when denied, the user will be prevented from performing the respective operation
        • SELECT
          • allows users to view the data within the object [1]
        • INSERT
          • allows users to insert data in the object [1]
        • UPDATE
          • allows users to update data within the object [1]
        • DELETE
          • allows users to delete the data within the object [1]
      • permissions can be granted, revoked or denied on database objects [1]
        •  tables and views
        • GRANT
          • permission is granted to a user or role [1]
        • DENY
          • permission is denied to a user or role [1]
        • REVOKE
          • permission is revoked from a user or role [1]
        • ALTER
          • grants the user the ability to change the definition of the object [1]
        • CONTROL
          • grants the user all rights to the object [1]
      • {principle} least privilege
        • users and applications should only be given the permissions needed in order for them to complete the task
    • {type} column-level security (CLS)
      • allows to restrict column access to sensitive data [1]
        • provides granular control over who can access specific pieces of data [1]
          •  enhances the overall security of the data warehouse [1]
      • steps
        • identify the sensitive columns [1]
        • define access roles [1]
        • assign roles to users [1]
        • implement access control [1]
          • restrict access to a column based on the user's role [1]
    • {type} row-level security (RLS)
      • provides granular control over access to rows in a table based on group membership or execution context [1]
        • using WHERE clause filters [1]
      • works by associating a function (aka security predicate) with a table [1]
        • defined to return true or false based on certain conditions [1]
          • ⇐ typically involving the values of one or more columns in the table [1]
          • when a user attempts to access data in the table, the security predicate function is invoked [1]
            • if the function returns true, the row is accessible to the user; otherwise, the row doesn't show up in the query results [1]
        • the predicate can be as simple/complex as required [1]
        • the process is transparent to the user and is enforced automatically by SQL Server
          • ⇐ ensures consistent application of security rules [1]
      • implemented in two main steps:
        • filter predicates 
          • an inline table-valued function that filters the results based on the predicate defined [1]
        • security policy
          • invokes an inline table-valued function to protect access to the rows in a table [1]
            • because access control is configured and applied at the warehouse level, application changes are minimal - if any [1]
            • users can directly have access to the tables and can query their own data [1]
      • {recommendation} create a separate schema for predicate functions, and security policies [1]
      • {recommendation} avoid type conversions in predicate functions [1]
      • {recommendation} to maximize performance, avoid using excessive table joins and recursion in predicate functions [1]
    • {type} dynamic data masking (DDM) 
      • allows limiting data exposure to nonprivileged users by obscuring sensitive data [1]
        • e.g. email addresses 
      • {benefit} enhance the security and manageability of the data [1]
      • {feature} real-time masking
        • when querying sensitive data, DDM applies dynamic masking to it in real time [1]
          • the actual data is never exposed to unauthorized users, thus enhancing the security of your data [1]
        • straightforward to implement [1]
        • doesn’t require complex coding, making it accessible for users of all skill levels [1]
        • {benefit} the data in the database isn’t changed when DDM is applied
          •   the actual data remains intact and secure, while nonprivileged users only see a masked version of the data [1]
      • {operation} define masking rule
        • set up at column level [1]
        • offers a suite of features [1]
          • comprehensive and partial masking capabilities [1]
          • supports several masking types
            • help prevent unauthorized viewing of sensitive data [1]
              • by enabling administrators to specify how much sensitive data to reveal [1]
                •   minimal effect on the application layer [1]
            • applied to query results, so the data in the database isn't changed 
              •   allows many applications to mask sensitive data without modifying existing queries  [1]
          • random masking function designed for numeric data [1]
        • {risk} unprivileged users with query permissions can infer the actual data since the data isn’t physically obfuscated [1]
      • {recommendation} DDM should be used as part of a comprehensive data security strategy [1]
        • should include
          • the proper management of object-level security with SQL granular permissions [1]
          • adherence to the principle of minimal required permissions [1]
    • {concept} Dynamic SQL 
      • allows T-SQL statements to be generated within a stored procedure or a query itself [1]
        • executed via sp_executesql stored procedure
      • {risk} SQL injection attacks
        • use  QUOTENAME to sanitize inputs [1]
  • write access to a warehouse or SQL analytics endpoint
    • {approach} granted through the Fabric workspace roles
      • the role automatically translates to a corresponding role in SQL that grants equivalent write access [4]
      • {recommendation} if a user needs write access to all warehouses and endpoints, assign the user to a workspace role [4]
        • use the Contributor role unless the user needs to assign other users to workspace roles [4]
      • {recommendation} grant direct access through SQL permissions if the user only needs to write to specific warehouses or endpoints [4]
    • {approach} grant read access to the SQL engine, and grant custom SQL permissions to write to some or all the data [4]
  • read access to a warehouse or SQL analytics endpoint
    • {approach} grant read access through the ReadData permission, granted as part of the Fabric workspace roles [4]
      •  ReadData permission maps the user to a SQL role that gives SELECT permissions on all tables in the warehouse or lakehouse
        • helpful if the user needs to see all or most of the data in the lakehouse or warehouse [4]
        • any SQL DENY permissions set on a particular lakehouse or warehouse still apply and limit access to tables [4]
        • row and column level security can be set on tables to restrict access at a granular level [4]
    • {approach} grant read access to the SQL engine, and grant custom SQL permissions to read to some or all the data [4]
    • if the user needs access only to a specific lakehouse or warehouse, the share feature provides access to only the shared item [4]
      • during the share, users can choose to give only Read permission or Read + ReadData 
        • granting Read permission allows the user to connect to the warehouse or SQL analytics endpoint but gives no table access [4]
        • granting users the ReadData permissions gives them full read access to all tables in the warehouse or SQL analytics endpoint
      • ⇐ additional SQL security can be configured to grant or deny access to specific tables [4]
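
A minimal T-SQL sketch of the security features above; all object, schema, column, and principal names (dbo.SalesOrders, dbo.Customers, sales_analyst, Security.fn_region_predicate) are hypothetical placeholders:

    -- object-level security: grant/deny/revoke DML permissions on tables and views
    GRANT SELECT, INSERT ON dbo.SalesOrders TO [sales_analyst];
    DENY DELETE ON dbo.SalesOrders TO [sales_analyst];
    REVOKE INSERT ON dbo.SalesOrders FROM [sales_analyst];

    -- column-level security: expose only non-sensitive columns
    GRANT SELECT ON dbo.Customers (CustomerId, CustomerName, Country) TO [sales_analyst];

    -- row-level security: filter predicate (inline table-valued function) + security policy,
    -- kept in a separate schema as recommended
    CREATE SCHEMA Security;
    GO
    CREATE FUNCTION Security.fn_region_predicate (@Region AS varchar(50))
    RETURNS TABLE
    WITH SCHEMABINDING
    AS
    RETURN SELECT 1 AS fn_result
           WHERE @Region = USER_NAME() OR USER_NAME() = 'sales_manager';
    GO
    CREATE SECURITY POLICY RegionFilter
    ADD FILTER PREDICATE Security.fn_region_predicate(Region) ON dbo.SalesOrders
    WITH (STATE = ON);
    GO

    -- dynamic data masking: masking rule defined at column level
    ALTER TABLE dbo.Customers
    ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

    -- dynamic SQL: sanitize identifiers with QUOTENAME before sp_executesql
    DECLARE @table sysname = N'Customers', @sql nvarchar(max);
    SET @sql = N'SELECT COUNT(*) FROM dbo.' + QUOTENAME(@table) + N';';
    EXEC sp_executesql @sql;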

    References:
    [1] Microsoft Learn (2024) Secure a Microsoft Fabric data warehouse [link]
    [2] Data Mozart (2025) Lock Up! Understanding Data Access Options in Microsoft Fabric, by Nikola Ilic [link]
    [3] Microsoft Learn (2024) Security in Microsoft Fabric [link]
    [4] Microsoft Learn (2024) Microsoft Fabric: How to secure a lakehouse for Data Warehousing teams [link]

    Resources:
    [R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]
    [R2] Microsoft Learn (2025) Fabric: Security for data warehousing in Microsoft Fabric [link]
    [R3] Microsoft Learn (2025) Fabric: Share your data and manage permissions [link]

    Acronyms:
    CLS - Column-Level Security
    DDM - Dynamic Data Masking
    DML - Data Manipulation Language 
    MF - Microsoft Fabric
    OLS - Object-Level Security
    RLS - Row-Level Security
    SQL - Structured Query Language

    18 March 2025

    🏭🗒️Microsoft Fabric: Statistics in Warehouse [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 18-Mar-2025

    [Microsoft Fabric] Statistics
    • {def} objects that contain relevant information about data, to allow the query optimizer to estimate plan costs [1]
      • critical for the warehouse and lakehouse SQL endpoint for executing queries quickly and efficiently [3]
        • when a query is executed, the engine tries to collect existing statistics for certain columns in the query and use that information to assist in choosing an optimal execution plan [3]
        • inaccurate statistics can lead to unoptimized query plans and execution times [5]
    • {type} user-defined statistics
      • statistics defined manually by the users via DDL statement [1]
      • users can create, update and drop statistics
        • via CREATE|UPDATE|DROP STATISTICS (see the sketch after this list)
      • users can review the contents of histogram-based single-column statistics [1]
        • via DBCC SHOW_STATISTICS
          • only a limited version of these statements is supported [1]
      • {recommendation} focus on columns heavily used in query workloads
        • e.g. GROUP BYs, ORDER BYs, filters, and JOINs
      • {recommendation} consider updating column-level statistics regularly [1]
        • e.g. after data changes that significantly change rowcount or distribution of the data [1]
    • {type} automatic statistics
      • statistics created and maintained automatically by the query engine at query time [1]
      • when a query is issued and query optimizer requires statistics for plan exploration, MF automatically creates those statistics if they don't already exist [1]
        • then the query optimizer can utilize them in estimating the plan costs of the triggering query [1]
        • if the query engine determines that existing statistics relevant to query no longer accurately reflect the data, those statistics are automatically refreshed [1]
          • these automatic operations are done synchronously [1]
            • the query duration includes this time [1]
    • {object type} histogram statistics
      • created per column needing histogram statistics at query time [1]
      • contains histogram and density information regarding the distribution of a particular column [1]
      • similar to the statistics automatically created at query-time in Azure Synapse Analytics dedicated pools [1]
      • name begins with _WA_Sys_.
      • contents can be viewed with DBCC SHOW_STATISTICS
    • {object type} average column length statistics
      • created for variable character (varchar) columns with a length greater than 100 that need average column length at query time [1]
      • contain a value representing the average row size of the varchar column at the time of statistics creation [1]
      • name begins with ACE-AverageColumnLength_
      • contents cannot be viewed and are nonactionable by users [1]
    • {object type} table-based cardinality statistics
      • created per table needing cardinality estimation at query-time [1]
      • contain an estimate of the rowcount of a table [1]
      • named ACE-Cardinality [1]
      • contents cannot be viewed and are nonactionable by user [1]
    • [lakehouse] SQL analytics endpoint
      • uses the same engine as the warehouse to serve high performance, low latency SQL queries [4]
      • {feature} automatic metadata discovery
        • a seamless process reads the delta logs from the files folder and ensures SQL metadata for tables is always up to date [4]
          • e.g. statistics [4]
    • {limitation} only single-column histogram statistics can be manually created and modified [1]
    • {limitation} multi-column statistics creation is not supported [1]
    • {limitation} other statistics objects might appear in sys.stats
      • besides the statistics created manually/automatically [1]
        • ⇐ the objects are not used for query optimization [1]
    • {limitation} if a transaction has data insertion into an empty table and issues a SELECT before rolling back, the automatically generated statistics can still reflect the uncommitted data, causing inaccurate statistics [5]
      • {recommendation} update statistics for the columns mentioned in the SELECT [5]
    • {recommendation} ensure all table statistics are updated after large DML transactions [2]
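
    A minimal T-SQL sketch of user-defined statistics maintenance, using hypothetical table/column names (dbo.Sales, CustomerKey):

        -- create a single-column histogram statistic on a column heavily used in joins/filters
        CREATE STATISTICS stat_Sales_CustomerKey ON dbo.Sales (CustomerKey);

        -- refresh it after data changes that significantly alter rowcount or distribution
        UPDATE STATISTICS dbo.Sales (stat_Sales_CustomerKey);

        -- inspect the histogram (only a limited version of the statement is supported)
        DBCC SHOW_STATISTICS ('dbo.Sales', 'stat_Sales_CustomerKey');

        -- list statistics objects on the table, including automatically created ones
        -- (_WA_Sys_*, ACE-AverageColumnLength_*, ACE-Cardinality)
        SELECT s.name, s.auto_created, s.user_created
        FROM sys.stats AS s
        WHERE s.object_id = OBJECT_ID('dbo.Sales');

        -- drop the statistic when no longer needed
        DROP STATISTICS dbo.Sales.stat_Sales_CustomerKey;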


    References:
    [1] Microsoft Learn (2025) Fabric: Statistics in Fabric data warehousing [link]
    [2] Microsoft Learn (2025) Fabric: Troubleshoot the Warehouse [link]
    [3] Microsoft Fabric Updates Blog (2023) Microsoft Fabric July 2023 Update [link]
    [4] Microsoft Learn (2024) Fabric: Better together: the lakehouse and warehouse [link]
    [5] Microsoft Learn (2024) Fabric: Transactions in Warehouse tables in Microsoft Fabric [link]

    Resources:
    [R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

    Acronyms:
    DDL - Data Definition Language
    MF - Microsoft Fabric

    17 March 2025

    🏭🗒️Microsoft Fabric: Caching in Warehouse [Notes]

    Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

    Last updated: 17-Mar-2025

    [Microsoft Fabric] Caching
    • {def} technique that improves the performance of data processing by storing frequently accessed data and metadata in a faster storage layer [1]
      • e.g. local memory, local SSD disk
      • ⇐ subsequent requests can be served faster, directly from the cache [1]
        • if a set of data has been previously accessed by a query, any subsequent queries will retrieve that data directly from the in-memory cache [1]
        • local memory operations are notably faster compared to fetching data from remote storage [1]
        • ⇐ significantly diminishes IO latency [1]
      • fully transparent to the user
      • consistently active and operates seamlessly in the background [1]
      • orchestrated and upheld by MF
        • it doesn't offer users the capability to manually clear the cache [1] 
      • provides transactional consistency
        • ensures that any modifications to the data in storage after it has been initially loaded into the in-memory cache, will result in consistent data [1]
      • when the cache reaches its capacity threshold and fresh data is being read for the first time, objects that have remained unused for the longest duration will be removed from the cache [1]
        • process is enacted to create space for the influx of new data and maintain an optimal cache utilization strategy [1] 
    • {type} in-memory cache
      • data in cache is organized in a compressed columnar format (aka columnar storage) [1]
        • ⇐ optimized for analytical queries
        • each column of data is stored separately [1]
          • {benefit} allows for better compression [1]
            • since similar data values are stored together [1]
          • {benefit} reduces the memory footprint
      • when queries need to perform operations on a specific column, the engine can work more efficiently speeding up queries' execution [1]
        • it doesn't have to process unnecessary data from other columns
        • can perform operations on multiple columns simultaneously [1]
          • taking advantage of modern multi-core processors [1]
      • when data is retrieved from storage, it is transformed from its original file-based format into highly optimized structures in the in-memory cache [1]
      • {scenario} analytical workloads where queries involve scanning large amounts of data to perform aggregations, filtering, and other data manipulations [1]
    • {type} disk cache
      • complementary extension to the in-memory cache
      • any data loaded into the in-memory cache is also serialized to the SSD cache [1]
        • data removed from the in-memory cache remains within the SSD cache for an extended period
        • when subsequent query requests the data, it is retrieved from the SSD cache into the in-memory cache quicker [1]
    • {issue} cold run (aka cold cache) performance
      • the first 1-3 executions of a query perform noticeably slower than subsequent executions [2]
        • if the first run's performance is crucial, try manually creating statistics (aka pre-warming the cache) [2] (see the sketch after this list)
        • otherwise, one can rely on the automatic statistics generated in the first query run and leveraged in subsequent runs [2]
        • as long as underlying data does not change significantly [2]
    • differentiated from
      • [Kusto] caching policy [link]
      • [Apache Spark] intelligent cache [link]
      • [Power BI] query caching [link]
      • [Azure] caching [link]
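
      A minimal sketch of the pre-warming idea mentioned above, with hypothetical names (dbo.FactSales, OrderDate); the intent is to pay the statistics-creation and first-read cost before the latency-critical run:

          -- create statistics up front so the first user-facing run doesn't do it synchronously
          CREATE STATISTICS stat_FactSales_OrderDate ON dbo.FactSales (OrderDate);

          -- optionally run a representative query once so its data lands in the
          -- in-memory and SSD caches before the critical workload starts
          SELECT COUNT(*) AS row_count
          FROM dbo.FactSales
          WHERE OrderDate >= '2025-01-01';
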
      References:
      [1] Microsoft Learn (2024) Fabric: Caching in Fabric data warehousing [link]
      [2] Microsoft Learn (2024) Fabric Data Warehouse performance guidelines [link]
      [3] Sandeep Pawar (2023) Pre-Warming The Direct Lake Dataset For Warm Cache Import-Like Performance [link]

      Resources:
      [R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

      Acronyms:
      IO - Input/Output
      MF - Microsoft Fabric
      SSD - Solid State Drive

      15 March 2025

      💫🗒️ERP Systems: Microsoft Dynamics 365's Business Performance Analytics (BPA) [notes]

      Disclaimer: This is work in progress intended to consolidate information from the various sources and not to provide a complete overview of all the features. Please refer to the documentation for a complete overview!

      Last updated: 15-Mar-2025

      [Dynamics 365] Business Performance Analytics (BPA)

      • {def} centralized reporting hub within D365 F&O designed to streamline insights and help organizations make faster, data driven decisions [3]
        • solution designed to transform organization's data into actionable insights [1]
        • provides an intuitive out-of-box data model along with familiar tools like Microsoft Excel and Power BI for self-service analytics [4]
          • data extracted from D365 is classified in BPA in the form of value chains
            • ⇐ each value chain groups together related business processes [4]
      • {benefit} allows to simplify data insights by providing a unified view of business data across entities in near real time [4]
      • {benefit} allows to streamline financial and operations reporting to reduce the cycle times [4]
      • {benefit} allows users of all technical abilities to quickly access and analyze data to facilitate data driven decisions [4]
      • {benefit} provides auditors with direct access to financial data, making the audit process more efficient
      • {benefit} enables ease of use through familiar apps like Excel and Power BI, in addition to AI driven insights and automation in this platform that can be scalable and extendable [4]
      • {feature} extends into Microsoft Fabric
        • {benefit} provide a scalable, secure environment for handling large data sets and ensuring insights are always powered by the latest technology [3]
      • {feature} ETL process 
        • involves extracting data from finance and operations database, transforming and loading it into Dataverse [4]
          • each of the entities required for generating the dimensional model for the value chains is backed by underlying tables in the finance and operations database [4]
        • virtual entities created in Dataverse then pull the data into the managed data lake [4]
        • the data is then transformed to generate the dimensional model, which is then pushed into the embedded Power BI workspace in the form of analytical tables [4]
        • BPA consumes this data from the Power BI workspace to render the Power BI reports [4]
        • this data can also be extended to Fabric if there is a need to consolidate data from multiple sources [4]
      • {feature} reports 
        • designed to provide a detailed overview of an organization's financial health [8]
        • further reports will be added to expand the coverage for the value chains [8]
        • out-of-box reports can't be modified
          • ⇐ users cannot rename, delete or edit these types of reports [8]
          • there’s the option to duplicate the base report and edit the version thus created [8]
        • can be shared with other users who have access to BPA 
          • ⇐ they can receive an in-app notification [8]
          • can be shared over email with another user by entering user’s email address [8] 
          • one can configure whether the recipient can edit or view the report [8]
        •   {feature} allows to create a new Power BI or Excel report from scratch [8]
          • {option} start with a blank report or duplicate an existing report [8]
      • {feature} data refresh
        • automatic data refreshes run currently two times a day [4]
          • at 12:00 AM and 12:00 PM UTC
          • the volume of data is also constrained by the storage capacity of the A3 SKU for Power BI Embedded [1]
            • future release, may support additional data reporting capacity [1]
              • ⇐ so that larger data sets can be reported and analyzed [1]
          • the target is to have refreshes every hour or less [3]
        • data volume will be initially for about eight quarters of data [4]
        • extensibility will be supported with bring your own Fabric [4]
      • architecture
        • SaaS solution
          • {capability} immediate deployment 
            • businesses can start to analyze data and generate insights with minimal setup [1]
          • {capability} comprehensive reporting and dashboards
            • provides access to a wide range of preconfigured reports that cover multiple business functions [1]
          • {capability} near-real-time analytics 
            • future releases will offer more frequent data refreshes to enable near-real-time data analysis and reporting
          • {capability} predictive insights 
            • future releases will introduce predictive analytics capabilities that enable businesses to 
              • forecast trends
              • identify risks
              • seize opportunities [1]
          • {capability} user-friendly interface 
            • intuitive design ⇒ minimal training
              • fosters broader adoption 
              • enables a data-driven culture across the organization [1]
          • {capability} cost-effectiveness
            • available as part of D365 license
              • ⇒ provides advanced analytics without requiring significant investments in IT infrastructure [1]
        • DaaS solution
          • {capability} organizations can integrate its data models with their existing data warehousing infrastructure in Microsoft Fabric [1]
            • maximizes the value of existing data solutions [1]
            • positions businesses for future enhancements [1]
          • {capability} unified and scalable data models
            • customers can build custom models on top of a unified framework
              • ensures consistency and scalability across data sets [1]
          • {capability} future-proofing with automatic upgrades
            • data models integrate seamlessly with future D365 updates
              • reduces manual maintenance and ensures access to the latest features [1]
          • {capability} consistency and standardization
            • data models provide consistency and standardization across data sources
              • ensure high data quality and integrity [1]
          • {capability} advanced analytics and AI 
            • by customizing the data models, organizations can take advantage of advanced analytics and AI capabilities [1]
              • deeper insights without having to develop them from scratch [1]
          • {capability} enhanced data governance
            • unified data models support better data governance by providing standardized data definitions, relationships, and hierarchies [1]
              • ensure consistency and quality across the organization [1]
        • requires an integrated Power Platform environment [5]
          • must be integrated with the Microsoft Entra tenant [5]
        • uses shared Dataverse entitlements [1]
          • includes access to the data lake [1]
      • setup
        • dimensions
          • the selection of dimensions might affect the dimension groups that are created using these dimensions and the users who are assigned there [7]
            • e.g. legal entity, business unit
        • dimension groups
          • users can select specific values for the legal entity, or add a range of values [7]
            • if an invalid combination of dimension values is selected, the dimension group will filter out all the records on the report [7]
          • {warning} assigning too many dimension groups to a user, slows the load for that user [7]
        • roles
          • determine which reports the user can access [7]
      • security
        • secure data through role-based access control on top of the value chains [7]
        • the first user who signs into the app is assigned the BPA admin role [7]
          • allows a user to access the administrator section of the BPA [7]
            • where the security can be set up [7]
          • is automatically assigned 
            • Microsoft report viewer role 
            • the All Access Dimension group [7]
              • allow the admin to see the data  in all the reports across all the dimensions [7]
        • {feature} dimension-based role-level security
          • ensures that users only see the data relevant to them based on their role
            •  confidently share reports without duplicating them
              • ⇐ data is automatically filtered by organization's security policies [3]
          • simple but powerful way to maintain control while providing access for teams that love working in Excel [3]
      • accessibility
        • can be accessed through either 
          • Power Platform
            • admins can access the BPA app through the Power Apps maker portal [6]
          • Dynamics 365
            • through the BPA preview shortcut in the homepage or the default dashboard [6]
            • for end users, the BPA preview shortcut is provided when they have certain duties associated to their role(s) [6]
      • licensing
        • included in D365 F&O license [4]
      • requirements
        • requires a tier-two environment and Dynamics 365 Finance version 10.0.38 or later [5]
      • {project} timeline
        • [2025 wave 1] backup and restore custom reports and analytics
          • {benefit} support better lifecycle management and empower customers to develop on sandbox instances before publishing to production [3]
        • 2025: available in all regions where F&O is available [3]
        • Oct-2024: GA

      References:
      [1] Microsoft Learn (2024) Dynamics 365 Finance: What is Business performance analytics? [link]
      [2] Microsoft Learn (2025) Business performance analytics (BPA) with Dynamics 365 Finance [link]
      [3] Dynamics 365 Finance - Business Performance Analytics 2025 Release Wave 1 Release Highlights [link]
      [4] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 1 [link]
      [5] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 2 [link]
      [6] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 3 [link]
      [7] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 4 [link]   
      [8] Dynamics 365 Community (2024) Dynamics 365 Bites: Business Performance Analytics Part 5 [link]
      [9] Microsoft Learn (2024) Dynamics 365: Business performance analytics introduction [link

      Acronyms:
      AI - Artificial Intelligence
      BPA - Business Performance Analytics
      D365 F&O - Dynamics 365 for Finance and Operations
      DaaS - Data-as-a-Service
      ETL - Extract, Transform, Load
      GA - General Availability
      MF - Microsoft Fabric
      PP - Public Preview
      SaaS - Software-as-a-Service
      SKU - Stock Keeping Unit
      UTC - Coordinated Universal Time

      08 March 2025

      🏭🎗️🗒️Microsoft Fabric: Eventstreams [Notes]

      Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

      Last updated: 8-Mar-2025

      Real-Time Intelligence architecture
      Real-Time Intelligence architecture [4]

      [Microsoft Fabric] Eventstream(s)

      • {def} feature in Microsoft Fabric's Real-Time Intelligence experience that allows bringing real-time events into Fabric
        • bring real-time events into Fabric, transform them, and then route them to various destinations without writing any code 
          • ⇐ aka no-code solution
          • {feature} drag and drop experience 
            • gives users an intuitive and easy way to create your event data processing, transforming, and routing logic without writing any code 
        • works by creating a pipeline of events from multiple internal and external sources to different destinations
          • like a conveyor belt that moves data from one place to another [1]
          • transformations to the data can be added along the way [1]
            • filtering, aggregating, or enriching
      • {def} eventstream
        • an instance of the Eventstream item in Fabric [2]
        • {feature} end-to-end data flow diagram 
          • provides a comprehensive understanding of the data flow and organization [2]
      • {feature} eventstream visual editor
        • used to design pipelines by dragging and dropping different nodes [1]
        • sources
          • where event data comes from 
          • one can choose
            • the source type
            • the data format
            • the consumer group
          • Azure Event Hubs
            • allows to get event data from an Azure event hub [1]
            • allows to create a cloud connection  with the appropriate authentication and privacy level [1]
          • Azure IoT Hub
            • SaaS service used to connect, monitor, and manage IoT assets with a no-code experience [1]
          • CDC-enabled databases
            • software process that identifies and tracks changes to data in a database, enabling real-time or near-real-time data movement [1]
            • Azure SQL Database 
            • PostgreSQL Database
            • MySQL Database
            • Azure Cosmos DB
          • Google Cloud Pub/Sub
            • messaging service for exchanging event data among applications and services [1]
          • Amazon Kinesis Data Streams
            • collect, process, and analyze real-time, streaming data [1]
          • Confluent Cloud Kafka
            • fully managed service based on Apache Kafka for stream processing [1]
          • Fabric workspace events
            • events triggered by changes in Fabric Workspace
              • e.g. creating, updating, or deleting items. 
            • allows to capture, transform, and route events for in-depth analysis and monitoring within Fabric [1]
            • the integration offers enhanced flexibility in tracking and understanding workspace activities [1]
          • Azure blob storage events
            • system triggers for actions like creating, replacing, or deleting a blob [1]
              • these actions are linked to Fabric events
                • allowing to process Blob Storage events as continuous data streams for routing and analysis within Fabric  [1]
            • support streamed or unstreamed events [1]
          • custom endpoint
            • REST API or SDKs can be used to send event data from custom app to eventstream [1]
            • allows to specify the data format and the consumer group of the custom app [1]
          • sample data
            • out-of-box sample data
        • destinations
          • where transformed event data is stored. 
            • in a table in an eventhouse or a lakehouse [1]
            • redirect data to 
              • another eventstream for further processing [1]
              • an activator to trigger an action [1]
          • Eventhouse
            • offers the capability to funnel your real-time event data into a KQL database [1]
          • Lakehouse
            • allows to preprocess real-time events before their ingestion in the lakehouse
              • the events are transformed into Delta Lake format and later stored in specific lakehouse tables [1]
                • facilitating the data warehousing needs [1]
          • custom endpoint
            • directs real-time event traffic to a bespoke application [1]
            • enables the integration of proprietary applications with the event stream, allowing for the immediate consumption of event data [1]
            • {scenario} aim to transfer real-time data to an independent system not hosted on the Microsoft Fabric [1]
          • Derived Stream
            • specialized destination created after applying stream operations like Filter or Manage Fields to an eventstream
            • represents the altered default stream after processing, which can be routed to various destinations within Fabric and monitored in the Real-Time hub [1]
          • Fabric Activator
            • enables to use Fabric Activator to trigger automated actions based on values in streaming data [1]
        • transformations
          • filter or aggregate the data as is processed from the stream [1]
          • include common data operations
            • filtering
              • filter events based on the value of a field in the input
              • depending on the data type (number or text), the transformation keeps the values that match the selected condition, such as is null or is not null [1]
            • joining
              • transformation that combines data from two streams based on a matching condition between them [1]
            • aggregating
              • calculates an aggregation every time a new event occurs over a period of time [1]
                • Sum, Minimum, Maximum, or Average
              • allows renaming calculated columns, and filtering or slicing the aggregation based on other dimensions in your data [1]
              • one can have one or more aggregations in the same transformation [1]
            • grouping
              • allows to calculate aggregations across all events within a certain time window [1]
                • one can group by the values in one or more fields [1]
              • allows for the renaming of columns
                • similar to the Aggregate transformation 
                • ⇐ provides more options for aggregation and includes more complex options for time windows [1]
              • allows to add more than one aggregation per transformation [1]
                • allows to define the logic needed for processing, transforming, and routing event data [1]
            • union
              • allows to connect two or more nodes and add events with shared fields (with the same name and data type) into one table [1]
                • fields that don't match are dropped and not included in the output [1]
            • expand
              • array transformation that allows to create a new row for each value within an array [1]
            • manage fields
              • allows to add, remove, change data type, or rename fields coming in from an input or another transformation [1]
            • temporal windowing functions 
              • enable to analyze data events within discrete time periods [1]
              • way to perform operations on the data contained in temporal windows [1]
                • e.g. aggregating, filtering, or transforming streaming events that occur within a specified time period [1]
                • allow analyzing streaming data that changes over time [1]
                  • e.g. sensor readings, web-clicks, on-line transactions, etc.
                  • provide great flexibility to keep an accurate record of events as they occur [1]
              • {type} tumbling windows
                • divides incoming events into fixed and nonoverlapping intervals based on arrival time [1]
              • {type} sliding windows 
                • divide the events into fixed and overlapping intervals based on time [1]
              • {type} session windows 
                • divide events into variable and nonoverlapping intervals based on a gap in activity [1]
              • {type} hopping windows
                • differ from tumbling windows in that they model scheduled overlapping windows [1]
              • {type} snapshot windows 
                • group events that have the same timestamp; unlike the other windowing functions, they don't require an explicit windowing function to be named [1]
                • one can add the System.Timestamp() to the GROUP BY clause [1]
              • {parameter} window duration
                • the length of each window interval [1]
                • can be in seconds, minutes, hours, and even days [1]
              • {parameter} window offset
                • optional parameter that shifts the start and end of each window interval by a specified amount of time [1]
              • {concept} grouping key
                • one or more columns in your event data that you wish to group by [1]
              • {concept} aggregation function
                • one or more of the functions applied to each group of events in each window [1]
                  • where the counts, sums, averages, min/max, and even custom functions become useful [1]
        • see the event data flowing through the pipeline in real-time [1]
        • handles the scaling, reliability, and security of event stream automatically [1]
          • no need to write any code or manage any infrastructure [1]
        • {feature} eventstream editing canvas
          • used to 
            • add and manage sources and destinations [1]
            • see the event data [1]
            • check the data insights [1]
            • view logs for each source or destination [1]
      • {feature} Apache Kafka endpoint on the Eventstream item
        • {benefit} enables users to connect and consume streaming events through the Kafka protocol [2]
          • application using the protocol can send or receive streaming events with specific topics [2]
          • requires updating the connection settings to use the Kafka endpoint provided in the Eventstream [2]
      • {feature} support runtime logs and data insights for the connector sources in Live View mode [3]
        • allows to examine detailed logs generated by the connector engines for the specific connector [3]
          • help with identifying failure causes or warnings [3]
          • ⇐ accessible in the bottom pane of an eventstream by selecting the relevant connector source node on the canvas in Live View mode [3]
      • {feature} support data insights for the connector sources in Live View mode [3]
      • {feature} integrates eventstreams with CI/CD tools
          • {benefit} developers can efficiently build and maintain eventstreams from end-to-end in a web-based environment, while ensuring source control and smooth versioning across projects [3]
      • {feature} REST APIs
        • allow automating and managing eventstreams programmatically
          • {benefit} simplify CI/CD workflows and making it easier to integrate eventstreams with external applications [3]
      • {recommendation} use the eventstreams feature with at least SKU F4 [2]
      • {limitation} maximum message size: 1 MB [2]
      • {limitation} maximum retention period of event data: 90 days [2]

      References:
      [1] Microsoft Learn (2024) Microsoft Fabric: Use real-time eventstreams in Microsoft Fabric [link]
      [2] Microsoft Learn (2025) Microsoft Fabric: Fabric Eventstream - overview [link]
      [3] Microsoft Learn (2024) Microsoft Fabric: What's new in Fabric event streams? [link]
      [4] Microsoft Learn (2025) Real Time Intelligence L200 Pitch Deck [link]
      [5] Microsoft Learn (2025) Use real-time eventstreams in Microsoft Fabric [link]

      Resources:
      [R1] Microsoft Learn (2024) Microsoft Fabric exercises [link]
      [R2] Microsoft Fabric Updates Blog (2024) CI/CD – Git Integration and Deployment Pipeline [link]
      [R3] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

      Acronyms:
      API - Application Programming Interface
      CDC - Change Data Capture
      CI/CD  - Continuous Integration/Continuous Delivery
      DB - database
      IoT - Internet of Things
      KQL - Kusto Query Language
      RTI - Real-Time Intelligence
      SaaS - Software-as-a-Service
      SDK - Software Development Kit
      SKU - Stock Keeping Unit
