
06 April 2025

🏭🗒️Microsoft Fabric: Query Optimizer in Warehouse [Notes] 🆕

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 6-Apr-2025

[Microsoft Fabric] Hints in Warehouse
  • {def} keywords that users can add to SQL statements to provide additional information or instructions to the query optimizer [2]
    • options or strategies specified for enforcement by the SQL Server query processor [1]
      • applicable to SELECT, INSERT, UPDATE, or DELETE statements [1]
  • {benefit} help improve the performance, scalability, or consistency of queries by overriding the default behavior of the query optimizer [2]
  • {type} join hints
    • specify which join strategy/algorithm is used between two tables [3] (see the T-SQL sketch after this outline)
      • improves the performance of queries that involve large or complex joins [2]
      • specified in the FROM clause of a query [1]
      • if a join hint is specified for any two tables, the query optimizer automatically enforces the join order for all joined tables in the query [3]
        • based on the position of the ON keywords [3]
          • when a CROSS JOIN is used without the ON clause, parentheses can be used to indicate the join order [3]
    • looping
      • via LOOP JOIN
      • {restriction} can't be specified together with RIGHT or FULL as a join type [3]
    • hashing
      • via HASH JOIN
    • merging
      • via MERGE JOIN
    • REPLICATE
      • causes a broadcast move operation
        • a specific table to be replicated across all distribution nodes [2]
      • with an INNER or LEFT join
        • the broadcast move operation will replicate the right side of the join to all nodes [2]
      • with RIGHT join
        • the broadcast move operation will replicate the left side of the join to all nodes [2]
      • with a FULL join
        • an estimated plan cannot be created [2]
    • REDISTRIBUTE [(columns_count)]
      • ensures two data sources are distributed based on the JOIN clause columns [2]
      • handles multiple join conditions, specified by the first n columns in both tables, where n is the columns_count argument [2]
      • redistributing data optimizes query performance by evenly spreading data across nodes during intermediate steps of execution [2]
      • the (columns_count) argument is only supported in MF warehouse [2]
  • {type} query hint
    • specify that the indicated hints are used in the scope of a query [3]
      • affect all operators in the statement
        • [UNION] only the last query involving a UNION operation can have the OPTION clause [3]
      • specified as part of the OPTION clause [3]
      • Error 8622 occurs if one or more query hints cause the Query Optimizer not to generate a valid plan [3]
    • used via the OPTION clause at the end of a query [2]
      • followed by the name of the query hint and its optional parameters in parentheses [2]
      • multiple hints can be used in the same query, separated by commas
        • e.g. FORCE ORDER and MAX_GRANT_PERCENT
          • instruct the QO to preserve the join order specified in the query and to limit the memory grant to 20 percent of the available memory (see the T-SQL sketch after this outline)
    • {hint} HASH GROUP
      • specifies that the QO should use a hash-based algorithm for the GROUP BY operation [2]
      • {benefit} can improve the performance of queries that involve large or complex grouping sets [2]
    • {hint} ORDER GROUP
      • specifies that the QO should use a sort-based algorithm for the GROUP BY operation [2]
      • {benefit} can improve the performance of queries that involve small or simple grouping sets [2]
    • {hint} MERGE UNION
      • specifies that the QO should use a merge-based algorithm for the UNION or UNION ALL operation [2]
      • {benefit} can improve the performance of queries that involve sorted inputs [2]
    • {hint} HASH UNION
      • specifies that the query optimizer should use a hash-based algorithm for the UNION or UNION ALL operation [2]
      • {benefit} can improve the performance of queries that involve unsorted or large inputs [2]
    • {hint} CONCAT UNION
      • specifies that the QO should use a concatenation-based algorithm for the UNION or UNION ALL operation [2]
      • {benefit} can improve the performance of queries that involve distinct or small inputs [2]
    • {hint} FORCE ORDER
      • specifies that the QO should preserve the join order specified in the query [2]
      • {benefit} can improve the performance or consistency of queries that involve complex join conditions or hints [2]
    • {hint} FORCE SINGLE NODE PLAN/FORCE DISTRIBUTED PLAN
      • allows choosing whether to force a single-node plan or a distributed plan for the query's execution [2]
    • {hint} USE HINT
      • adds one or more extra hints to the query processor; the hints are specified by name inside single quotation marks within the OPTION clause [2]
        • e.g. OPTION (USE HINT ('ASSUME_MIN_SELECTIVITY_FOR_FILTER_ESTIMATES'))
      • used with several hint names that change the behavior of cardinality estimate (CE) derivation
        • {hint name} ASSUME_MIN_SELECTIVITY_FOR_FILTER_ESTIMATES
          • applies when calculating cardinality estimates for AND predicates for filters [2]
          • MF assumes full correlation among filters when a high level of underestimation on AND predicates for filters is observed [2]
          • [SQL Server] equivalent TF 4137 [4]
        • {hint name} ASSUME_FULL_INDEPENDENCE_FOR_FILTER_ESTIMATES
          • applies when calculating cardinality estimates for AND predicates for filters  [2]
          • MF assumes full independence among filters [2]
            • if a high level of overestimation on AND predicates for filters is observed, this hint can help produce a better estimate [2]
        • {hint name} ASSUME_PARTIAL_CORRELATION_FOR_FILTER_ESTIMATES
          • applies when calculating cardinality estimates for AND predicates for filters [2]
          • {default} MF assumes partial correlation among filters [2]
            • ⇐ it is unlikely that this hint will help improve the estimates [2]
        • {hint name} ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS
          • applies when calculating cardinality estimate for joins [2]
          • uses Simple Containment assumption instead of the default Base Containment assumption [2]
          • [SQL Server] equivalent TF 9476 [4]
  • {type} table hints
    • none enforced currently
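
A minimal T-SQL sketch for the join hints above, using the hypothetical tables dbo.FactSales and dbo.DimCustomer (the exact syntax should be verified against the documentation referenced below):

-- force a hash join between the two tables
SELECT s.SalesKey, c.CustomerName
FROM dbo.FactSales AS s
     INNER HASH JOIN dbo.DimCustomer AS c
       ON s.CustomerKey = c.CustomerKey;

-- REPLICATE: broadcast the right side of the INNER join to all distribution nodes
SELECT s.SalesKey, c.CustomerName
FROM dbo.FactSales AS s
     INNER REPLICATE JOIN dbo.DimCustomer AS c
       ON s.CustomerKey = c.CustomerKey;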
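
A minimal T-SQL sketch for the query hints above, using the hypothetical tables dbo.Orders and dbo.Customers; FORCE ORDER and MAX_GRANT_PERCENT correspond to the example mentioned earlier, while USE HINT adjusts the cardinality estimation behavior:

SELECT o.OrderID, c.CustomerName
FROM dbo.Orders AS o
     JOIN dbo.Customers AS c
       ON o.CustomerID = c.CustomerID
WHERE o.OrderStatus = 'Open'
OPTION (FORCE ORDER,
        MAX_GRANT_PERCENT = 20,
        USE HINT ('ASSUME_MIN_SELECTIVITY_FOR_FILTER_ESTIMATES'));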

References:
[1] Microsoft Learn (2025) SQL: Hints (T-SQL) [link]
[2] Microsoft Fabric Updates Blog (2025) Hints in Fabric Data Warehouse [link]
[3] Microsoft Learn (2025) SQL: OPTION clause (T-SQL) [link]
[4] Microsoft Support (2016) KB3189813 - Update introduces USE HINT query hint argument in SQL Server 2016 [link]

Resources:
[R1] Microsoft Learn (2025) SQL Server: Query hints (T-SQL) [link]
[R2] Most Useful Query Hints [link]

Acronyms:
MF - Microsoft Fabric
TF - Trace Flag
QO - Query Optimizer

28 February 2017

🧊Data Warehousing: Data Load Optimization (Part I: A Success Story)

Data Warehousing
Data Warehousing Series

Introduction

This topic has been waiting in the queue for almost two years, since I finished optimizing an existing relational data warehouse within a SQL Server 2012 Enterprise Edition environment. Through various simple techniques I then managed to reduce the running time of the load process by more than 65%, from 9 to 3 hours. That's a considerable performance gain, considering that I didn't have to refactor any business logic implemented in queries.

The ETL (Extract, Transform, Load) solution made use of SSIS (SQL Server Integration Services) packages to load data sequentially from several sources into staging tables, and from staging further into the base tables. Each package was responsible for deleting the data from the staging tables via TRUNCATE, extracting the data 1:1 from the source into the staging tables, then loading the data 1:1 from the staging tables into the base tables. It's the simplest and a relatively effective ETL design, one I also used, with small alterations, for other data warehouse solutions. For months the data load worked smoothly, until data growth and eventually other problems increased the loading time from 5 to 9 hours.

Using TABLOCK Hint

Using SSIS to bulk load data into SQL Server provides a good combination of performance and flexibility. Within a Data Flow, when the "Table Lock" property on the destination is checked, the inserted records are minimally logged, which speeds up the load by a factor of two. The TABLOCK hint can also be used for other insert operations performed outside of SSIS packages. In this case the movement of data from staging into the base tables was performed in plain T-SQL, outside of SSIS packages, and further data processing benefited from this change as well. This optimization step alone provided a 30-40% performance gain.
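
A minimal sketch of such an insert in plain T-SQL, with hypothetical staging and base tables:

-- the TABLOCK hint allows the insert to be minimally logged (bulk-style load)
INSERT INTO dbo.FactSales WITH (TABLOCK)
SELECT SalesKey, CustomerKey, OrderDate, Amount
FROM stg.FactSales;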

Drop/Recreating the Indexes on Big Tables

As the base tables each had several indexes, it proved beneficial to drop the indexes on the big tables (e.g. those with more than 1,000,000 records) before loading the data into the base tables, and to recreate the indexes afterwards. This was done within SSIS and provided an additional 20-30% performance gain over the previous step.
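
A minimal sketch of the pattern, with hypothetical table and index names:

-- drop the nonclustered index before the load
DROP INDEX IX_FactSales_CustomerKey ON dbo.FactSales;

-- load the data (see the TABLOCK example above)
INSERT INTO dbo.FactSales WITH (TABLOCK)
SELECT SalesKey, CustomerKey, OrderDate, Amount
FROM stg.FactSales;

-- recreate the index after the load
CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey
    ON dbo.FactSales (CustomerKey);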

Consolidating the Indexes

Adding missing indexes and removing or consolidating (overlapping) indexes are typical index maintenance tasks that are apparently ignored occasionally. They don't always bring as much performance as the previous methods, though dropping and consolidating some indexes proved beneficial because less data had to be maintained. The data processing logic benefited from the creation of new indexes as well.
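
For illustration, two hypothetical overlapping indexes consolidated into one:

-- the narrow index on (CustomerKey) is covered by the wider one below, so it can be dropped
CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey_OrderDate
    ON dbo.FactSales (CustomerKey, OrderDate);

DROP INDEX IX_FactSales_CustomerKey ON dbo.FactSales;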

Running Packages in Parallel

As the packages were run sequentially (one package at a time), the data load was hardly taking advantage of the processing power available on the server. Even if queries could use parallelism, the benefit was minimal. Enabling packages to run in parallel added a further performance gain, though it reduced the availability of processing resources for other tasks. When the data load is performed overnight this causes minimal overhead, however it should be avoided when the data are loaded during business hours.

Using Clustered Indexes

In my analysis I found out that many tables, especially the ones storing prepared data, were lacking a clustered index, even if further indexes were built on them. I remember that years back there was a (false) myth that fact and/or dimension tables don't need clustered indexes in SQL Server. Of course, clustered indexes have downsides (e.g. fragmentation, excessive key lookups), though their benefits far exceed the downsides. Besides the missing clustered indexes, there were cases in which tables would have benefited from a narrow clustered index instead of a wide multicolumn one. Such cases were addressed as well.
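
For illustration, a narrow clustered index built on a hypothetical surrogate key:

-- a narrow clustered index on the surrogate key instead of a wide multicolumn one
CREATE CLUSTERED INDEX CIX_FactSales_SalesKey
    ON dbo.FactSales (SalesKey);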

Removing the Staging Tables

Given the fact that the source and target systems are in the same virtual environment, and the data are loaded 1:1 between the various layers, without further transformations and conversions, one could load the data directly into the base tables. After some tests I came to the conclusion that the load from the source tables into the staging tables and the load from the staging tables into the base tables (with the TABLOCK hint) took almost the same amount of time. This means that the base tables would be unavailable for about the same amount of time if the data were loaded from the sources directly into them. Therefore, one could in theory remove the staging tables from the architecture. Frankly, one should think twice before making such a change, as there can be further implications over time. Even if today the data are imported 1:1, in the future this could change.

Reducing the Data Volume

Reducing the data volume was identified as a possible further technique to reduce the amount of time needed for data loading. A data warehouse is built based on a set of requirements and presumptions that change over time. It can happen for example that even if the reports need only 1-2 years’ worth of data, the data load considers a much bigger timeframe. Some systems can have up to 5-10 years’ worth of data. Loading all data without a specific requirement leads to waste of resources and bigger load times. Limiting the transactional data to a given timeframe can make a considerable difference. Additionally, there are historical data that have the potential to be archived.
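
For illustration, the extraction query could be limited to the last two years of data (hypothetical source table):

SELECT SalesKey, CustomerKey, OrderDate, Amount
FROM src.Sales
WHERE OrderDate >= DATEADD(year, -2, CAST(GETDATE() AS date));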

There are also tables for which a weekly or monthly refresh would suffice. Some tables or even data sources can become obsolete, yet they continue to be loaded into the data warehouse. Such cases occur seldom, though they occur. Some unused or redundant columns could also have been removed from the packages.

Further Thoughts

There are further techniques to optimize the data load within a data warehouse, like partitioning large tables, using columnstore indexes or optimizing the storage, however my target was to provide a sufficient performance gain with a minimum of effort and design changes. Therefore I stopped when I considered that the amount of effort was considerably higher than the performance gain.

Further Reading:
[1] TechNet (2009) The Data Loading Performance Guide, by Thomas Kejser, Peter Carlin & Stuart Ozer (link)
[2] MSDN (2010) Best Practices for Data Warehousing with SQL Server 2008 R2, by Mark Whitehorn, Keith Burns & Eric N Hanson (link)
[3] MSDN (2012) Whitepaper: Fast Track Data Warehouse Reference Guide for SQL Server 2012, by Eric Kraemer, Mike Bassett, Eric Lemoine & Dave Withers (link)
[4] MSDN (2008) Best Practices for Data Warehousing with SQL Server 2008, by Mark Whitehorn & Keith Burns (link)
[5] TechNet (2005) Strategies for Partitioning Relational Data Warehouses in Microsoft SQL Server, by Gandhi Swaminathan (link)
[6] SQL Server Customer Advisory Team (2013) Top 10 Best Practices for Building a Large Scale Relational Data Warehouse (link)

