Showing posts with label queries. Show all posts

20 February 2025

💠🛠️🗒️SQL Server: Folding [Notes]

Disclaimer: This is a work in progress intended to consolidate information from various sources. It considers only on-premises SQL Server; for other platforms please refer to the documentation.

Last updated: 20-Feb-2025

[SQL Server] Folding

  • {def} the process by which the optimizer is able to properly determine certain types of indexable expressions even when the column in the expression is involved in a subexpression or nested in a function 
    • is an optimization over older versions of SQL Server in which the optimizer was unable to use an index to service a query clause when the table column was involved in an expression or buried in a function [1]
  • {type} constant folding 
    • some constant expressions are evaluated early, at compile time
    • foldable constant expressions [2]
      • arithmetical expressions 
      • logical expressions 
      • built-in functions whose input doesn’t depend on contextual information
        • e.g. SET options, language settings, database options, encryption keys
        • deterministic built-in functions are foldable, with some exceptions
      • certain forms of the LIKE predicate
      • [SQL Server 2012+] deterministic methods of CLR user-defined types [3]
      • [SQL Server 2012+] deterministic scalar-valued CLR user-defined functions [3]
    • nonfoldable expressions [2]
      • expressions whose results depend on a local variable or parameter
      • user-defined functions
        • both T-SQL and CLR
      • expressions whose results depend on language settings.
      • expressions whose results depend on SET options.
      • expressions whose results depend on server configuration options.
      • nonconstant expressions such as an expression whose result depends on the value of a column.
      • nondeterministic functions
      • if the output is a large object type, then the expressions are not folded 
        • e.g. text, image, nvarchar(max), varchar(max), varbinary(max), XML
    • {benefit} the expression does not have to be evaluated repeatedly at run time [2]
    • {benefit} the value of the expression after it is evaluated is used by the query optimizer to estimate the size of the result set of the portion of the query [2]
      • e.g. TotalDue > 117.00 + 1000.00 is folded to TotalDue > 1117.00 (see the sketch after these notes)
  • {type} nonconstant folding
    • some expressions that are not constant folded, but whose arguments are known at compile time (whether the arguments are parameters or constants), are evaluated by the result-set size (cardinality) estimator that is part of the optimizer during optimization [2]
    • deterministic functions: 
      • e.g. UPPER, LOWER, RTRIM
      • e.g. DATEPART (YY only), GETDATE, CAST, CONVERT
    • operators 
      • arithmetic operators: +, -, *, /, unary -
      • logical operators: AND, OR, NOT
      • comparison operators: <, >, <=, >=, <>, LIKE, IS NULL, IS NOT NULL
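
A minimal T-SQL sketch of the difference, assuming the AdventureWorks Sales.SalesOrderHeader table behind the TotalDue example from [2]; the literal arithmetic is foldable, while the equivalent expression based on a local variable is not:

-- foldable: 117.00 + 1000.00 is evaluated once at compile time,
-- and the folded value (1117.00) is also used for cardinality estimation
SELECT SalesOrderID
, TotalDue
FROM Sales.SalesOrderHeader
WHERE TotalDue > 117.00 + 1000.00;

-- nonfoldable: the result depends on a local variable,
-- therefore the expression is evaluated at run time
DECLARE @Amount money = 1000.00;

SELECT SalesOrderID
, TotalDue
FROM Sales.SalesOrderHeader
WHERE TotalDue > 117.00 + @Amount;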

Previous Post <<||>> Next Post

References:
[1] Ken Henderson (2003) Guru's Guide to SQL Server Architecture and Internals
[2] Microsoft Learn (2012) SQL Server: Troubleshooting Poor Query Performance: Constant Folding and Expression Evaluation During Cardinality Estimation [link]
[3] Microsoft Learn (2025) SQL Server: Query processing architecture guide [link]
[4] SQLShack (2021) Query Optimization in SQL Server for beginners, by Esat Erkec [link]

08 February 2025

🌌🏭KQL Reloaded: First Steps (Part X: Translating SQL to KQL - Correlated Subqueries)

In SQL Server and other RDBMS there are many scenarios in which one needs to retrieve information from a fact table filtered based on a dimension table, without requiring any information from the dimension table in the output. 

Correlated Subquery via EXISTS

Before considering the main example, let's start with a simple subquery:

// subquery in SQL
--
explain
SELECT CustomerKey
, ProductKey
, ProductName
FROM NewSales 
WHERE ProductKey IN (
    SELECT DISTINCT ProductKey
    FROM Products 
    WHERE ProductSubcategoryName = 'MP4&MP3'
    ) 

// subquery in KQL
NewSales
| where ProductKey in (
    (Products
    | where (ProductSubcategoryName == "MP4&MP3")
    | project ProductKey
    | distinct *))
| project CustomerKey, ProductKey, ProductName

Of course, ProductKey is unique by design, though there can be dimension tables, fact tables or subqueries in which the values are not unique.

Now let's consider the correlated subquery pattern, which should provide the same outcome as above, though in RDBMS there are scenarios in which it provides better performance, especially when the number of values returned by the subquery is high.

// correlated subquery in SQL
--
explain
SELECT CustomerKey
, ProductKey
, ProductName
FROM NewSales 
WHERE EXISTS (
    SELECT Products.ProductKey
    FROM Products 
    WHERE NewSales.ProductKey = Products.ProductKey)

Unfortunately, trying to translate the code via explain leads to the following error, which confirms that the syntax is not supported in KQL (see [1]):

"Error: Reference to missing column 'NewSales.ProductKey'"

 Fortunately, in this case one can use the first version of the query. 

Correlated Subquery via CROSS APPLY

Before creating the main query, let's look at the inner query and check whether it gets correctly translated to KQL:

// subquery logic
--
explain
SELECT sum(TotalCost) TotalCost 
FROM NewSales 
WHERE DateKey > '20240101' and DateKey <'20240201'

Now, let's bring the logic within the CROSS APPLY:

// correlated subquery in SQL
--
explain
SELECT ProductKey
, ProductName
, TotalCost
FROM Products
    CROSS APPLY (
        SELECT sum(TotalCost) TotalCost 
        FROM NewSales 
        WHERE DateKey > '20240101' and DateKey <'20240201'
          AND Products.ProductKey = NewSales.ProductKey
    ) DAT

Running the above code leads to the following error:

"Sql node of type 'Microsoft.SqlServer.TransactSql.ScriptDom.UnqualifiedJoin' is not implemented"

Unfortunately, many SQL queries are written following this pattern, especially when an OUTER APPLY is used, thus retrieving all the records from the dimension table. 

In this case one can rewrite the query via a LEFT JOIN (respectively a rightouter join once the query is restructured in KQL):

// rewritten query with LEFT JOIN in SQL
--
explain
SELECT PRD.ProductKey
, PRD.ProductName
, SAL.TotalCost
FROM Products PRD
    LEFT JOIN (
        SELECT ProductKey
        , sum(TotalCost) TotalCost 
        FROM NewSales 
        WHERE DateKey > '20240101' and DateKey <'20240201'
        GROUP BY ProductKey
    ) SAL
      ON PRD.ProductKey = SAL.ProductKey

// direct translation of the query
Products
| join kind=leftouter 
    (NewSales
        | where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
        | summarize TotalCost=sum(TotalCost) by ProductKey
        | project ProductKey, TotalCost
    ) on ($left.ProductKey == $right.ProductKey)
| project ProductKey, ProductName, TotalCost
//| where isnull(TotalCost)
//| summarize record_number = count()


// query after restructuring
NewSales
| where ((DateKey > todatetime("20240101")) and (DateKey < todatetime("20240201")))
| summarize TotalCost=sum(TotalCost) by ProductKey
| join kind=rightouter
    (
        Products
        | project ProductKey, ProductName
    ) on ($left.ProductKey == $right.ProductKey)
| project ProductKey, ProductName, TotalCost
//| where isnull(TotalCost)
//| summarize record_number = count()

During transformations it's important to check whether the number of records changes between the various versions of the query (including against the most general version, in which no filtering constraints are applied).

Especially when SQL solutions are planned to be migrated to KQL, it's important to know which query patterns can be used in KQL. 

Happy coding!

Previous Post <<||>>  Next Post

References:
[1] GitHub (2024) Module Ex-01 - Advanced KQL [link]

14 January 2025

🧭Business Intelligence: Perspectives (Part XXII: Queries' Complexity)

Business Intelligence Series

Independently of whether they are standalone or encapsulated in database objects, the queries written can become complex in time, respectively difficult to comprehend and maintain. One can reduce the cognitive load by identifying the aspects that support one’s intuition - order, similarity and origin being probably the most important aspects that help cope with the inherent complexity. 

One should start with the table having the lowest level of detail, usually a transaction table that forms the backbone of a certain feature. For example, for Purchase Orders this could be, depending on the case, the distribution or the line level. If Invoices are added to the logic, and there can be multiple invoice lines for a record from the former logic, then this becomes the new level of detail. Further on, if General Ledger lines are added, most likely this becomes the lowest level of detail, and so on.

This approach allows one to keep a consistent way of building the queries while making it possible to validate the record count and make sure that no duplicates are added by the logic. Conversely, one can also start from the tables with the highest level of aggregation and add tables successively based on their level of detail, though the database engine may occasionally generate a suboptimal plan. In addition, checking the cardinality may involve writing a second query. 
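
A minimal sketch of the approach, with hypothetical Purchase Order and Invoice tables; after each table is added, the record count can be validated against the base table:

-- start from the table with the lowest level of detail (the PO lines)
SELECT POL.PurchaseOrderId
, POL.LineNumber
, POH.VendorId
, INL.InvoiceId
, INL.InvoiceAmount
FROM PurchaseOrderLines POL
     JOIN PurchaseOrderHeaders POH
       ON POL.PurchaseOrderId = POH.PurchaseOrderId
     LEFT JOIN InvoiceLines INL
       ON POL.PurchaseOrderId = INL.PurchaseOrderId
      AND POL.LineNumber = INL.LineNumber
-- if several invoice lines can match a PO line, the invoice line becomes
-- the new level of detail and the record count changes accordingly

-- validating the record count of the base table
SELECT count(*) NumberRecords
FROM PurchaseOrderLines POL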

One should try to keep the tables belonging to the same entity together, when possible (e.g. Purchase Order vs. Vendor information). This approach allows one to reduce the volume of work required to manage, review, test and understand the query later. If the blocks are too big, then occasionally it makes sense to bring pieces of logic into CTEs (Common Table Expressions), or, much better, into views that allow one to better encapsulate and partition the logic.

CTEs allow one to split the logic into logical steps, occasionally making it possible to troubleshoot the logic piece by piece, though one should keep a balance between maintainability and detail. In extremis, people may unnecessarily create an additional CTE for each join. The longer and the more fragmented a query, the more difficult it becomes to troubleshoot and even to understand. Readability can be better achieved through indentation, by keeping together the things that belong together, respectively by partitioning the logic into logical blocks that derive from the model. 
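
A small sketch of splitting the logic into CTEs as logical steps (the tables and columns are hypothetical):

-- one CTE per logical step, not per join
WITH Orders AS (
    -- the base level of detail
    SELECT POL.PurchaseOrderId
    , POL.LineNumber
    , POL.Quantity
    FROM PurchaseOrderLines POL
    WHERE POL.OrderDate >= '20240101'
)
, Receipts AS (
    -- the receipts aggregated at the same level of detail
    SELECT RCT.PurchaseOrderId
    , RCT.LineNumber
    , sum(RCT.Quantity) ReceivedQuantity
    FROM ReceiptLines RCT
    GROUP BY RCT.PurchaseOrderId
    , RCT.LineNumber
)
SELECT ORD.PurchaseOrderId
, ORD.LineNumber
, ORD.Quantity
, RCP.ReceivedQuantity
FROM Orders ORD
     LEFT JOIN Receipts RCP
       ON ORD.PurchaseOrderId = RCP.PurchaseOrderId
      AND ORD.LineNumber = RCP.LineNumber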

Keeping things together should be applied also to the other elements. For example, the join constraints should be limited only to the fields participating in the join (and, if possible, all the other constraints should be brought in the WHERE clause). Bringing the join constraints in the WHERE clause, an approach used in the past, decreases query readability no matter how well the WHERE clause is structured, and occasionally can lead to constraints’ duplication or worse, to missing pieces of logic. 
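
A short sketch of the contrast (hypothetical tables), with the join constraints limited to the ON clause and the other constraints kept in the WHERE clause, respectively the older style that mixes them in the WHERE clause:

-- join constraints limited to the fields participating in the join,
-- the other constraints kept in the WHERE clause
SELECT POH.PurchaseOrderId
, VEN.VendorName
FROM PurchaseOrderHeaders POH
     JOIN Vendors VEN
       ON POH.VendorId = VEN.VendorId
WHERE POH.OrderDate >= '20240101'
  AND VEN.Country = 'DE'

-- the older style mixes join and filter constraints in the WHERE clause,
-- which decreases readability
SELECT POH.PurchaseOrderId
, VEN.VendorName
FROM PurchaseOrderHeaders POH
, Vendors VEN
WHERE POH.VendorId = VEN.VendorId
  AND POH.OrderDate >= '20240101'
  AND VEN.Country = 'DE'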

The order of the attributes should follow a logic that makes sense. One approach is to start from the tables with lowest cardinality that identify entities uniquely and move to the attributes that have a higher level of detail. However, not all attributes are equally important, and thus one might have to compromise and create multiple groups of data coming from different levels. 

One should keep in mind that the more random the order of the attributes is, the more difficult it becomes to validate the data, as one needs to jump multiple times either in the query or in the mask. Ideally, one should find a balance between the two perspectives. Having an intuitive logic for how the attributes are listed increases the queries’ readability, maintainability and ease of troubleshooting. The more random the attributes’ listing, the higher the effort required for the same. 

Previous Post  <<||>>   Next Post

27 January 2024

Data Science: Back to the Future I (About Beginnings)

Data Science Series

I've attended again, after several years, a webcast on performance improvement in SQL Server with Claudio Silva, “Writing T-SQL code for the engine, not for you”. The session was great and I really enjoyed it! I recommend it to any data(base) professional, even if some of the scenarios presented should be known already.

It's strange to see the same topics from 20-25 years ago reappearing over and over again despite the advancements made in the area of database engines. Each version of SQL Server brought something new in terms of performance, though without some good experience and an understanding of the basic optimization and troubleshooting techniques there's little overall improvement for the average data professional in terms of writing and tuning queries!

Especially with the boom of Data Science topics, the volume of material on SQL increased considerably and many discover how easy it is to write queries, even if the start might be challenging for some. Writing a query is easy indeed, though writing a performant query requires, besides the language itself, some knowledge about the database engine and the various techniques used for troubleshooting and optimization. It's not about knowing in advance what the engine will do - the engine will often surprise you - but about knowing what techniques work, in what cases, what their advantages and disadvantages are, respectively how they might impact the processing.

To make a parallel with writing literature, it's not enough to speak a language; one needs more to become a writer, and there are so many levels of mastery! However, in the database world, even if creativity is welcome, its role is considerably diminished by the constraints existing in the database engine, the problems to be solved, and the time and resources available. More importantly, one needs to understand some of the rules and know how to use the building blocks to solve problems and build reliable solutions.

The learning process for newbies focuses mainly on the language itself, while the exposure to complexity is kept to a minimum. For some learners the problems start when writing queries based on multiple tables - what joins to use, in what order, how to structure the queries, what database objects to use for encapsulating the code, etc. Even if there are some guidelines and best practices, the learner must walk the path and experiment alone or in an organized setup.

In university courses the focus is on relational algebra and operators, algorithms, and on general database technologies and architectures, without much hands-on experience. Everything is too theoretical and abstract, which is acceptable for research purposes, but not for the contact with the real world out there! Probably some labs offer exposure to real-life scenarios, though what to cover first in the few hours scheduled for them?

This was the state of the art when I started to learn SQL a quarter century ago, and besides the current tendency of cutting corners, the increased confidence gained from doing a few tests, and the eagerness of shouting one’s shaky knowledge and more or less orthodox ideas on the various social networks, nothing seems to have changed! Something did change – the increased complexity of the problems to solve, and, considering the recent technological advances, one can now afford an AI learning buddy to write some code for us based on the information provided in the prompt.

This opens opportunities for learning and growth. AI can be used in the learning process by providing additional curricula for learners to dive deeper into some topics. Moreover, it can help us over time to address the challenges of the ever-increasing complexity of the problems.

09 February 2010

🕋Data Warehousing: Dimensional Model (Definitions)

"A dimensional model is a form of data modeling that packages data according to specific business queries and processes. The goals are business user understandability and multidimensional query performance." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A data model influenced by performance considerations and optimized for data access and retrieval. This is a form of entity-relationship (ER) modeling that’s limited to working with “measurement” types of subjects and has a prescribed set of allowable structures." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A dimensional model contains a central fact table and a set of surrounding dimension tables, each corresponding to one of the components or dimensions of the fact table." (James Yao et al, "Development and Design Methodologies in DWM", 2008)

"A data model organized for the purpose of user understandability and high performance. In a relational database, a dimensional model is a star join schema characterized by a central fact table with a multi-part key. The components of this key are joined to a set of dimension tables, each defined by its own primary keys. In a multi-dimensional database, a dimensional model is the database cube." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"A data model that represents data in a star-like structure of only one-to-many relationships, where each entity has either all relationships having the ‘one’ side or the ‘many’ side." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A specialized type of physical data model particular to a retrieval-only database design, commonly used in Data Warehouses and data marts, where de-normalized fact tables are linked to dimension tables. Star schemas and snowflake schemas are examples of dimensional models." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

28 January 2010

💎SQL Reloaded: Query Writing I (Best Practices)

In general, when creating queries, the following best practices should be considered:
- Use ANSI-compliant syntax as much as possible.
- Use aliases for all the tables.
- Design for performance, reusability and security.
- Reduce unnecessary network traffic.
- Group together the attributes coming from the same table (see the sketch after this list).
- Rename columns in order to avoid confusion, though don’t overreact with this practice.
- Use a uniform coding style, formatting and naming conventions.
- Use indexed join predicates.
- Learn about the strengths/weaknesses of each feature before using it.
- Use versioning & keep older versions.
- Test the queries.
- Write unit tests first [2].
- Handle missing values (NULLs).
- Document database objects inline and specific documents (e.g. Data Dictionaries, Functional Specifications).
- Refactor code.
- Use SQL tuning tools.
- Write tiny chunks of code: encapsulate formulas and business logic in functions [2], avoid inline scalar functions.
- Defensive coding: use exception handling, consider all scenarios.
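
A small sketch illustrating a few of the above practices - aliases for all tables, attributes grouped by source table, missing values handled explicitly (the names are hypothetical):

SELECT POH.PurchaseOrderId
, POH.OrderDate
, VEN.VendorName
, IsNull(VEN.VendorGroup, 'n/a') VendorGroup
FROM PurchaseOrderHeaders POH
     LEFT JOIN Vendors VEN
       ON POH.VendorId = VEN.VendorId
WHERE POH.OrderDate >= '20240101'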


Things to avoid:
- Complex expressions in search conditions [1].
- Join predicates on expressions [1].
- Expressions over columns in local predicates [1].
- Data types mismatches on join columns [1].
- Non-equality join predicates [1].
- Unnecessary outer joins [1].
- Redundant predicates [1].
- Multiple aggregations with DISTINCT.
- Build queries dynamically unless necessary.
- Techniques that lead to full-table scans: functions that prevent index usage or perform poorly on indexes, wildcards at the beginning of a search pattern (see the sketch after this list).
- Use more attributes/records than needed (particular case: using * instead of specifying the attributes).
- Using nested views.
- Rely entirely on the code created by wizards and other automation tools.
- UNION unless really needed: use UNION ALL.
- UNIONs instead of condition-based code (e.g. CASE, DECODE) or self-joins.
- Using temporary tables.
- Server side cursors.
- Procedural queries (e.g. loops, cursors) rather than using set-based queries.
- Redundant logic/code.
- Negations on constraints.
- Code facilitating SQL injection: use parameterized objects.
- Hard-coding values.
- Undocumented functionality.
- Use GROUP BY in the final query when it could be applied already in a sub-query.
- Multiple self-joins/joins to same table instead of GROUP BY.
- Repetitive calls to the same function and same parameters.
- Use constants in ORDER BY clause.
- Create too many versions of the same query.
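
A short sketch for two of the above points, assuming a hypothetical table with indexes on the OrderDate and VendorName columns:

-- an expression over the column prevents an index seek
SELECT PurchaseOrderId
FROM PurchaseOrderHeaders
WHERE year(OrderDate) = 2024

-- rewriting the predicate as a range keeps it indexable
SELECT PurchaseOrderId
FROM PurchaseOrderHeaders
WHERE OrderDate >= '20240101'
  AND OrderDate < '20250101'

-- a wildcard at the beginning of the pattern leads to a scan
SELECT PurchaseOrderId
FROM PurchaseOrderHeaders
WHERE VendorName LIKE '%GmbH'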


Note:
Please note that there are situations and situations: a technique not recommended in general could prove to offer better performance than the alternatives (e.g. recursion simulated with a temporary table vs. hierarchical self-joins on SQL Server 2000), while for other techniques there are several aspects that need to be considered, for example the trade-off between performance and reusability. Even if most of the database vendors adhere to the ANSI SQL standard, in the end each functionality could be implemented differently and vendors could provide additional functionality; therefore, vendor-specific best practices should be considered for each functionality/vendor.

References:
[1] IBM (2010) Best Practice - Writing and Tuning Queries for Optimal Performance [link]
[2] Oracle (2009) Cleaning Up PL/SQL Practices, by S. Feuerstein [link]

Resources:
Microsoft TechNet (2010) SQL Server - Best Practices [link]
