
17 March 2024

Business Intelligence: Data Products (Part II: The Complexity Challenge)

Business Intelligence
Business Intelligence Series

Creating data products within a data mesh comes down to "partitioning" a given set of inputs, outputs and transformations to create something that resembles a Lego structure, in which each Lego piece represents a data product. The word partition is used improperly, as there can be overlap in terms of inputs, outputs and transformations, though in an ideal solution the outcome should be close to a partition.

Even if the complexity of the inputs and outputs can be neglected, no matter how big their number, the same cannot be said about the transformations that must be performed in the process. Moreover, the transformations involve reengineering the logic built into the source systems, which is not a trivial task and must involve adequate testing. The transformations are a must and there's no way to avoid them.

When designing a data warehouse or data mart, one of the goals is to keep the redundancy of the transformations and of the intermediary results to a minimum, to minimize the unnecessary duplication of code and data. Code duplication usually becomes an issue when the logic needs to be changed, and in business contexts that can happen often enough to create further challenges. Data duplication becomes an issue when the copies are not in sync, which results from code that is not synchronized or from different refresh rates.

Building the transformations as SQL-based database objects has its advantages. There were many attempts at providing non-SQL operators for the same purpose (in SSIS, Power Query), though the solutions built on them are difficult to troubleshoot and maintain, the overall complexity increasing with the volume of transformations that must be performed. In data meshes, the complexity also increases with the number of data products involved, especially when there are multiple stakeholders and different goals involved (see the challenges of developing data marts that are supposed to be domain-specific).

Organizations answer growing complexity with more complexity. On one side stand the teams of developers, business users and other members of the governance teams, who together with the solution create an ecosystem. On the other side stand the inherent coordination and organization meetings, the management of proposals, the negotiation of the scope of the data products, their design, testing, etc. The more complex the whole ecosystem becomes, the higher the chances for systemic errors to occur and multiply, respectively to create unwanted behavior in the parties involved. Ecosystems are challenging to monitor and manage.

The more complex the architecture, the higher the chances of failure. Even if some organizations might succeed, it doesn't mean that such an endeavor is for everybody - a certain maturity in building data architectures, data-based artefacts and managing projects must exist in the organization. Many organizations fail at addressing basic analytical requirements, so why would one think that they are capable of handling increased complexity? Even if one breaks the complexity of a data warehouse into more manageable units, the complexity is just moved to other levels that are more difficult to manage as a whole.

Being able to audit and test each data product individually has its advantages, though when a data product becomes part of an aggregate it can easily get lost in the bigger picture. Thus, a global observability framework is needed that allows monitoring the performance and health of each data product within the aggregate. Besides that, event brokers and other mechanisms are needed to handle failure, availability, security, etc.

Data products make sense in certain scenarios, especially when the complexity of architectures is manageable, though attempting to redesign everything from their perspective is like having a hammer in one's hand and treating everything like a nail.


08 December 2023

SQL Reloaded: Monitoring the Microsoft Fabric Warehouse I

While exploring the Microsoft Fabric Warehouse this week, I observed that there are three views available under the queryinsights schema: exec_requests_history, frequently_run_queries and long_running_queries. According to their definitions, they are based on two database objects, fabric_query_starting and fabric_query_completed, that cannot be called directly.

Announced in the Nov-2023 update, the Query Insights (QI) feature is a "scalable, sustainable, and extendable solution to enhance the SQL analytics experience" (see Microsoft's documentation).

Strangely, the three views appear in the Model view together with the objects defined in the dbo or other user-defined schemas. One can hide the queryinsights objects, however doing this in each model is impractical, especially as the number of objects defined in the respective schema will increase over time. On the other side, the respective objects might be useful for building a report that visualizes the queries' performance. (Probably, a multi-model solution and/or further settings will allow more flexibility.)

Secondly, the objects from the queryinsights schema are not available via the sys.objects catalog view:

SELECT top 10 *
FROM sys.objects 
WHERE name LIKE 'fabric%'
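
As a cross-check, one can also look whether the schema and its objects are exposed via other catalog views. The following is only a sketch, assuming that sys.schemas and sys.all_objects behave in Fabric as in standard SQL Server:

-- checking the queryinsights schema and its objects in the catalog (sketch)
SELECT s.name schema_name
, o.name object_name
, o.type_desc
FROM sys.schemas s
     LEFT JOIN sys.all_objects o
       ON s.schema_id = o.schema_id
WHERE s.name = 'queryinsights'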

The exec_requests_history view (similar to the dm_exec_requests_history DMV from the standard SQL Server) references several DMVs, and it would be useful to retrieve the corresponding information within the same query:

 -- fabric warehouse
 SELECT erh.distributed_statement_id
, erh.start_time
, erh.end_time
, erh.total_elapsed_time_ms
--, erh.login_name
, erh.row_count
, erh.status
, erh.session_id
, erh.connection_id
, erh.program_name
, erh.batch_id
, erh.root_batch_id
, erh.query_hash
, erh.command 
FROM queryinsights.exec_requests_history erh
WHERE status = 'Succeeded'--'Failed'
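
The same view can also be aggregated to spot the statements run most often or taking the longest. The following is only a sketch based on the columns listed above; the frequently_run_queries and long_running_queries views provide similar aggregations out of the box, so it's mainly useful when a different grouping or filter is needed:

-- most expensive statements by total elapsed time (sketch)
SELECT TOP (20) erh.query_hash
, MIN(erh.command) sample_command
, COUNT(*) no_runs
, SUM(erh.total_elapsed_time_ms) total_elapsed_time_ms
, AVG(erh.total_elapsed_time_ms) avg_elapsed_time_ms
, SUM(erh.row_count) total_row_count
, MIN(erh.start_time) first_run
, MAX(erh.start_time) last_run
FROM queryinsights.exec_requests_history erh
WHERE erh.status = 'Succeeded'
GROUP BY erh.query_hash
ORDER BY total_elapsed_time_ms DESC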

However, attempting to retrieve session information via the sys.dm_exec_sessions DMV leads to the below error message:

SELECT *
FROM queryinsights.exec_requests_history erh
     LEFT JOIN sys.dm_exec_sessions ses
       ON erh.session_id = ses.session_id
WHERE erh.status = 'Succeeded'--'Failed'

The query references an object that is not supported in distributed processing mode.
Msg 15816, Level 16, State 7, Code line 11

Using the standard SQL Server system functions seems to work, as long as the view from the queryinsights schema is not involved:
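
For example, a standalone check along these lines (a minimal sketch, assuming the functions behave in Fabric as in standard SQL Server) should work:

-- standard system functions called on their own (sketch)
SELECT db_name() [database_name]
, schema_name() [default_schema]
, user_name() [user_name]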

According to the documentation, "Some objects, like system views, and functions can't be used while you query data stored in Azure Data Lake or Azure Cosmos DB analytical storage. Avoid using the queries that join external data with system views, load external data in a temp table, or use some security or metadata functions to filter external data."

One can presume thus that fabric_query_starting and fabric_query_completed are stored in the Data Lake and behave like standard user-defined tables. Unfortunately, no documentation seems to be available on this.

I tried using a temporary table, as advised above:

-- dropping the temp table
--DROP TABLE IF EXISTS dbo.#requests_history;

-- create the temp table
CREATE TABLE dbo.#requests_history (
  distributed_statement_id uniqueidentifier
, start_time datetime2(6)
, end_time datetime2(6)
, total_elapsed_time_ms bigint
, login_name varchar(255)
, row_count bigint
, status varchar(50)
, session_id bigint
, connection_id uniqueidentifier
, program_name varchar(255)
, batch_id uniqueidentifier
, root_batch_id uniqueidentifier
, query_hash uniqueidentifier
, command varchar(max)
)

-- inserting a few records
INSERT INTO dbo.#requests_history
SELECT erh.distributed_statement_id
, erh.start_time
, erh.end_time
, erh.total_elapsed_time_ms
, erh.login_name
, erh.row_count
, erh.status
, erh.session_id
, erh.connection_id
, erh.program_name
, erh.batch_id
, erh.root_batch_id
, erh.query_hash
, erh.command 
FROM queryinsights.exec_requests_history erh;

-- retrieve the inserted records
SELECT *
FROM dbo.#requests_history;

Unfortunately, the attempt led to the same error message. Investigating the issue further, I arrived at a known issue: "Temp table usage in Data Warehouse and SQL analytics endpoint". Hopefully, the fix will address this scenario as well. Otherwise, it might be easier to import the data into another solution (e.g. Power BI) and do the analysis there.

Using temporary tables with DMVs seems to work (see post).
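
For example, here's a sketch following the same pattern as above, assuming sys.dm_exec_sessions can be queried on its own in the warehouse:

-- loading a DMV into a temporary table (sketch)
CREATE TABLE dbo.#sessions (
  session_id int
, login_name varchar(255)
, program_name varchar(255)
, login_time datetime2(6)
)

INSERT INTO dbo.#sessions
SELECT ses.session_id
, ses.login_name
, ses.program_name
, ses.login_time
FROM sys.dm_exec_sessions ses;

-- retrieve the inserted records
SELECT *
FROM dbo.#sessions;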

Please note that the values are case sensitive and only a subset of the standard data types are supported (see documentation).

You might want to check also the queries from the SQL Server System Catalog.

Happy coding!

20 January 2023

SQL Reloaded: Monitoring the Synapse serverless SQL pool with Dynamic Management Views II

Identifying the SQL Server DMVs accessible for the serverless SQL pool (see the previous post) allowed me to identify, besides sys.dm_exec_requests_history, three more DMVs with statistics on the statements executed on the server: sys.dm_request_phases, sys.dm_request_phases_task_group_stats and sys.dm_request_phases_exec_task_stats. Unfortunately, there seems to be no documentation available on these DMVs and, at the time this post was written, no further hits on google.com or bing.com could be found on them either.

sys.dm_request_phases

sys.dm_request_phases provides insight into the phases an execution statement goes through, and seems to summarize the other two views:

-- Azure Serverless SQL pool: request phases
SELECT TOP (100) dist_statement_id
, RPH.dist_request_id
, TRY_CAST(RPH.id as bigint) id
, TRY_CAST(RPH.parent_ids as bigint) parent_ids
, RPH.start_time
, RPH.end_time
--, RPH.total_elapsed_time_ms
--, RPH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
--, RPH.min_time_ms
--, RPH.min_time_ms/1000.0 min_time_sec
--, RPH.max_time_ms
--, RPH.max_time_ms/1000.0 max_time_sec
--, RPH.avg_time_ms
, RPH.avg_time_ms/1000.0 avg_time_sec
--, RPH.stdev_time_ms
--, RPH.stdev_time_ms/1000.0 stdev_time_sec -- it has no values
--, RPH.min_rows
--, RPH.max_rows
--, RPH.avg_rows
--, RPH.stdev_rows -- it has no values
, RPH.total_rows
--, RPH.total_bytes_processed
, RPH.total_bytes_processed/1028.0 total_kb_processed
, RPH.state_desc
, RPH.operation_type
, RPH.input_dop
, RPH.output_dop
, RPH.task_retries
, RPH.error_id
FROM sys.dm_request_phases RPH
ORDER BY Id
dist_statement_id | dist_request_id | id | parent_ids | start_time | end_time | avg_time_sec | total_rows | total_kb_processed | state_desc | operation_type | input_dop | output_dop | task_retries | error_id
8C4386DC... | 820E9FC6... | 1 | 2 | ...09:58:34.213 | ...09:58:36.337 | 1.031 | 2030 | 2343.310311 | succeeded | Shuffle | 1 | 1 | 0 | 0
8C4386DC... | 820E9FC6... | 2 | 0 | ...09:58:36.447 | ...09:58:39.713 | 1.891 | 9 | 7145.193579 | succeeded | Return | 1 | 1 | 0 | 0
C9524971... | 680DCB55... | 3 | 4 | ...10:05:46.747 | ...10:05:47.057 | 0.203 | 2030 | 2343.310311 | succeeded | Shuffle | 1 | 1 | 0 | 0
C9524971... | 680DCB55... | 4 | 0 | ...10:05:47.057 | ...10:05:48.480 | 1.406 | 0 | 6630.101167 | succeeded | Return | 1 | 1 | 0 | 0
FD2D17AD... | C9453EF2... | 5 | 6 | ...11:58:54.060 | ...11:58:55.297 | 0.547 | 10 | 1534.098249 | succeeded | ComputeToControlNode | 1 | 1 | 0 | 0
FD2D17AD... | C9453EF2... | 6 | 0 | ...11:58:55.297 | ...11:58:55.420 | 0.125 | 10 | 4.074902 | succeeded | Return | 1 | 1 | 0 | 0
9FB0A268... | CAA533DE... | 7 | 8 | ...11:59:16.483 | ...11:59:16.700 | 0.203 | 2030 | 2343.310311 | succeeded | Shuffle | 1 | 1 | 0 | 0
9FB0A268... | CAA533DE... | 8 | 0 | ...11:59:16.700 | ...11:59:18.640 | 1.922 | 6 | 7143.673151 | succeeded | Return | 1 | 1 | 0 | 0
1732AB0D... | AC1A4F10... | 9 | 10 | ...11:59:25.950 | ...11:59:26.140 | 0.172 | 2030 | 2343.310311 | succeeded | Shuffle | 1 | 1 | 0 | 0
1732AB0D... | AC1A4F10... | 10 | 0 | ...11:59:26.140 | ...11:59:27.450 | 1.297 | 9 | 6635.185797 | succeeded | Return | 1 | 1 | 0 | 0

Notes:
1) The foreign keys and dates (in the above and below queries) were truncated to accommodate all the important attributes in the snapshot of the values returned.
2) Based on the existing queries, there are two records for each executed statement, a Shuffle or ComputeToControlNode followed by a Return (see operation_type). In more complex scenarios there are several Shuffles and Broadcasts followed by a Return. According to the Microsoft team, even if there's no Data Movement Service (DMS) for serverless SQL pools, a similar algorithm is responsible for moving the data between the nodes.
3) Because in the serverless SQL pool each query has its own distributed statement id, the min, max, avg and total columns hold the same values. Therefore, the columns with redundant values were commented out.
4) The id of the request phase seems to have numeric values despite being defined as alphanumeric. I cast the values to bigint for sorting purposes (see also the self-join sketch following these notes).
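
To make the parent-child relationship between the phases visible, one can self-join the DMV on the casted ids. The following is a sketch that assumes parent_ids holds a single id, as in the sample output above:

-- Azure Serverless SQL pool: linking each phase to its parent phase (sketch)
SELECT RPH.dist_statement_id
, TRY_CAST(RPH.id as bigint) id
, RPH.operation_type
, RPH.total_rows
, TRY_CAST(PAR.id as bigint) parent_id
, PAR.operation_type parent_operation_type
FROM sys.dm_request_phases RPH
     LEFT JOIN sys.dm_request_phases PAR
       ON RPH.dist_request_id = PAR.dist_request_id
      AND TRY_CAST(RPH.parent_ids as bigint) = TRY_CAST(PAR.id as bigint)
ORDER BY TRY_CAST(RPH.id as bigint)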

sys.dm_request_phases_task_group_stats

sys.dm_request_phases_task_group_stats stores metadata about the requests' breakdown at task group level:

-- Azure Serverless SQL pool: request phases breakdown at task group
SELECT TOP (100) RPT.dist_request_id
, TRY_CAST(RPT.id as bigint) id
, TRY_CAST(RPT.parent_ids as bigint) parent_ids
, RPT.dist_statement_id
, RPT.state_desc
, RPT.start_time
, RPT.end_time
, RPT.input_dop
, RPT.output_dop
, RPT.operation_type
, RPT.task_retries
FROM sys.dm_request_phases_task_group_stats RPT
ORDER BY id
dist_request_id | id | parent_ids | dist_statement_id | state_desc | start_time | end_time | input_dop | output_dop | operation_type | task_retries
820E9FC6... | 1 | 2 | 8C4386DC... | succeeded | 638098055142132551 | 638098055163382693 | 1 | 1 | Shuffle | 0
820E9FC6... | 2 | 0 | 8C4386DC... | succeeded | 638098055164476163 | 638098055197133001 | 1 | 1 | Return | 0
680DCB55... | 3 | 4 | C9524971... | succeeded | 638098059467450021 | 638098059470574953 | 1 | 1 | Shuffle | 0
680DCB55... | 4 | 0 | C9524971... | succeeded | 638098059470574953 | 638098059484793682 | 1 | 1 | Return | 0
C9453EF2... | 5 | 6 | FD2D17AD... | succeeded | 638098127340607112 | 638098127352951067 | 1 | 1 | ComputeToControlNode | 0
C9453EF2... | 6 | 0 | FD2D17AD... | succeeded | 638098127352951067 | 638098127354202970 | 1 | 1 | Return | 0
CAA533DE... | 7 | 8 | 9FB0A268... | succeeded | 638098127564826084 | 638098127567013504 | 1 | 1 | Shuffle | 0
CAA533DE... | 8 | 0 | 9FB0A268... | succeeded | 638098127567013504 | 638098127586388549 | 1 | 1 | Return | 0
AC1A4F10... | 9 | 10 | 1732AB0D... | succeeded | 638098127659513620 | 638098127661388514 | 1 | 1 | Shuffle | 0
AC1A4F10... | 10 | 0 | 1732AB0D... | succeeded | 638098127661388514 | 638098127674513601 | 1 | 1 | Return | 0

Notes:
1) The DMV seems to return the same number of records as sys.dm_request_phases.
2) Observe the format of the start_time and end_time; probably the timestamps come from the Spark cluster and were not translated into a SQL Server data type (see the conversion sketch following these notes).
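
Assuming that the values are .NET-style ticks (100-nanosecond intervals counted from 0001-01-01) - an assumption not confirmed by any documentation - they can be translated into datetimes as in the sketch below, where 621355968000000000 is the number of ticks corresponding to 1970-01-01. The first record above translates this way to a start time of 09:58:34, which matches the corresponding value in sys.dm_request_phases:

-- Azure Serverless SQL pool: translating the (presumed) tick values into datetimes (sketch)
SELECT RPT.dist_request_id
, TRY_CAST(RPT.id as bigint) id
, RPT.start_time
, DATEADD(second, CAST((TRY_CAST(RPT.start_time as bigint) - 621355968000000000)/10000000 as int), CONVERT(datetime2(3), '1970-01-01')) start_time_utc -- truncated to seconds
, DATEADD(second, CAST((TRY_CAST(RPT.end_time as bigint) - 621355968000000000)/10000000 as int), CONVERT(datetime2(3), '1970-01-01')) end_time_utc -- truncated to seconds
FROM sys.dm_request_phases_task_group_stats RPT
ORDER BY id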

sys.dm_request_phases_exec_task_stats

sys.dm_request_phases_exec_task_stats stores metadata about the requests' breakdown at task level:

-- Azure Serverless SQL pool: request phases breakdown at task
SELECT TOP (100) RPE.dist_request_id
, TRY_CAST(RPE.id as bigint) id
--, RPE.min_time_ms
--, RPE.max_time_ms
, RPE.avg_time_ms/1000.0 avg_time_sec
--, RPE.stdev_time_ms
, RPE.total_bytes_processed
--, RPE.min_rows
--, RPE.max_rows
--, RPE.avg_rows
--, RPE.stdev_rows
, RPE.total_rows
, RPE.error_id
FROM sys.dm_request_phases_exec_task_stats RPE
ORDER BY id
dist_request_id | id | avg_time_sec | total_kb_processed | total_rows | error_id
820E9FC6... | 1 | 1.031 | 2343.310311 | 2030 | 0
820E9FC6... | 2 | 1.891 | 7145.193579 | 9 | 0
680DCB55... | 3 | 0.203 | 2343.310311 | 2030 | 0
680DCB55... | 4 | 1.406 | 6630.101167 | 0 | 0
C9453EF2... | 5 | 0.547 | 1534.098249 | 10 | 0
C9453EF2... | 6 | 0.125 | 4.074902 | 10 | 0
CAA533DE... | 7 | 0.203 | 2343.310311 | 2030 | 0
CAA533DE... | 8 | 1.922 | 7143.673151 | 6 | 0
AC1A4F10... | 9 | 0.172 | 2343.310311 | 2030 | 0
AC1A4F10... | 10 | 1.297 | 6635.185797 | 9 | 0

What does all this mean?

The lack of documentation makes it challenging to interpret the values of the views beyond the data and metadata they offer. In a paper on POLARIS, the code name given to the serverless SQL pool engine, a task is defined as "a careful packaging of data and query processing into units [...] that can be readily moved across compute nodes and re-started at the task level" [1]. Therefore, one can assume that this is the level targeted by the sys.dm_request_phases_exec_task_stats DMV. Further on, the tasks are grouped at phase level according to sys.dm_request_phases_task_group_stats, the metadata from the two DMVs being further combined into the sys.dm_request_phases DMV.

If the meaning is kept from dedicated SQL pools, a Shuffle operation indicates that data is moved between the frontend and backend nodes to satisfy a request, while a Return represents the operation of returning the result set to the client. The "ComputeToControlNode" operation involves a simple select (e.g. SELECT TOP 10) from a CETAS table and therefore no "Shuffle" is needed.

Requests' history

Further on, one can use the distributed statement id to join the execution request phases with the requests' history, however matches will be found only for a small subset of the records (probably the executions since the pool was started):
 
-- Azure Serverless SQL pool: requests history with request phase info
SELECT top 100 ERH.status
, ERH.transaction_Id
, ERH.distributed_statement_Id 
, ERH.query_hash 
--, ERH.login_name 
, ERH.start_time
, ERH.end_time 
, ERH.command 
, ERH.query_text 
--, ERH.total_elapsed_time_ms
, ERH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
--, ERH.data_processed_mb
, ERH.data_processed_mb
, RPH.avg_time_ms/1000.0 avg_time_sec
, RPH.total_rows
, RPH.total_bytes_processed/1028.0/1028.0 total_mb_processed
, RPH.state_desc
, RPH.operation_type
, RPH.input_dop
, RPH.output_dop
, RPH.task_retries
, RPH.error_id
, ERH.error
, ERH.error_code 
FROM sys.dm_exec_requests_history ERH
     JOIN sys.dm_request_phases RPH
	   ON ERH.distributed_statement_Id = RPH.dist_statement_id
	  --AND RPH.parent_ids = 0 -- only the parent
ORDER BY RPH.Id DESC

Here's a subset of the result set focusing only on the statistical values:
 
distr_statement_Id | start_time | end_time | total_elapsed_time_sec | data_processed_mb | avg_time_sec | total_rows | total_mb_processed | operation_type | id | parent_ids
{8C4386D... | ...8:24.4300000 | ...8:39.8266666 | 15.396 | 10 | 1.031 | 2030 | 2.279484738326 | Shuffle | 1 | 2
{8C4386D... | ...8:24.4300000 | ...8:39.8266666 | 15.396 | 10 | 1.891 | 9 | 6.950577411478 | Return | 2 | 0
{C952497... | ...5:45.2100000 | ...5:48.4933333 | 3.283 | 10 | 0.203 | 2030 | 2.279484738326 | Shuffle | 3 | 4
{C952497... | ...5:45.2100000 | ...5:48.4933333 | 3.283 | 10 | 1.406 | 0 | 6.449514753891 | Return | 4 | 0
{FD2D17A... | ...8:52.1400000 | ...8:55.4166666 | 3.276 | 10 | 0.547 | 10 | 1.492313471789 | ComputeToControlNode | 5 | 6
{FD2D17A... | ...8:52.1400000 | ...8:55.4166666 | 3.276 | 10 | 0.125 | 10 | 0.003963912451 | Return | 6 | 0
{9FB0A26... | ...9:15.1300000 | ...9:18.6366666 | 3.506 | 10 | 0.203 | 2030 | 2.279484738326 | Shuffle | 7 | 8
{9FB0A26... | ...9:15.1300000 | ...9:18.6366666 | 3.506 | 10 | 1.922 | 6 | 6.949098395914 | Return | 8 | 0
{1732AB0... | ...9:24.6900000 | ...9:27.4500000 | 2.76 | 10 | 0.172 | 2030 | 2.279484738326 | Shuffle | 9 | 10
{1732AB0... | ...9:24.6900000 | ...9:27.4500000 | 2.76 | 10 | 1.297 | 9 | 6.454460892023 | Return | 10 | 0

Notes:
As can be seen, the volume of data processed and the elapsed time values don't match between the two views, though they are close. The differences probably result from further steps occurring in the process.
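
To quantify the differences, the phases can be aggregated per statement and compared with the totals from the requests' history, as in the following sketch (reusing the columns from above):

-- Azure Serverless SQL pool: phase totals vs. requests history (sketch)
SELECT ERH.distributed_statement_Id
, ERH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
, ERH.data_processed_mb
, SUM(RPH.avg_time_ms)/1000.0 phases_time_sec
, SUM(RPH.total_bytes_processed)/1028.0/1028.0 phases_mb_processed
FROM sys.dm_exec_requests_history ERH
     JOIN sys.dm_request_phases RPH
       ON ERH.distributed_statement_Id = RPH.dist_statement_id
GROUP BY ERH.distributed_statement_Id
, ERH.total_elapsed_time_ms
, ERH.data_processed_mb
ORDER BY ERH.total_elapsed_time_ms DESC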

Happy coding!

References:
[1] Josep Aguilar-Saborit, Raghu Ramakrishnan et al, "POLARIS: The Distributed SQL Engine in Azure Synapse", VLDB Conferences. PVLDB, 13(12): 3204 – 3216, 2020, DOI: https://doi.org/10.14778/3415478.3415545

15 January 2023

SQL Reloaded: Monitoring the Synapse serverless SQL pool with Dynamic Management Views I

I sometimes feel like I'm flying blind when I build or troubleshoot SQL queries and don't have the query plan and/or further statistics to understand how the database engine works, why some queries take longer than expected, etc. Unfortunately, the Synapse serverless SQL pool doesn't seem to support showing execution plans in SQL Server Management Studio as of now (SHOWPLAN_XML is not supported for SET). I looked at my old queries based on the sys.dm_exec_requests and sys.dm_exec_query_stats DMVs, however the results didn't prove to be what I was searching for.

This weekend, I found Sidney Cirqueira's post on monitoring Synapse serverless SQL pools where he describes how to do that via the Monitoring hub, DMVs, QPI library, respectively Log Analytics. (You should check regularly the Azure Synapse Analytics Blog as it's full of goodies!)

Thus, I found out that there's a new DMV called sys.dm_exec_requests_history which provides at least the duration and the volume of data processed by each statement run on the service:

-- Azure Serverless SQL pool: requests' history 
SELECT top 100 ERH.status
, ERH.transaction_Id
, ERH.distributed_statement_Id 
, ERH.query_hash 
, ERH.login_name 
, ERH.start_time
, ERH.end_time 
, ERH.command 
, ERH.query_text 
--, ERH.total_elapsed_time_ms
, ERH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
--, ERH.data_processed_mb
, ERH.data_processed_mb/1028.0 data_processed_gb
, ERH.error
, ERH.error_code 
FROM sys.dm_exec_requests_history ERH
ORDER BY ERH.data_processed_mb DESC

It isn't much information compared with the columns returned by sys.dm_exec_requests, but it's something to start with. At least it allows focusing on the queries with the longest duration (use the above query, sorting the records by total_elapsed_time_ms descending) or with the highest volume of data processed:

-- Azure Serverless SQL pool: queries with most data processed
SELECT TOP 50 ERH.query_text  
, COUNT(*) no_runs
, SUM(ERH.total_elapsed_time_ms) total_elapsed_time_ms
, SUM(ERH.data_processed_mb) data_processed_mb
, SUM(ERH.data_processed_mb/1028.0) data_processed_gb
, MIN(ERH.start_time) first_run_date
, MAX(ERH.start_time) last_run_date
FROM sys.dm_exec_requests_history ERH
GROUP BY ERH.query_text
HAVING COUNT(*)>1
ORDER BY data_processed_mb DESC

The same query can be slightly changed to retrieve the volume of data processed by month:
 
-- Azure Serverless SQL pool: data processed by month
SELECT Convert(nvarchar(7), ERH.start_time, 23) [period]
, COUNT(*) no_runs
, SUM(ERH.total_elapsed_time_ms) total_elapsed_time_ms
, SUM(ERH.data_processed_mb) data_processed_mb
, SUM(ERH.data_processed_mb/1028.0) data_processed_gb
, MIN(ERH.start_time) first_run_date
, MAX(ERH.start_time) last_run_date
FROM sys.dm_exec_requests_history ERH
GROUP BY Convert(nvarchar(7), ERH.start_time, 23)
HAVING COUNT(*)>1
ORDER BY data_processed_mb DESC

One can also add the login name to the grouping to break down the analysis by the login that issued the query. The organization's domain can be used to differentiate between system and organization-based queries, as in the sketch below.
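
In the following sketch, 'contoso.com' is only a placeholder for the organization's domain:

-- Azure Serverless SQL pool: data processed by login (sketch)
SELECT ERH.login_name
, CASE WHEN ERH.login_name LIKE '%@contoso.com' THEN 'organization' ELSE 'system' END login_type
, COUNT(*) no_runs
, SUM(ERH.total_elapsed_time_ms) total_elapsed_time_ms
, SUM(ERH.data_processed_mb) data_processed_mb
, SUM(ERH.data_processed_mb/1028.0) data_processed_gb
FROM sys.dm_exec_requests_history ERH
GROUP BY ERH.login_name
, CASE WHEN ERH.login_name LIKE '%@contoso.com' THEN 'organization' ELSE 'system' END
ORDER BY data_processed_mb DESC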

The volume of data processed is stored also in the sys.dm_external_data_processed DMV aggregated for the current day, week, respectively month as part of the cost control related feature:
 
-- Azure Serverless SQL pool: volume of data processed
SELECT type 
, data_processed_mb 
, data_processed_mb/1028.0 data_processed_gb
FROM sys.dm_external_data_processed

And here's what the output looks like:
 
type | data_processed_mb | data_processed_gb
daily | 230 | 0.223735
weekly | 377 | 0.366731
monthly | 223522 | 217.433852

Notes:
1) I still need to play with the DMVs to understand their scope and limitations.
2) The view also appears in the list of DMVs I identified as supported by the Synapse serverless SQL pool. As I discovered later, 3 more DMVs are available with useful statistics.
3) The queries based on the sys.dm_exec_requests and sys.dm_exec_query_stats DMVs seem to return only the query currently running, i.e. the query based on them. (So the DMVs actually seem to work.)
4) The view is also available in SQL Server 2022, though it doesn't seem to be used there.
5) According to the above-mentioned source, the view is provided for ticketing purposes, to help customers better troubleshoot SQL requests. Use the distributed_statement_id in the tickets raised with Microsoft to troubleshoot any issues with Synapse.
6) Unfortunately, the useful Query Store feature is also not yet supported, even if the DMVs related to it seem to be available. Attempting to enable it, e.g. with a statement like the sketch below, results in the error:
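
-- attempting to enable the Query Store (sketch; the database name is a placeholder)
ALTER DATABASE MyServerlessDb
SET QUERY_STORE = ON;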
Msg 15869, Level 16, State 9, Line 1
QUERY_STORE is not supported for ALTER DATABASE

Happy coding!

13 January 2021

SQL Server Administration: Monitoring the Database Logs

One of the aspects to monitor on a SQL Server instance is the size of the logs available for each database, respectively the degree to which the logs are used. Starting with SQL Server 2005 this can be achieved by using the 'Log File(s) Used Size (KB)' and 'Log File(s) Size (KB)' counters via the sys.dm_os_performance_counters DMV as follows:

-- log files - size (kb)
SELECT lfu.instance_name database_name
, lfu.cntr_value size_kb
, Cast(lfu.cntr_value/1024.00 as decimal (18,2)) size_MB
FROM sys.dm_os_performance_counters lfu 
WHERE lfu.counter_name LIKE  'Log File(s) Size (KB)%' 
  AND lfu.object_name LIKE 'SQLServer:Databases%'
  AND lfu.instance_name IN ('tempdb', 'master', 'model', 'msdb')
ORDER BY lfu.instance_name

-- log files - used size (kb)
SELECT lfs.instance_name database_name
, lfs.cntr_value used_size_kb
, Cast(lfs.cntr_value/1024.00 as decimal (18,2)) used_size_MB
FROM sys.dm_os_performance_counters lfs
WHERE lfs.counter_name LIKE  'Log File(s) Used Size (KB)%' 
  AND lfs.object_name LIKE 'SQLServer:Databases%'
  AND lfs.instance_name IN ('tempdb', 'master', 'model', 'msdb')
ORDER BY lfs.instance_name

The two queries can be combined into one as follows:

-- database log space allocation (SQL Server 2005+)
SELECT db.name database_name
, db.log_reuse_wait_desc 
, Cast(lfs.cntr_value/1024.00 as decimal(28,2)) size_MB
, Cast(lfu.cntr_value/1024.00 as decimal(28,2)) AS used_MB
, Cast(100.00*lfu.cntr_value/lfs.cntr_value as decimal(10,2)) used_percent 
, CASE WHEN CAST(lfu.cntr_value AS float) / CAST(lfs.cntr_value AS float) > .5 THEN 
   CASE 
    WHEN db.name = 'tempdb' AND log_reuse_wait_desc NOT IN ('CHECKPOINT', 'NOTHING') THEN 'WARNING'  
    WHEN db.name <> 'tempdb' THEN 'WARNING' 
    ELSE 'OK' 
    END 
  ELSE 'OK' END log_status 
FROM sys.databases db 
     JOIN sys.dm_os_performance_counters lfs 
       ON db.name = lfs.instance_name 
      AND lfs.counter_name LIKE 'Log File(s) Size (KB)%' 
     JOIN sys.dm_os_performance_counters lfu 
       ON db.name = lfu.instance_name 
      AND lfu.counter_name LIKE  'Log File(s) Used Size (KB)%' 
WHERE db.name IN ('tempdb', 'master', 'model', 'msdb')
ORDER BY db.name 

Output:
database_name | log_reuse_wait_desc | size_MB | used_MB | used_percent | log_status
master | NOTHING | 1.99 | 1.11 | 55.78 | WARNING
model | NOTHING | 7.99 | 1.74 | 21.80 | OK
msdb | NOTHING | 28.80 | 1.96 | 6.81 | OK
tempdb | NOTHING | 999.99 | 0.69 | 0.07 | OK

Starting with SQL Server 2012 the same information can be obtained via the sys.dm_db_log_space_usage DMV, however the view returns information only for the current database:

-- getting the log space only for a database (SQL Server 2012+)
SELECT db_name(database_id) database_name 
, Cast(total_log_size_in_bytes/1024.00/1024.00 as decimal(28,2)) size_MB
, Cast(used_log_space_in_bytes/1024.00/1024.00 as decimal(28,2)) used_MB
, Cast(used_log_space_in_percent as decimal(28,2)) used_percent
FROM sys.dm_db_log_space_usage

Output:
database_name | size_MB | used_MB | used_percent
master | 1.99 | 1.19 | 59.61
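
To collect the same figures for all databases with this DMV, one can loop over the databases, e.g. via the undocumented sp_MSforeachdb procedure. The following is only a sketch - the procedure is unsupported and known to occasionally skip databases:

-- collecting the log space usage for all databases (sketch)
-- DROP TABLE IF EXISTS #log_usage;

CREATE TABLE #log_usage (
  database_name sysname
, size_MB decimal(28,2)
, used_MB decimal(28,2)
, used_percent decimal(28,2)
);

EXEC sp_MSforeachdb N'USE [?];
INSERT INTO #log_usage
SELECT db_name(database_id)
, Cast(total_log_size_in_bytes/1024.00/1024.00 as decimal(28,2))
, Cast(used_log_space_in_bytes/1024.00/1024.00 as decimal(28,2))
, Cast(used_log_space_in_percent as decimal(28,2))
FROM sys.dm_db_log_space_usage;';

SELECT *
FROM #log_usage
ORDER BY database_name;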

With less flexibility one can obtain the size in MB and the used percentage by using the DBCC utility as follows:

-- retrieving the log usage for all databases
DBCC SQLPERF(LOGSPACE); 

Output:
Database Name | Log Size (MB) | Log Space Used (%) | Status
master | 1,992188 | 33,13726 | 0
tempdb | 999,9922 | 0,06352589 | 0
model | 7,992188 | 21,11437 | 0
msdb | 28,80469 | 6,617846 | 0
AdventureWorks2014 | 33,99219 | 14,76672 | 0
AdventureWorksDW2014 | 17,99219 | 22,6878 | 0

Notes:
1. All the mentioned objects require VIEW SERVER STATE permissions.
2. The solution based on the performance counters returns slightly different values than the other solutions, though the differences are negligible.

Resources: 
[1] SQL Docs (2017) sys.dm_os_performance_counters [source]
[2] SQL Docs (2017) DBCC SQLPERF [source]
[3] SQL Docs (2017) sys.dm_db_log_space_usage [source]

05 January 2021

ERP Implementations: It’s all about Scope II (Nonfunctional Requirements & MVP)

ERP Implementation

Nonfunctional Requirements

In contrast to functional requirements (FRs), nonfunctional requirements (NFRs) have no direct impact on the system's behavior but affect the end-users' experience with the system, covering topics like performance, usability, reliability, compatibility, security, monitoring, maintainability, testability, respectively other constraints and quality attributes. Even if these requirements are in general addressed by design, the changes made to the system have the potential of impacting the users' experience negatively.

Moreover, the NFRs are usually difficult to quantify, and probably that's why they are seldom made explicit in a formal document or are eventually considered only at a high level. However, one can still find a basis for comparison in compliance requirements, general guidelines, standards, best practices or the legacy system(s) (e.g. the performance should not be worse than in the legacy system, the volume of effort for carrying out the various activities should not increase). Even if they can't be adequately described, it's recommended to list the NFRs in general terms in a formal document (e.g. the implementation contract). Failing to do so can open or widen the risk exposure, especially when the system lacks important support in the respective areas. In addition, these requirements need to be considered during testing and sign-off as well.

Minimum Viable Product (MVP)

Besides considering the gaps with respect to the FRs, it's sometimes important to consider whether the whole functionality is mandatory, especially given the various activities that need to be carried out (parametrization, data migration).

For example, one can target implementing a minimum viable product (MVP) - a version of the product which has just enough features to cover the mandatory or most important FRs. The MVP is based on the idea that implementing about 80% of the needed functionality has in theory the potential of providing a usable product earlier with a minimum of effort (quick wins), assuring that the project's goals and objectives are met, respectively providing a basis for further development. In case of cost overruns, the MVP assures that the business has a workable product and the opportunity of deciding whether it's worth investing more into the project now or later.

The MVP also allows getting early user feedback and integrating it into further enhancements and developments. Often the users understand the capabilities of a system, respectively of an implementation, only when they are able to use the system. As this is a learning process, the learning period can take up to a few months until adequate feedback is available. Therefore, postponing the implementation's continuation by a few months can in theory have a positive impact, however it can also come with drawbacks (e.g. the resources are not available anymore).

A sketch of the MVP usually results from the requirements' prioritization, however the requirements then need to be regarded holistically, as there can be different levels of dependencies between them. In addition, different costs can be incurred if the requirements are handled later, and other constraints may apply as well. Considering an MVP approach can be a double-edged sword. In the worst-case scenario, the business will get only the MVP, with its good and bad characteristics. The business will then be forced to fill the gaps by working outside the system, which can lead to further effort and, in extremis, to poor acceptance of the system. In general, users expect to have their processes fully implemented in the system, an expectation which is not always economically grounded.

After establishing an MVP, one can consider the further requirements (including improvement suggestions) on a cost-benefit basis and implement them accordingly as part of a continuous improvement initiative, even if more time might be required to implement them.




06 July 2020

SQL Server Administration: Undocumented III (SQL Server CPU Utilization via the Ring Buffer)

Introduction

If no proper monitoring solution for the SQL Server and the hosting server is in place to review the CPU utilization, one can use the Scheduler Monitor buffer provided by the undocumented sys.dm_os_ring_buffers dynamic management view (DMV). Introduced with SQL Server 2005, the DMV provides a significant amount of diagnostic memory information in XML form via several buffers: Resource Monitor, Out-of-Memory, Memory Broker, Buffer Pool, respectively the Scheduler Monitor buffer [2]. A ring buffer is a recorded response to a notification [1].

The view changed between the various versions of SQL Server, while further ring buffers related to Always On availability groups were made available as well (see [5]).

Warning:
According to Microsoft (see [4]), sys.dm_os_ring_buffers is provided for information purposes only, and future compatibility beyond SQL Server 2019 is not guaranteed!

Querying the Scheduler Monitor Buffer

Within the Scheduler Monitor buffer, the DMV stores a history of 4 hours of uptime with minute-by-minute data points (in total 256 entries) containing the CPU utilization for the SQL Server, other processes, respectively the system idle time, as percentages. It thus allows identifying the peaks in CPU utilization and determining the intervals to focus on for further troubleshooting. As the data are stored within an XML structure, the values can be queried via the XQuery syntax as follows:

-- cpu utilization for SQL Server and other applications
DECLARE @ts_now bigint = (SELECT cpu_ticks/(cpu_ticks/ms_ticks)
        FROM sys.dm_os_sys_info); 

SELECT DAT.record_id
, DAT.EventTime
, DAT.SQLProcessUtilization 
, DAT.SystemIdle 
, 100 - (DAT.SystemIdle + DAT.SQLProcessUtilization) OtherUtilization
FROM ( 
	SELECT record.value('(./Record/@id)[1]', 'int') record_id
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') SystemIdle 
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') SQLProcessUtilization
	, EventTime 
	FROM ( 
		SELECT DATEADD(ms, -1 * (@ts_now - [timestamp]), GETDATE()) EventTime
		, [timestamp]
		, CONVERT(xml, record) AS [record] 
		FROM sys.dm_os_ring_buffers 
		WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR' 
		  AND record LIKE N'%<SystemHealth>%') AS x 
	) AS DAT
ORDER BY DAT.record_id DESC;

If the SQL Server is not busy at all, the SQL Server utilization may tend towards 0%, while the system idle time tends towards 90%. (It's the case of my SQL Server lab.)

CPU Utilization for my home SQL Server lab

Notes:
If the server was restarted within the last 4 hours, then the points will have a gap between two readings corresponding to the downtime interval.
The query is supposed to run also on Linux machines, though the SystemIdle time will be 0. One can thus consider the SQL and non-SQL CPU utilization.

Storing the History

The above query can be run on a regular basis (e.g. every 3-4 hours) via an SSIS package that pushes the data into a table for historical purposes. Because a continuous history of the readings is needed, it's better if the gap between runs is smaller than the 4 hours. No matter the approach used, it's better to check for overlaps when storing the data:

-- dropping the table
-- DROP TABLE IF EXISTS dbo.T_RingBufferReadings 

-- reinitializing the history
-- TRUNCATE TABLE dbo.T_RingBufferReadings

-- creating the table
CREATE TABLE dbo.T_RingBufferReadings (
  Id bigint IDENTITY (1,1) NOT NULL
, RecordId bigint 
, EventTime datetime2(3) NOT NULL
, SQLProcessUtilization int NOT NULL
, SystemIdle int NOT NULL
, OtherUtilization int NOT NULL
)


-- reviewing the data
SELECT *
FROM dbo.T_RingBufferReadings 
ORDER BY EventTime DESC

If there are many records, one can also create an index to improve the performance, which can include the reading points as well:

-- creating a unique index with an include 
CREATE UNIQUE NONCLUSTERED INDEX [UI_T_RingBufferReadings_EventTime] ON dbo.T_RingBufferReadings
(
	EventTime ASC,
    RecordId ASC
) INCLUDE (SQLProcessUtilization, SystemIdle, OtherUtilization)
GO

The above query based on the DMV becomes:

-- cpu utilization by SQL Server and other applications
DECLARE @ts_now bigint = (SELECT cpu_ticks/(cpu_ticks/ms_ticks)
        FROM sys.dm_os_sys_info); 

INSERT INTO dbo.T_RingBufferReadings
SELECT record_id
, DAT.EventTime
, DAT.SQLProcessUtilization 
, DAT.SystemIdle 
, 100 - (DAT.SystemIdle + DAT.SQLProcessUtilization) OtherUtilization
FROM ( 
	SELECT record.value('(./Record/@id)[1]', 'int') record_id
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') SystemIdle 
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') SQLProcessUtilization
	, EventTime 
	FROM ( 
		SELECT DATEADD(ms, -1 * (@ts_now - [timestamp]), GETDATE()) EventTime
		, [timestamp]
		, CONVERT(xml, record) AS [record] 
		FROM sys.dm_os_ring_buffers 
		WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR' 
		  AND record LIKE N'%<SystemHealth>%') AS x 
	) AS DAT
	LEFT JOIN dbo.T_RingBufferReadings RBR
	  ON DAT.record_id = RBR.Recordid 
WHERE RBR.Recordid IS NULL
ORDER BY DAT.record_id DESC;

Notes:
A ServerName column can be added to the table if it's needed to store the values for different SQL Servers. Then the LEFT JOIN has to consider the newly added column.
Either of the two queries can be used to display the data points within a chart via SSRS, Power BI or any other reporting tool available (see also the aggregation sketch below).
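
For charting over longer periods, the stored readings can be aggregated, e.g. by hour, as in the following sketch (based on the T_RingBufferReadings table defined above):

-- average CPU utilization by hour (sketch)
SELECT CONVERT(date, RBR.EventTime) [date]
, DATEPART(hour, RBR.EventTime) [hour]
, AVG(1.0*RBR.SQLProcessUtilization) avg_sql_utilization
, AVG(1.0*RBR.OtherUtilization) avg_other_utilization
, AVG(1.0*RBR.SystemIdle) avg_system_idle
FROM dbo.T_RingBufferReadings RBR
GROUP BY CONVERT(date, RBR.EventTime)
, DATEPART(hour, RBR.EventTime)
ORDER BY [date]
, [hour]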

Happy coding!

References:
[1] Grant Fritchey (2014) SQL Server Query Performance Tuning: Troubleshoot and Optimize Query Performance in SQL Server 2014, 4th Ed.
[2] Sunil Agarwal et al (2005), Troubleshooting Performance Problems in SQL Server 2005, Source: TShootPerfProbs.docx
[3] Sunil Agarwal et al (2008), Troubleshooting Performance Problems in SQL Server 2008, Source: TShootPerfProbs2008.docx
[4] Microsoft SQL Docs (2018) Related Dynamic Management Views, Source
[5] Microsoft SQL Docs (2017) Use ring buffers to obtain health information about Always On availability groups, Source

