
17 March 2024

Business Intelligence: Data Products (Part II: The Complexity Challenge)

Business Intelligence
Business Intelligence Series

Creating data products within a data mesh comes down to "partitioning" a given set of inputs, outputs and transformations to create something that resembles a Lego structure, in which each Lego piece represents a data product. The word partition is used improperly, as there can be overlap in terms of inputs, outputs and transformations, though in an ideal solution the outcome should be close to a partition.

Even if the complexity of the inputs and outputs can be neglected, no matter how big their number, the same cannot be said about the transformations that must be performed in the process. Moreover, the transformations involve reengineering the logic built into the source systems, which is not a trivial task and must involve adequate testing. The transformations are a must and there's no way to avoid them.

When designing a data warehouse or data mart, one of the goals is to keep the redundancy of the transformations and of the intermediary results to a minimum, to minimize the unnecessary duplication of code and data. Code duplication usually becomes an issue when the logic needs to be changed, and in business contexts that can happen often enough to create further challenges. Data duplication becomes an issue when the copies are not in sync, which results from code that is not synchronized or from different refresh rates.

Building the transformations as SQL-based database objects has its advantages. There were many attempts at providing non-SQL operators for the same purpose (in SSIS, Power Query), though the solutions built on them are difficult to troubleshoot and maintain, the overall complexity increasing with the volume of transformations that must be performed. In data meshes, the complexity also increases with the number of data products involved, especially when there are multiple stakeholders and different goals involved (see the challenges of developing data marts that are supposed to be domain-specific).

Organizations answer growing complexity with more complexity. On one side stand the teams of developers, business users and other members of the governance teams, who together with the solution create an ecosystem. On the other side stand the inherent coordination and organization meetings, the management of proposals, the negotiation of the scope of the data products, their design, testing, etc. The more complex the whole ecosystem becomes, the higher the chances for systemic errors to occur and multiply, respectively to create unwanted behavior in the parties involved. Ecosystems are challenging to monitor and manage.

The more complex the architecture, the higher the chances of failure. Even if some organizations might succeed, it doesn't mean that such an endeavor is for everybody - a certain maturity in building data architectures, data-based artefacts and managing projects must exist in the organization. Many organizations fail at addressing basic analytical requirements, so why would one think that they are capable of handling increased complexity? Even if one breaks the complexity of a data warehouse into more manageable units, the complexity is just moved to other levels that are more difficult to manage as a whole.

Being able to audit and test each data product individually has its advantages, though when a data product becomes part of an aggregate it can easily get lost in the bigger picture. Thus, a global observability framework is needed that allows monitoring the performance and health of each data product within the aggregate. Besides that, event brokers and other mechanisms are needed to handle failure, availability, security, etc.

Data products make sense in certain scenarios, especially when the complexity of architectures is manageable, though attempting to redesign everything from their perspective is like having a hammer in one's hand and treating everything like a nail.


08 December 2023

SQL Reloaded: Monitoring the Microsoft Fabric Warehouse I

While exploring the Microsoft Fabric Warehouse this week, I observed that there are three views available under the queryinsights schema: exec_requests_history, frequently_run_queries and long_running_queries. According to their definitions, they are based on two database objects, fabric_query_starting and fabric_query_completed, that cannot be called directly.

Announced in the Nov-2023 update, the Query Insights (QI) feature is a "scalable, sustainable, and extendable solution to enhance the SQL analytics experience" (see Microsoft's documentation).

Strangely, the three views appear in the Model view together with the objects defined in the dbo or other user-defined schemas. One can hide the queryinsights objects, however doing this in each model is impractical, especially as the number of objects defined in the respective schema will increase over time. On the other side, the respective objects might be useful for building a report that visualizes the queries' performance. (Probably, a multi-model solution and/or further settings will allow more flexibility.)

Secondly, the objects from the queryinsights schema are not available via the sys.objects catalog view:

SELECT top 10 *
FROM sys.objects 
WHERE name LIKE 'fabric%'
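
As a cross-check, one can also look whether the schema and its objects are exposed via other catalog views. The following is only a sketch, assuming that sys.schemas and sys.all_objects behave in Fabric as in standard SQL Server:

-- checking the queryinsights schema and its objects in the catalog (sketch)
SELECT s.name schema_name
, o.name object_name
, o.type_desc
FROM sys.schemas s
     LEFT JOIN sys.all_objects o
       ON s.schema_id = o.schema_id
WHERE s.name = 'queryinsights'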

The exec_requests_history view (similar to the dm_exec_requests_history DMV from the standard SQL Server) references several DMVs, and it would be useful to retrieve the corresponding information within the same query:

 -- fabric warehouse
 SELECT erh.distributed_statement_id
, erh.start_time
, erh.end_time
, erh.total_elapsed_time_ms
--, erh.login_name
, erh.row_count
, erh.status
, erh.session_id
, erh.connection_id
, erh.program_name
, erh.batch_id
, erh.root_batch_id
, erh.query_hash
, erh.command 
FROM queryinsights.exec_requests_history erh
WHERE status = 'Succeeded'--'Failed'
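
The same view can also be aggregated to spot the statements run most often or taking the longest. The following is only a sketch based on the columns listed above; the frequently_run_queries and long_running_queries views provide similar aggregations out of the box, so it's mainly useful when a different grouping or filter is needed:

-- most expensive statements by total elapsed time (sketch)
SELECT TOP (20) erh.query_hash
, MIN(erh.command) sample_command
, COUNT(*) no_runs
, SUM(erh.total_elapsed_time_ms) total_elapsed_time_ms
, AVG(erh.total_elapsed_time_ms) avg_elapsed_time_ms
, SUM(erh.row_count) total_row_count
, MIN(erh.start_time) first_run
, MAX(erh.start_time) last_run
FROM queryinsights.exec_requests_history erh
WHERE erh.status = 'Succeeded'
GROUP BY erh.query_hash
ORDER BY total_elapsed_time_ms DESC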

However, attempting to retrieve session information via the sys.dm_exec_sessions DMV leads to the below error message:

SELECT *
FROM queryinsights.exec_requests_history erh
     LEFT JOIN sys.dm_exec_sessions ses
       ON erh.session_id = ses.session_id
WHERE erh.status = 'Succeeded'--'Failed'

The query references an object that is not supported in distributed processing mode.
Msg 15816, Level 16, State 7, Code line 11

Using the standard SQL Server system functions seems to work, as long as the view from the queryinsights schema is not involved:
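
For example, a standalone check along these lines (a minimal sketch, assuming the functions behave in Fabric as in standard SQL Server) should work:

-- standard system functions called on their own (sketch)
SELECT db_name() [database_name]
, schema_name() [default_schema]
, user_name() [user_name]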

According to the documentation, "Some objects, like system views, and functions can't be used while you query data stored in Azure Data Lake or Azure Cosmos DB analytical storage. Avoid using the queries that join external data with system views, load external data in a temp table, or use some security or metadata functions to filter external data."

One can presume thus that fabric_query_starting and fabric_query_completed are stored in the Data Lake and behave like standard user-defined tables. Unfortunately, no documentation seems to be available on this.

I tried using a temporary table, as advised above:

-- dropping the temp table
--DROP TABLE IF EXISTS dbo.#requests_history;

-- create the temp table
CREATE TABLE dbo.#requests_history (
  distributed_statement_id uniqueidentifier
, start_time datetime2(6)
, end_time datetime2(6)
, total_elapsed_time_ms bigint
, login_name varchar(255)
, row_count bigint
, status varchar(50)
, session_id bigint
, connection_id uniqueidentifier
, program_name varchar(255)
, batch_id uniqueidentifier
, root_batch_id uniqueidentifier
, query_hash uniqueidentifier
, command varchar(max)
)

-- inserting a few records
INSERT INTO dbo.#requests_history
SELECT erh.distributed_statement_id
, erh.start_time
, erh.end_time
, erh.total_elapsed_time_ms
, erh.login_name
, erh.row_count
, erh.status
, erh.session_id
, erh.connection_id
, erh.program_name
, erh.batch_id
, erh.root_batch_id
, erh.query_hash
, erh.command 
FROM queryinsights.exec_requests_history erh;

-- retrieve the inserted records
SELECT *
FROM dbo.#requests_history;

Unfortunately, the attempt led to the same error message. Investigating the issue further, I arrived at a known issue: "Temp table usage in Data Warehouse and SQL analytics endpoint". Hopefully, the fix will address this scenario as well. Otherwise, it might be easier to import the data into another solution (e.g. Power BI) and do the analysis there.

Using temporary tables with DMVs seems to work (see post).
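
For example, here's a sketch following the same pattern as above, assuming sys.dm_exec_sessions can be queried on its own in the warehouse:

-- loading a DMV into a temporary table (sketch)
CREATE TABLE dbo.#sessions (
  session_id int
, login_name varchar(255)
, program_name varchar(255)
, login_time datetime2(6)
)

INSERT INTO dbo.#sessions
SELECT ses.session_id
, ses.login_name
, ses.program_name
, ses.login_time
FROM sys.dm_exec_sessions ses;

-- retrieve the inserted records
SELECT *
FROM dbo.#sessions;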

Please note that the values are case sensitive and only a subset of the standard data types are supported (see documentation).

You might want to check also the queries from the SQL Server System Catalog.

Happy coding!

20 January 2023

SQL Reloaded: Monitoring the Synapse serverless SQL pool with Dynamic Management Views II

Identifying the SQL Server DMVs accessible for the serverless SQL pool (see the previous post) allowed me to identify, besides sys.dm_exec_requests_history, three more DMVs with statistics on the statements executed on the server: sys.dm_request_phases, sys.dm_request_phases_task_group_stats and sys.dm_request_phases_exec_task_stats. Unfortunately, there seems to be no documentation available on these DMVs and, at the time this post was written, no further hits on google.com or bing.com could be found on them either.

sys.dm_request_phases

sys.dm_request_phases provides insight into the phases an execution statement goes through, and seems to summarize the other two views:

-- Azure Serverless SQL pool: request phases
SELECT TOP (100) dist_statement_id
, RPH.dist_request_id
, TRY_CAST(RPH.id as bigint) id
, TRY_CAST(RPH.parent_ids as bigint) parent_ids
, RPH.start_time
, RPH.end_time
--, RPH.total_elapsed_time_ms
--, RPH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
--, RPH.min_time_ms
--, RPH.min_time_ms/1000.0 min_time_sec
--, RPH.max_time_ms
--, RPH.max_time_ms/1000.0 max_time_sec
--, RPH.avg_time_ms
, RPH.avg_time_ms/1000.0 avg_time_sec
--, RPH.stdev_time_ms
--, RPH.stdev_time_ms/1000.0 stdev_time_sec -- it has no values
--, RPH.min_rows
--, RPH.max_rows
--, RPH.avg_rows
--, RPH.stdev_rows -- it has no values
, RPH.total_rows
--, RPH.total_bytes_processed
, RPH.total_bytes_processed/1028.0 total_kb_processed
, RPH.state_desc
, RPH.operation_type
, RPH.input_dop
, RPH.output_dop
, RPH.task_retries
, RPH.error_id
FROM sys.dm_request_phases RPH
ORDER BY Id
dist_statement_id | dist_request_id | id | parent_ids | start_time | end_time | avg_time_sec | total_rows | total_kb_processed | state_desc | operation_type | input_dop | output_dop | task_retries | error_id
8C4386DC... | 820E9FC6... | 1 | 2 | ...09:58:34.213 | ...09:58:36.337 | 1.031 | 2030 | 2343.310311 | succeeded | Shuffle | 1 | 1 | 0 | 0
8C4386DC... | 820E9FC6... | 2 | 0 | ...09:58:36.447 | ...09:58:39.713 | 1.891 | 9 | 7145.193579 | succeeded | Return | 1 | 1 | 0 | 0
C9524971... | 680DCB55... | 3 | 4 | ...10:05:46.747 | ...10:05:47.057 | 0.203 | 2030 | 2343.310311 | succeeded | Shuffle | 1 | 1 | 0 | 0
C9524971... | 680DCB55... | 4 | 0 | ...10:05:47.057 | ...10:05:48.480 | 1.406 | 0 | 6630.101167 | succeeded | Return | 1 | 1 | 0 | 0
FD2D17AD... | C9453EF2... | 5 | 6 | ...11:58:54.060 | ...11:58:55.297 | 0.547 | 10 | 1534.098249 | succeeded | ComputeToControlNode | 1 | 1 | 0 | 0
FD2D17AD... | C9453EF2... | 6 | 0 | ...11:58:55.297 | ...11:58:55.420 | 0.125 | 10 | 4.074902 | succeeded | Return | 1 | 1 | 0 | 0
9FB0A268... | CAA533DE... | 7 | 8 | ...11:59:16.483 | ...11:59:16.700 | 0.203 | 2030 | 2343.310311 | succeeded | Shuffle | 1 | 1 | 0 | 0
9FB0A268... | CAA533DE... | 8 | 0 | ...11:59:16.700 | ...11:59:18.640 | 1.922 | 6 | 7143.673151 | succeeded | Return | 1 | 1 | 0 | 0
1732AB0D... | AC1A4F10... | 9 | 10 | ...11:59:25.950 | ...11:59:26.140 | 0.172 | 2030 | 2343.310311 | succeeded | Shuffle | 1 | 1 | 0 | 0
1732AB0D... | AC1A4F10... | 10 | 0 | ...11:59:26.140 | ...11:59:27.450 | 1.297 | 9 | 6635.185797 | succeeded | Return | 1 | 1 | 0 | 0

Notes:
1) The foreign keys and dates (in the above and below queries) were truncated to accommodate all the important attributes in the snapshot of the values returned.
2) Based on the existing queries, there are two records for each executed statement, a Shuffle or ComputeToControlNode followed by a Return (see operation_type). In more complex scenarios there are several Shuffles and Broadcasts followed by a Return. According to the Microsoft team, even if there's no Data Movement Service (DMS) for serverless SQL pools, a similar algorithm is responsible for moving the data between the nodes.
3) Because in the serverless SQL pool each query has its own distributed statement id, the min, max, avg and total columns hold the same values. Therefore, the columns with redundant values were commented out.
4) The id of the request phase seems to have numeric values despite being defined as alphanumeric. I cast the values to bigint for sorting purposes (see also the self-join sketch following these notes).
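
To make the parent-child relationship between the phases visible, one can self-join the DMV on the casted ids. The following is a sketch that assumes parent_ids holds a single id, as in the sample output above:

-- Azure Serverless SQL pool: linking each phase to its parent phase (sketch)
SELECT RPH.dist_statement_id
, TRY_CAST(RPH.id as bigint) id
, RPH.operation_type
, RPH.total_rows
, TRY_CAST(PAR.id as bigint) parent_id
, PAR.operation_type parent_operation_type
FROM sys.dm_request_phases RPH
     LEFT JOIN sys.dm_request_phases PAR
       ON RPH.dist_request_id = PAR.dist_request_id
      AND TRY_CAST(RPH.parent_ids as bigint) = TRY_CAST(PAR.id as bigint)
ORDER BY TRY_CAST(RPH.id as bigint)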

sys.dm_request_phases_task_group_stats

sys.dm_request_phases_task_group_stats stores metadata about the requests' breakdown at task group level:

-- Azure Serverless SQL pool: request phases breakdown at task group
SELECT TOP (100) RPT.dist_request_id
, TRY_CAST(RPT.id as bigint) id
, TRY_CAST(RPT.parent_ids as bigint) parent_ids
, RPT.dist_statement_id
, RPT.state_desc
, RPT.start_time
, RPT.end_time
, RPT.input_dop
, RPT.output_dop
, RPT.operation_type
, RPT.task_retries
FROM sys.dm_request_phases_task_group_stats RPT
ORDER BY id
dist_request_id | id | parent_ids | dist_statement_id | state_desc | start_time | end_time | input_dop | output_dop | operation_type | task_retries
820E9FC6... | 1 | 2 | 8C4386DC... | succeeded | 638098055142132551 | 638098055163382693 | 1 | 1 | Shuffle | 0
820E9FC6... | 2 | 0 | 8C4386DC... | succeeded | 638098055164476163 | 638098055197133001 | 1 | 1 | Return | 0
680DCB55... | 3 | 4 | C9524971... | succeeded | 638098059467450021 | 638098059470574953 | 1 | 1 | Shuffle | 0
680DCB55... | 4 | 0 | C9524971... | succeeded | 638098059470574953 | 638098059484793682 | 1 | 1 | Return | 0
C9453EF2... | 5 | 6 | FD2D17AD... | succeeded | 638098127340607112 | 638098127352951067 | 1 | 1 | ComputeToControlNode | 0
C9453EF2... | 6 | 0 | FD2D17AD... | succeeded | 638098127352951067 | 638098127354202970 | 1 | 1 | Return | 0
CAA533DE... | 7 | 8 | 9FB0A268... | succeeded | 638098127564826084 | 638098127567013504 | 1 | 1 | Shuffle | 0
CAA533DE... | 8 | 0 | 9FB0A268... | succeeded | 638098127567013504 | 638098127586388549 | 1 | 1 | Return | 0
AC1A4F10... | 9 | 10 | 1732AB0D... | succeeded | 638098127659513620 | 638098127661388514 | 1 | 1 | Shuffle | 0
AC1A4F10... | 10 | 0 | 1732AB0D... | succeeded | 638098127661388514 | 638098127674513601 | 1 | 1 | Return | 0

Notes:
1) The DMV seems to return the same number of records as sys.dm_request_phases.
2) Observe the format of the start_time and end_time; probably the timestamps come from the Spark cluster and were not translated into a SQL Server data type (see the conversion sketch following these notes).
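
Assuming that the values are .NET-style ticks (100-nanosecond intervals counted from 0001-01-01) - an assumption not confirmed by any documentation - they can be translated into datetimes as in the sketch below, where 621355968000000000 is the number of ticks corresponding to 1970-01-01. The first record above translates this way to a start time of 09:58:34, which matches the corresponding value in sys.dm_request_phases:

-- Azure Serverless SQL pool: translating the (presumed) tick values into datetimes (sketch)
SELECT RPT.dist_request_id
, TRY_CAST(RPT.id as bigint) id
, RPT.start_time
, DATEADD(second, CAST((TRY_CAST(RPT.start_time as bigint) - 621355968000000000)/10000000 as int), CONVERT(datetime2(3), '1970-01-01')) start_time_utc -- truncated to seconds
, DATEADD(second, CAST((TRY_CAST(RPT.end_time as bigint) - 621355968000000000)/10000000 as int), CONVERT(datetime2(3), '1970-01-01')) end_time_utc -- truncated to seconds
FROM sys.dm_request_phases_task_group_stats RPT
ORDER BY id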

sys.dm_request_phases_exec_task_stats

sys.dm_request_phases_exec_task_stats stores metadata about the requests' breakdown at task level:

-- Azure Serverless SQL pool: request phases breakdown at task
SELECT TOP (100) RPE.dist_request_id
, TRY_CAST(RPE.id as bigint) id
--, RPE.min_time_ms
--, RPE.max_time_ms
, RPE.avg_time_ms/1000.0 avg_time_sec
--, RPE.stdev_time_ms
, RPE.total_bytes_processed
--, RPE.min_rows
--, RPE.max_rows
--, RPE.avg_rows
--, RPE.stdev_rows
, RPE.total_rows
, RPE.error_id
FROM sys.dm_request_phases_exec_task_stats RPE
ORDER BY id
dist_request_id | id | avg_time_sec | total_kb_processed | total_rows | error_id
820E9FC6... | 1 | 1.031 | 2343.310311 | 2030 | 0
820E9FC6... | 2 | 1.891 | 7145.193579 | 9 | 0
680DCB55... | 3 | 0.203 | 2343.310311 | 2030 | 0
680DCB55... | 4 | 1.406 | 6630.101167 | 0 | 0
C9453EF2... | 5 | 0.547 | 1534.098249 | 10 | 0
C9453EF2... | 6 | 0.125 | 4.074902 | 10 | 0
CAA533DE... | 7 | 0.203 | 2343.310311 | 2030 | 0
CAA533DE... | 8 | 1.922 | 7143.673151 | 6 | 0
AC1A4F10... | 9 | 0.172 | 2343.310311 | 2030 | 0
AC1A4F10... | 10 | 1.297 | 6635.185797 | 9 | 0

What does all this mean?

The lack of documentation makes it challenging to interpret the values of the views beyond the data and metadata they offer. In a paper on POLARIS, the code name given to the serverless SQL pool engine, a task is defined as "a careful packaging of data and query processing into units [...] that can be readily moved across compute nodes and re-started at the task level" [1]. Therefore, one can assume that this is the level targeted by the sys.dm_request_phases_exec_task_stats DMV. Further on, the tasks are grouped at phase level according to sys.dm_request_phases_task_group_stats, the metadata from the two DMVs being further combined into the sys.dm_request_phases DMV.

If the meaning is kept from dedicated SQL pools, a Shuffle operation indicates that data is moved between the frontend and backend nodes to satisfy a request, while a Return represents the operation of returning the result set to the client. The "ComputeToControlNode" operation involves a simple select (e.g. SELECT TOP 10) from a CETAS table and therefore no "Shuffle" is needed.

Requests' history

Further on, one can use the distributed statement id to join the execution request phases with the requests' history, however matches will be found only for a small subset of the records (probably the executions since the pool was started):
 
-- Azure Serverless SQL pool: requests history with request phase info
SELECT top 100 ERH.status
, ERH.transaction_Id
, ERH.distributed_statement_Id 
, ERH.query_hash 
--, ERH.login_name 
, ERH.start_time
, ERH.end_time 
, ERH.command 
, ERH.query_text 
--, ERH.total_elapsed_time_ms
, ERH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
--, ERH.data_processed_mb
, ERH.data_processed_mb
, RPH.avg_time_ms/1000.0 avg_time_sec
, RPH.total_rows
, RPH.total_bytes_processed/1028.0/1028.0 total_mb_processed
, RPH.state_desc
, RPH.operation_type
, RPH.input_dop
, RPH.output_dop
, RPH.task_retries
, RPH.error_id
, ERH.error
, ERH.error_code 
FROM sys.dm_exec_requests_history ERH
     JOIN sys.dm_request_phases RPH
	   ON ERH.distributed_statement_Id = RPH.dist_statement_id
	  --AND RPH.parent_ids = 0 -- only the parent
ORDER BY RPH.Id DESC

Here's a subset of the result set focusing only on the statistical values:
 
distr_statement_Id | start_time | end_time | total_elapsed_time_sec | data_processed_mb | avg_time_sec | total_rows | total_mb_processed | operation_type | id | parent_ids
{8C4386D... | ...8:24.4300000 | ...8:39.8266666 | 15.396 | 10 | 1.031 | 2030 | 2.279484738326 | Shuffle | 1 | 2
{8C4386D... | ...8:24.4300000 | ...8:39.8266666 | 15.396 | 10 | 1.891 | 9 | 6.950577411478 | Return | 2 | 0
{C952497... | ...5:45.2100000 | ...5:48.4933333 | 3.283 | 10 | 0.203 | 2030 | 2.279484738326 | Shuffle | 3 | 4
{C952497... | ...5:45.2100000 | ...5:48.4933333 | 3.283 | 10 | 1.406 | 0 | 6.449514753891 | Return | 4 | 0
{FD2D17A... | ...8:52.1400000 | ...8:55.4166666 | 3.276 | 10 | 0.547 | 10 | 1.492313471789 | ComputeToControlNode | 5 | 6
{FD2D17A... | ...8:52.1400000 | ...8:55.4166666 | 3.276 | 10 | 0.125 | 10 | 0.003963912451 | Return | 6 | 0
{9FB0A26... | ...9:15.1300000 | ...9:18.6366666 | 3.506 | 10 | 0.203 | 2030 | 2.279484738326 | Shuffle | 7 | 8
{9FB0A26... | ...9:15.1300000 | ...9:18.6366666 | 3.506 | 10 | 1.922 | 6 | 6.949098395914 | Return | 8 | 0
{1732AB0... | ...9:24.6900000 | ...9:27.4500000 | 2.76 | 10 | 0.172 | 2030 | 2.279484738326 | Shuffle | 9 | 10
{1732AB0... | ...9:24.6900000 | ...9:27.4500000 | 2.76 | 10 | 1.297 | 9 | 6.454460892023 | Return | 10 | 0

Notes:
As can be seen, the volume of data processed and the elapsed time values don't match between the two views, though they are close. The differences probably result from further steps occurring in the process.
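
To quantify the differences, the phases can be aggregated per statement and compared with the totals from the requests' history, as in the following sketch (reusing the columns from above):

-- Azure Serverless SQL pool: phase totals vs. requests history (sketch)
SELECT ERH.distributed_statement_Id
, ERH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
, ERH.data_processed_mb
, SUM(RPH.avg_time_ms)/1000.0 phases_time_sec
, SUM(RPH.total_bytes_processed)/1028.0/1028.0 phases_mb_processed
FROM sys.dm_exec_requests_history ERH
     JOIN sys.dm_request_phases RPH
       ON ERH.distributed_statement_Id = RPH.dist_statement_id
GROUP BY ERH.distributed_statement_Id
, ERH.total_elapsed_time_ms
, ERH.data_processed_mb
ORDER BY ERH.total_elapsed_time_ms DESC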

Happy coding!

References:
[1] Josep Aguilar-Saborit, Raghu Ramakrishnan et al, "POLARIS: The Distributed SQL Engine in Azure Synapse", VLDB Conferences. PVLDB, 13(12): 3204 – 3216, 2020, DOI: https://doi.org/10.14778/3415478.3415545

15 January 2023

SQL Reloaded: Monitoring the Synapse serverless SQL pool with Dynamic Management Views I

I sometimes feel like I'm flying blind when I build or troubleshoot SQL queries and don't have the query plan and/or further statistics to understand how the database engine works, why some queries take longer than expected, etc. Unfortunately, the Synapse serverless SQL pool doesn't seem to support showing execution plans in SQL Server Management Studio as of now (SHOWPLAN_XML is not supported for SET). I looked at my old queries based on the sys.dm_exec_requests and sys.dm_exec_query_stats DMVs, however the results didn't prove to be what I was searching for.

This weekend, I found Sidney Cirqueira's post on monitoring Synapse serverless SQL pools where he describes how to do that via the Monitoring hub, DMVs, QPI library, respectively Log Analytics. (You should check regularly the Azure Synapse Analytics Blog as it's full of goodies!)

Thus, I found out that there's a new DMV called sys.dm_exec_requests_history which provides at least the duration and the volume of data processed by each statement run on the service:

-- Azure Serverless SQL pool: requests' history 
SELECT top 100 ERH.status
, ERH.transaction_Id
, ERH.distributed_statement_Id 
, ERH.query_hash 
, ERH.login_name 
, ERH.start_time
, ERH.end_time 
, ERH.command 
, ERH.query_text 
--, ERH.total_elapsed_time_ms
, ERH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
--, ERH.data_processed_mb
, ERH.data_processed_mb/1028.0 data_processed_gb
, ERH.error
, ERH.error_code 
FROM sys.dm_exec_requests_history ERH
ORDER BY ERH.data_processed_mb DESC

It isn't much information compared with the columns returned by sys.dm_exec_requests, but it's something to start with. At least it allows focusing on the queries with the longest duration (use the above query, sorting the records by total_elapsed_time_ms descending) or with the highest volume of data processed:

-- Azure Serverless SQL pool: queries with most data processed
SELECT TOP 50 ERH.query_text  
, COUNT(*) no_runs
, SUM(ERH.total_elapsed_time_ms) total_elapsed_time_ms
, SUM(ERH.data_processed_mb) data_processed_mb
, SUM(ERH.data_processed_mb/1028.0) data_processed_gb
, MIN(ERH.start_time) first_run_date
, MAX(ERH.start_time) last_run_date
FROM sys.dm_exec_requests_history ERH
GROUP BY ERH.query_text
HAVING COUNT(*)>1
ORDER BY data_processed_mb DESC

The same query can be slightly changed to retrieve the volume of data processed by month:
 
-- Azure Serverless SQL pool: data processed by month
SELECT Convert(nvarchar(7), ERH.start_time, 23) [period]
, COUNT(*) no_runs
, SUM(ERH.total_elapsed_time_ms) total_elapsed_time_ms
, SUM(ERH.data_processed_mb) data_processed_mb
, SUM(ERH.data_processed_mb/1028.0) data_processed_gb
, MIN(ERH.start_time) first_run_date
, MAX(ERH.start_time) last_run_date
FROM sys.dm_exec_requests_history ERH
GROUP BY Convert(nvarchar(7), ERH.start_time, 23)
HAVING COUNT(*)>1
ORDER BY data_processed_mb DESC

One can also add the login name to the grouping to break down the analysis by the login that issued the query. The organization's domain can be used to differentiate between system and organization-based queries, as in the sketch below.
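
In the following sketch, 'contoso.com' is only a placeholder for the organization's domain:

-- Azure Serverless SQL pool: data processed by login (sketch)
SELECT ERH.login_name
, CASE WHEN ERH.login_name LIKE '%@contoso.com' THEN 'organization' ELSE 'system' END login_type
, COUNT(*) no_runs
, SUM(ERH.total_elapsed_time_ms) total_elapsed_time_ms
, SUM(ERH.data_processed_mb) data_processed_mb
, SUM(ERH.data_processed_mb/1028.0) data_processed_gb
FROM sys.dm_exec_requests_history ERH
GROUP BY ERH.login_name
, CASE WHEN ERH.login_name LIKE '%@contoso.com' THEN 'organization' ELSE 'system' END
ORDER BY data_processed_mb DESC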

The volume of data processed is stored also in the sys.dm_external_data_processed DMV aggregated for the current day, week, respectively month as part of the cost control related feature:
 
-- Azure Serverless SQL pool: volume of data processed
SELECT type 
, data_processed_mb 
, data_processed_mb/1028.0 data_processed_gb
FROM sys.dm_external_data_processed

And here's what the output looks like:
 
type | data_processed_mb | data_processed_gb
daily | 230 | 0.223735
weekly | 377 | 0.366731
monthly | 223522 | 217.433852

Notes:
1) I still need to play with the DMVs to understand their scope and limitations.
2) The view also appears in the list of DMVs I identified as supported by the Synapse serverless SQL pool. As I discovered later, 3 more DMVs are available with useful statistics.
3) The queries based on the sys.dm_exec_requests and sys.dm_exec_query_stats DMVs seem to return only the query currently running, i.e. the query based on them. (So the DMVs actually seem to work.)
4) The view is also available in SQL Server 2022, though it doesn't seem to be used there.
5) According to the above-mentioned source, the view is provided for ticketing purposes, to help customers better troubleshoot SQL requests. Use the distributed_statement_id in the tickets raised with Microsoft to troubleshoot any issues with Synapse.
6) Unfortunately, the useful Query Store feature is also not yet supported, even if the DMVs related to it seem to be available. Attempting to enable it, e.g. with a statement like the sketch below, results in the error:
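
-- attempting to enable the Query Store (sketch; the database name is a placeholder)
ALTER DATABASE MyServerlessDb
SET QUERY_STORE = ON;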
Msg 15869, Level 16, State 9, Line 1
QUERY_STORE is not supported for ALTER DATABASE

Happy coding!

13 January 2021

SQL Server Administration: Monitoring the Database Logs

One of the aspects to monitor on a SQL Server instance is the size of the logs available for each database, respectively the degree to which the logs are used. Starting with SQL Server 2005 this can be achieved by using the 'Log File(s) Used Size (KB)' and 'Log File(s) Size (KB)' counters via the sys.dm_os_performance_counters DMV as follows:

-- log files - size (kb)
SELECT lfu.instance_name database_name
, lfu.cntr_value size_kb
, Cast(lfu.cntr_value/1024.00 as decimal (18,2)) size_MB
FROM sys.dm_os_performance_counters lfu 
WHERE lfu.counter_name LIKE  'Log File(s) Size (KB)%' 
  AND lfu.object_name LIKE 'SQLServer:Databases%'
  AND lfu.instance_name IN ('tempdb', 'master', 'model', 'msdb')
ORDER BY lfu.instance_name

-- log files - used size (kb)
SELECT lfs.instance_name database_name
, lfs.cntr_value used_size_kb
, Cast(lfs.cntr_value/1024.00 as decimal (18,2)) used_size_MB
FROM sys.dm_os_performance_counters lfs
WHERE lfs.counter_name LIKE  'Log File(s) Used Size (KB)%' 
  AND lfs.object_name LIKE 'SQLServer:Databases%'
  AND lfs.instance_name IN ('tempdb', 'master', 'model', 'msdb')
ORDER BY lfs.instance_name

The two queries can be combined into one as follows:

-- database log space allocation (SQL Server 2005+)
SELECT db.name database_name
, db.log_reuse_wait_desc 
, Cast(lfs.cntr_value/1024.00 as decimal(28,2)) size_MB
, Cast(lfu.cntr_value/1024.00 as decimal(28,2)) AS used_MB
, Cast(100.00*lfu.cntr_value/lfs.cntr_value as decimal(10,2)) used_percent 
, CASE WHEN CAST(lfu.cntr_value AS float) / CAST(lfs.cntr_value AS float) > .5 THEN 
   CASE 
    WHEN db.name = 'tempdb' AND log_reuse_wait_desc NOT IN ('CHECKPOINT', 'NOTHING') THEN 'WARNING'  
    WHEN db.name <> 'tempdb' THEN 'WARNING' 
    ELSE 'OK' 
    END 
  ELSE 'OK' END log_status 
FROM sys.databases db 
     JOIN sys.dm_os_performance_counters lfs 
       ON db.name = lfs.instance_name 
      AND lfs.counter_name LIKE 'Log File(s) Size (KB)%' 
     JOIN sys.dm_os_performance_counters lfu 
       ON db.name = lfu.instance_name 
      AND lfu.counter_name LIKE  'Log File(s) Used Size (KB)%' 
WHERE db.name IN ('tempdb', 'master', 'model', 'msdb')
ORDER BY db.name 

Output:
database_name | log_reuse_wait_desc | size_MB | used_MB | used_percent | log_status
master | NOTHING | 1.99 | 1.11 | 55.78 | WARNING
model | NOTHING | 7.99 | 1.74 | 21.80 | OK
msdb | NOTHING | 28.80 | 1.96 | 6.81 | OK
tempdb | NOTHING | 999.99 | 0.69 | 0.07 | OK

Starting with SQL Server 2012 the same information can be obtained via the sys.dm_db_log_space_usage DMV, however the view returns information only for the current database:

-- getting the log space only for a database (SQL Server 2012+)
SELECT db_name(database_id) database_name 
, Cast(total_log_size_in_bytes/1024.00/1024.00 as decimal(28,2)) size_MB
, Cast(used_log_space_in_bytes/1024.00/1024.00 as decimal(28,2)) used_MB
, Cast(used_log_space_in_percent as decimal(28,2)) used_percent
FROM sys.dm_db_log_space_usage

Output:
database_name | size_MB | used_MB | used_percent
master | 1.99 | 1.19 | 59.61
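
To collect the same figures for all databases with this DMV, one can loop over the databases, e.g. via the undocumented sp_MSforeachdb procedure. The following is only a sketch - the procedure is unsupported and known to occasionally skip databases:

-- collecting the log space usage for all databases (sketch)
-- DROP TABLE IF EXISTS #log_usage;

CREATE TABLE #log_usage (
  database_name sysname
, size_MB decimal(28,2)
, used_MB decimal(28,2)
, used_percent decimal(28,2)
);

EXEC sp_MSforeachdb N'USE [?];
INSERT INTO #log_usage
SELECT db_name(database_id)
, Cast(total_log_size_in_bytes/1024.00/1024.00 as decimal(28,2))
, Cast(used_log_space_in_bytes/1024.00/1024.00 as decimal(28,2))
, Cast(used_log_space_in_percent as decimal(28,2))
FROM sys.dm_db_log_space_usage;';

SELECT *
FROM #log_usage
ORDER BY database_name;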

With less flexibility one can obtain the size in MB and the used percentage by using the DBCC utility as follows:

-- retrieving the log usage for all databases
DBCC SQLPERF(LOGSPACE); 

Output:
Database Name | Log Size (MB) | Log Space Used (%) | Status
master | 1,992188 | 33,13726 | 0
tempdb | 999,9922 | 0,06352589 | 0
model | 7,992188 | 21,11437 | 0
msdb | 28,80469 | 6,617846 | 0
AdventureWorks2014 | 33,99219 | 14,76672 | 0
AdventureWorksDW2014 | 17,99219 | 22,6878 | 0

Notes:
1. All the mentioned objects require VIEW SERVER STATE permissions.
2. The solution based on the performance counters returns slightly different values than the other solutions, though the differences are negligible.

Resources: 
[1] SQL Docs (2017) sys.dm_os_performance_counters [source]
[2] SQL Docs (2017) DBCC SQLPERF [source]
[3] SQL Docs (2017) sys.dm_db_log_space_usage [source]

05 January 2021

ERP Implementations: It’s all about Scope II (Nonfunctional Requirements & MVP)

ERP Implementation

Nonfunctional Requirements

In contrast to functional requirements (FRs), nonfunctional requirements (NFRs) have no direct impact on the system's behavior but affect the end-users' experience with the system, covering topics like performance, usability, reliability, compatibility, security, monitoring, maintainability, testability, respectively other constraints and quality attributes. Even if these requirements are in general addressed by design, the changes made to the system have the potential of impacting the users' experience negatively.

Moreover, the NFRs are usually difficult to quantify, and probably that's why they are seldom made explicit in a formal document or are eventually considered only at a high level. However, one can still find a basis for comparison in compliance requirements, general guidelines, standards, best practices or the legacy system(s) (e.g. the performance should not be worse than in the legacy system, the volume of effort for carrying out the various activities should not increase). Even if they can't be adequately described, it's recommended to list the NFRs in general terms in a formal document (e.g. the implementation contract). Failing to do so can open or widen the risk exposure, especially when the system lacks important support in the respective areas. In addition, these requirements need to be considered during testing and sign-off as well.

Minimum Viable Product (MVP)

Besides considering the gaps with respect to the FRs, it's sometimes important to consider whether the whole functionality is mandatory, especially given the various activities that need to be carried out (parametrization, data migration).

For example, one can target implementing a minimum viable product (MVP) - a version of the product which has just enough features to cover the mandatory or most important FRs. The MVP is based on the idea that implementing about 80% of the needed functionality has in theory the potential of providing a usable product earlier with a minimum of effort (quick wins), assuring that the project's goals and objectives are met, respectively providing a basis for further development. In case of cost overruns, the MVP assures that the business has a workable product and the opportunity of deciding whether it's worth investing more into the project now or later.

The MVP also allows getting early user feedback and integrating it into further enhancements and developments. Often the users understand the capabilities of a system, respectively of an implementation, only when they are able to use the system. As this is a learning process, the learning period can take up to a few months until adequate feedback is available. Therefore, postponing the implementation's continuation by a few months can in theory have a positive impact, however it can also come with drawbacks (e.g. the resources are not available anymore).

A sketch of the MVP usually results from the requirements' prioritization, however the requirements then need to be regarded holistically, as there can be different levels of dependencies between them. In addition, different costs can be incurred if the requirements are handled later, and other constraints may apply as well. Considering an MVP approach can be a double-edged sword. In the worst-case scenario, the business will get only the MVP, with its good and bad characteristics. The business will then be forced to fill the gaps by working outside the system, which can lead to further effort and, in extremis, to poor acceptance of the system. In general, users expect to have their processes fully implemented in the system, an expectation which is not always economically grounded.

After establishing an MVP, one can consider the further requirements (including improvement suggestions) on a cost-benefit basis and implement them accordingly as part of a continuous improvement initiative, even if more time might be required to implement them.




06 July 2020

SQL Server Administration: Undocumented III (SQL Server CPU Utilization via the Ring Buffer)

Introduction

If no proper monitoring solution for the SQL Server and the hosting server is in place to review the CPU utilization, one can use the Scheduler Monitor buffer provided by the undocumented sys.dm_os_ring_buffers dynamic management view (DMV). Introduced with SQL Server 2005, the DMV provides a significant amount of diagnostic memory information in XML form via several buffers: Resource Monitor, Out-of-Memory, Memory Broker, Buffer Pool, respectively the Scheduler Monitor buffer [2]. A ring buffer is a recorded response to a notification [1].

The view changed between the various versions of SQL Server, while further ring buffers related to Always On availability groups were made available as well (see [5]).

Warning:
According to Microsoft (see [4]), sys.dm_os_ring_buffers is provided for information purposes only, and future compatibility beyond SQL Server 2019 is not guaranteed!

Querying the Scheduler Monitor Buffer

Within the Scheduler Monitor buffer, the DMV stores a history of 4 hours of uptime with minute-by-minute data points (in total 256 entries) containing the CPU utilization for the SQL Server, other processes, respectively the system idle time, as percentages. It thus allows identifying the peaks in CPU utilization and determining the intervals to focus on for further troubleshooting. As the data are stored within an XML structure, the values can be queried via the XQuery syntax as follows:

-- cpu utilization for SQL Server and other applications
DECLARE @ts_now bigint = (SELECT cpu_ticks/(cpu_ticks/ms_ticks)
        FROM sys.dm_os_sys_info); 

SELECT DAT.record_id
, DAT.EventTime
, DAT.SQLProcessUtilization 
, DAT.SystemIdle 
, 100 - (DAT.SystemIdle + DAT.SQLProcessUtilization) OtherUtilization
FROM ( 
	SELECT record.value('(./Record/@id)[1]', 'int') record_id
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') SystemIdle 
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') SQLProcessUtilization
	, EventTime 
	FROM ( 
		SELECT DATEADD(ms, -1 * (@ts_now - [timestamp]), GETDATE()) EventTime
		, [timestamp]
		, CONVERT(xml, record) AS [record] 
		FROM sys.dm_os_ring_buffers 
		WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR' 
		  AND record LIKE N'%<SystemHealth>%') AS x 
	) AS DAT
ORDER BY DAT.record_id DESC;

If the SQL Server is not busy at all, the SQL Server utilization may tend towards 0%, while the system idle time tends towards 90%. (It's the case of my SQL Server lab.)

CPU Utilization for my home SQL Server lab

Notes:
If the server was restarted within the last 4 hours, then the points will have a gap between two readings corresponding to the downtime interval.
The query is supposed to run also on Linux machines, though the SystemIdle time will be 0. One can thus consider the SQL and non-SQL CPU utilization.

Storing the History

The above query can be run on a regular basis (e.g. every 3-4 hours) via an SSIS package that pushes the data into a table for historical purposes. Because a continuous history of the readings is needed, it's better if the gap between runs is smaller than the 4 hours. No matter the approach used, it's better to check for overlaps when storing the data:

-- dropping the table
-- DROP TABLE IF EXISTS dbo.T_RingBufferReadings 

-- reinitializing the history
-- TRUNCATE TABLE dbo.T_RingBufferReadings

-- creating the table
CREATE TABLE dbo.T_RingBufferReadings (
  Id bigint IDENTITY (1,1) NOT NULL
, RecordId bigint 
, EventTime datetime2(3) NOT NULL
, SQLProcessUtilization int NOT NULL
, SystemIdle int NOT NULL
, OtherUtilization int NOT NULL
)


-- reviewing the data
SELECT *
FROM dbo.T_RingBufferReadings 
ORDER BY EventTime DESC

If there are many records, one can also create an index to improve the performance, which can include the reading points as well:

-- creating a unique index with an include 
CREATE UNIQUE NONCLUSTERED INDEX [UI_T_RingBufferReadings_EventTime] ON dbo.T_RingBufferReadings
(
	EventTime ASC,
    RecordId ASC
) INCLUDE (SQLProcessUtilization, SystemIdle, OtherUtilization)
GO

The above query based on the DMV becomes:

-- cpu utilization by SQL Server and other applications
DECLARE @ts_now bigint = (SELECT cpu_ticks/(cpu_ticks/ms_ticks)
        FROM sys.dm_os_sys_info); 

INSERT INTO dbo.T_RingBufferReadings
SELECT record_id
, DAT.EventTime
, DAT.SQLProcessUtilization 
, DAT.SystemIdle 
, 100 - (DAT.SystemIdle + DAT.SQLProcessUtilization) OtherUtilization
FROM ( 
	SELECT record.value('(./Record/@id)[1]', 'int') record_id
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') SystemIdle 
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') SQLProcessUtilization
	, EventTime 
	FROM ( 
		SELECT DATEADD(ms, -1 * (@ts_now - [timestamp]), GETDATE()) EventTime
		, [timestamp]
		, CONVERT(xml, record) AS [record] 
		FROM sys.dm_os_ring_buffers 
		WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR' 
		  AND record LIKE N'%<SystemHealth>%') AS x 
	) AS DAT
	LEFT JOIN dbo.T_RingBufferReadings RBR
	  ON DAT.record_id = RBR.Recordid 
WHERE RBR.Recordid IS NULL
ORDER BY DAT.record_id DESC;

Notes:
A ServerName column can be added to the table if it's needed to store the values for different SQL Servers. Then the LEFT JOIN has to consider the newly added column.
Either of the two queries can be used to display the data points within a chart via SSRS, Power BI or any other reporting tool available (see also the aggregation sketch below).
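
For charting over longer periods, the stored readings can be aggregated, e.g. by hour, as in the following sketch (based on the T_RingBufferReadings table defined above):

-- average CPU utilization by hour (sketch)
SELECT CONVERT(date, RBR.EventTime) [date]
, DATEPART(hour, RBR.EventTime) [hour]
, AVG(1.0*RBR.SQLProcessUtilization) avg_sql_utilization
, AVG(1.0*RBR.OtherUtilization) avg_other_utilization
, AVG(1.0*RBR.SystemIdle) avg_system_idle
FROM dbo.T_RingBufferReadings RBR
GROUP BY CONVERT(date, RBR.EventTime)
, DATEPART(hour, RBR.EventTime)
ORDER BY [date]
, [hour]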

Happy coding!

References:
[1] Grant Fritchey (2014) SQL Server Query Performance Tuning: Troubleshoot and Optimize Query Performance in SQL Server 2014, 4th Ed.
[2] Sunil Agarwal et al (2005), Troubleshooting Performance Problems in SQL Server 2005, Source: TShootPerfProbs.docx
[3] Sunil Agarwal et al (2008), Troubleshooting Performance Problems in SQL Server 2008, Source: TShootPerfProbs2008.docx
[4] Microsoft SQL Docs (2018) Related Dynamic Management Views, Source
[5] Microsoft SQL Docs (2017) Use ring buffers to obtain health information about Always On availability groups, Source

