
20 January 2023

SQL Reloaded: Monitoring the Synapse serverless SQL pool with Dynamics Management Views II

Identifying the SQL Server DMVs accessible in the serverless SQL pool (see the previous post) allowed me to identify, besides sys.dm_exec_requests_history, three more DMVs with statistics on the statements executed on the server: sys.dm_request_phases, sys.dm_request_phases_task_group_stats and sys.dm_request_phases_exec_task_stats. Unfortunately, there seems to be no documentation available on these DMVs and, at the time this post was written, searches on google.com and bing.com returned no further hits on them either.

sys.dm_request_phases

sys.dm_request_phases provides insights into the phases an execution statement goes through, and seems to summarize the other two views:

-- Azure Serverless SQL pool: request phases
SELECT TOP (100) dist_statement_id
, RPH.dist_request_id
, TRY_CAST(RPH.id as bigint) id
, TRY_CAST(RPH.parent_ids as bigint) parent_ids
, RPH.start_time
, RPH.end_time
--, RPH.total_elapsed_time_ms
--, RPH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
--, RPH.min_time_ms
--, RPH.min_time_ms/1000.0 min_time_sec
--, RPH.max_time_ms
--, RPH.max_time_ms/1000.0 max_time_sec
--, RPH.avg_time_ms
, RPH.avg_time_ms/1000.0 avg_time_sec
--, RPH.stdev_time_ms
--, RPH.stdev_time_ms/1000.0 stdev_time_sec -- it has no values
--, RPH.min_rows
--, RPH.max_rows
--, RPH.avg_rows
--, RPH.stdev_rows -- it has no values
, RPH.total_rows
--, RPH.total_bytes_processed
, RPH.total_bytes_processed/1024.0 total_kb_processed
, RPH.state_desc
, RPH.operation_type
, RPH.input_dop
, RPH.output_dop
, RPH.task_retries
, RPH.error_id
FROM sys.dm_request_phases RPH
ORDER BY Id
dist_statement_id | dist_request_id | id | parent_ids | start_time | end_time | avg_time_sec | total_rows | total_kb_processed | state_desc | operation_type | input_dop | output_dop | task_retries | error_id
8C4386DC... | 820E9FC6... | 1 | 2 | ...09:58:34.213 | ...09:58:36.337 | 1.031 | 2030 | 2343.310311 | succeeded | Shuffle | 1 | 1 | 0 | 0
8C4386DC... | 820E9FC6... | 2 | 0 | ...09:58:36.447 | ...09:58:39.713 | 1.891 | 9 | 7145.193579 | succeeded | Return | 1 | 1 | 0 | 0
C9524971... | 680DCB55... | 3 | 4 | ...10:05:46.747 | ...10:05:47.057 | 0.203 | 2030 | 2343.310311 | succeeded | Shuffle | 1 | 1 | 0 | 0
C9524971... | 680DCB55... | 4 | 0 | ...10:05:47.057 | ...10:05:48.480 | 1.406 | 0 | 6630.101167 | succeeded | Return | 1 | 1 | 0 | 0
FD2D17AD... | C9453EF2... | 5 | 6 | ...11:58:54.060 | ...11:58:55.297 | 0.547 | 10 | 1534.098249 | succeeded | ComputeToControlNode | 1 | 1 | 0 | 0
FD2D17AD... | C9453EF2... | 6 | 0 | ...11:58:55.297 | ...11:58:55.420 | 0.125 | 10 | 4.074902 | succeeded | Return | 1 | 1 | 0 | 0
9FB0A268... | CAA533DE... | 7 | 8 | ...11:59:16.483 | ...11:59:16.700 | 0.203 | 2030 | 2343.310311 | succeeded | Shuffle | 1 | 1 | 0 | 0
9FB0A268... | CAA533DE... | 8 | 0 | ...11:59:16.700 | ...11:59:18.640 | 1.922 | 6 | 7143.673151 | succeeded | Return | 1 | 1 | 0 | 0
1732AB0D... | AC1A4F10... | 9 | 10 | ...11:59:25.950 | ...11:59:26.140 | 0.172 | 2030 | 2343.310311 | succeeded | Shuffle | 1 | 1 | 0 | 0
1732AB0D... | AC1A4F10... | 10 | 0 | ...11:59:26.140 | ...11:59:27.450 | 1.297 | 9 | 6635.185797 | succeeded | Return | 1 | 1 | 0 | 0

Notes:
1) The foreign keys and dates (in the above and below queries) were truncated to accommodate all the important attributes in the snapshot of the values returned.
2) Based on the existing queries, there are two records for each executed statement, a Shuffle or ComputeToControlNode followed by a Return (see operation_type). In more complex scenarios there are several Shuffles and Broadcasts followed by a Return. According to the Microsoft team, even if there's no Data Movement Service (DMS) for serverless SQL pools, a similar algorithm is responsible for moving the data between the nodes.
3) Because in a serverless SQL pool each query has its own distribution statement id, the min, max, avg and total values have the same values across the columns. Therefore, the columns with redundant values were commented out.
4) The id of the request phase seems to have numeric values despite being defined as alphanumeric, hence the values were cast to bigint for sorting purposes.
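
Since the final Return phase of a request seems to have parent_ids = 0, while the intermediate phases reference their parent's id via parent_ids, the phases can be paired via a self-join. Here's a minimal sketch; it assumes that parent_ids holds a single value, while in more complex scenarios it may hold a list, in which case the values would need to be split first:

-- pairing intermediate phases with their parent phase
SELECT CHD.dist_statement_id
, CHD.operation_type child_operation
, PRT.operation_type parent_operation
, CHD.avg_time_ms/1000.0 child_avg_time_sec
, PRT.avg_time_ms/1000.0 parent_avg_time_sec
FROM sys.dm_request_phases CHD
     JOIN sys.dm_request_phases PRT
       ON CHD.dist_request_id = PRT.dist_request_id
      AND TRY_CAST(CHD.parent_ids as bigint) = TRY_CAST(PRT.id as bigint)
ORDER BY TRY_CAST(CHD.id as bigint)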

sys.dm_request_phases_task_group_stats

sys.dm_request_phases_task_group_stats stores metadata about the requests' breakdown at task group level:

-- Azure Serverless SQL pool: request phases breakdown at task group
SELECT TOP (100) RPT.dist_request_id
, TRY_CAST(RPT.id as bigint) id
, TRY_CAST(RPT.parent_ids as bigint) parent_ids
, RPT.dist_statement_id
, RPT.state_desc
, RPT.start_time
, RPT.end_time
, RPT.input_dop
, RPT.output_dop
, RPT.operation_type
, RPT.task_retries
FROM sys.dm_request_phases_task_group_stats RPT
ORDER BY id
dist_request_id | id | parent_ids | dist_statement_id | state_desc | start_time | end_time | input_dop | output_dop | operation_type | task_retries
820E9FC6... | 1 | 2 | 8C4386DC... | succeeded | 638098055142132551 | 638098055163382693 | 1 | 1 | Shuffle | 0
820E9FC6... | 2 | 0 | 8C4386DC... | succeeded | 638098055164476163 | 638098055197133001 | 1 | 1 | Return | 0
680DCB55... | 3 | 4 | C9524971... | succeeded | 638098059467450021 | 638098059470574953 | 1 | 1 | Shuffle | 0
680DCB55... | 4 | 0 | C9524971... | succeeded | 638098059470574953 | 638098059484793682 | 1 | 1 | Return | 0
C9453EF2... | 5 | 6 | FD2D17AD... | succeeded | 638098127340607112 | 638098127352951067 | 1 | 1 | ComputeToControlNode | 0
C9453EF2... | 6 | 0 | FD2D17AD... | succeeded | 638098127352951067 | 638098127354202970 | 1 | 1 | Return | 0
CAA533DE... | 7 | 8 | 9FB0A268... | succeeded | 638098127564826084 | 638098127567013504 | 1 | 1 | Shuffle | 0
CAA533DE... | 8 | 0 | 9FB0A268... | succeeded | 638098127567013504 | 638098127586388549 | 1 | 1 | Return | 0
AC1A4F10... | 9 | 10 | 1732AB0D... | succeeded | 638098127659513620 | 638098127661388514 | 1 | 1 | Shuffle | 0
AC1A4F10... | 10 | 0 | 1732AB0D... | succeeded | 638098127661388514 | 638098127674513601 | 1 | 1 | Return | 0

Notes:
1) The DMV seems to return the same number of records as sys.dm_request_phases.
2) Observe the format of start_time and end_time: the timestamps seem to come as raw values from the underlying engine, not translated into a SQL Server data type.
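
Judging by their magnitude, the values look like .NET-style ticks, i.e. 100-nanosecond units counted from 0001-01-01 (an assumption, as nothing is documented). Under this assumption the values can be translated into a datetime2; for the first record from above, the time portion lines up with the 09:58:34.213 start_time shown by sys.dm_request_phases:

-- translating a raw tick value into a datetime2 (assuming .NET-style ticks)
DECLARE @ticks bigint = 638098055142132551; -- sample start_time from above

SELECT DATEADD(millisecond
     , (@ticks % 600000000)/10000 -- the milliseconds within the minute
     , DATEADD(minute
     , CAST(@ticks/600000000 as int) -- the whole minutes since 0001-01-01
     , CAST('0001-01-01' as datetime2(3)))) translated_time;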

sys.dm_request_phases_exec_task_stats

sys.dm_request_phases_exec_task_stats stores metadata about the requests' breakdown at task level:

-- Azure Serverless SQL pool: request phases breakdown at task
SELECT TOP (100) RPE.dist_request_id
, TRY_CAST(RPE.id as bigint) id
--, RPE.min_time_ms
--, RPE.max_time_ms
, RPE.avg_time_ms/1000.0 avg_time_sec
--, RPE.stdev_time_ms
, RPE.total_bytes_processed/1024.0 total_kb_processed
--, RPE.min_rows
--, RPE.max_rows
--, RPE.avg_rows
--, RPE.stdev_rows
, RPE.total_rows
, RPE.error_id
FROM sys.dm_request_phases_exec_task_stats RPE
ORDER BY id
dist_request_id | id | avg_time_sec | total_kb_processed | total_rows | error_id
820E9FC6... | 1 | 1.031 | 2343.310311 | 2030 | 0
820E9FC6... | 2 | 1.891 | 7145.193579 | 9 | 0
680DCB55... | 3 | 0.203 | 2343.310311 | 2030 | 0
680DCB55... | 4 | 1.406 | 6630.101167 | 0 | 0
C9453EF2... | 5 | 0.547 | 1534.098249 | 10 | 0
C9453EF2... | 6 | 0.125 | 4.074902 | 10 | 0
CAA533DE... | 7 | 0.203 | 2343.310311 | 2030 | 0
CAA533DE... | 8 | 1.922 | 7143.673151 | 6 | 0
AC1A4F10... | 9 | 0.172 | 2343.310311 | 2030 | 0
AC1A4F10... | 10 | 1.297 | 6635.185797 | 9 | 0

What does all this mean?

The lack of documentation makes it challenging to interpret the values of the views beyond the data and metadata they offer. In a paper on POLARIS, the code name given to the serverless SQL pool engine, a task is defined as "a careful packaging of data and query processing into units [...] that can be readily moved across compute nodes and re-started at the task level" [1]. Therefore, one can assume that this is the level targeted by the sys.dm_request_phases_exec_task_stats DMV. Further on, the tasks are grouped at phase level according to sys.dm_request_phases_task_group_stats, the metadata from the two DMVs being further combined into the sys.dm_request_phases DMV.

If the meaning is kept from dedicated SQL pools, a Shuffle operation indicates that data is moved between the frontend and backend nodes to satisfy a request, while a Return represents the operation of returning the result set to the client. The ComputeToControlNode operation involves a simple SELECT (e.g. SELECT TOP 10) from a CETAS table, and therefore no Shuffle is needed.
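
Starting from this interpretation, one can aggregate the phases by operation type to get a rough picture of where the processing time is spent. Here's a simple sketch on top of the DMV:

-- aggregating the request phases by operation type
SELECT RPH.operation_type
, COUNT(*) no_phases
, SUM(RPH.avg_time_ms)/1000.0 total_time_sec
, SUM(RPH.total_rows) total_rows
, SUM(RPH.total_bytes_processed)/1024.0 total_kb_processed
FROM sys.dm_request_phases RPH
GROUP BY RPH.operation_type
ORDER BY total_time_sec DESC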

Requests' history

Further on, one can use the distributed statement id to join the execution request phases with the requests history, though matches will be found only for a small subset of the records (probably the executions since the pool started):
 
-- Azure Serverless SQL pool: requests history with request phase info
SELECT top 100 ERH.status
, ERH.transaction_Id
, ERH.distributed_statement_Id 
, ERH.query_hash 
--, ERH.login_name 
, ERH.start_time
, ERH.end_time 
, ERH.command 
, ERH.query_text 
--, ERH.total_elapsed_time_ms
, ERH.total_elapsed_time_ms/1000.0 total_elapsed_time_sec
--, ERH.data_processed_mb
, ERH.data_processed_mb
, RPH.avg_time_ms/1000.0 avg_time_sec
, RPH.total_rows
, RPH.total_bytes_processed/1024.0/1024.0 total_mb_processed
, RPH.state_desc
, RPH.operation_type
, RPH.input_dop
, RPH.output_dop
, RPH.task_retries
, RPH.error_id
, ERH.error
, ERH.error_code 
FROM sys.dm_exec_requests_history ERH
     JOIN sys.dm_request_phases RPH
	   ON ERH.distributed_statement_Id = RPH.dist_statement_id
	  --AND RPH.parent_ids = 0 -- only the parent
ORDER BY RPH.Id DESC

Here's a subset of the result set focusing only on the statistical values:
 
distr_statement_Id | start_time | end_time | total_elapsed_time_sec | data_processed_mb | avg_time_sec | total_rows | total_mb_processed | operation_type | id | parent_ids
{8C4386D... | ...8:24.4300000 | ...8:39.8266666 | 15.396 | 10 | 1.031 | 2030 | 2.279484738326 | Shuffle | 1 | 2
{8C4386D... | ...8:24.4300000 | ...8:39.8266666 | 15.396 | 10 | 1.891 | 9 | 6.950577411478 | Return | 2 | 0
{C952497... | ...5:45.2100000 | ...5:48.4933333 | 3.283 | 10 | 0.203 | 2030 | 2.279484738326 | Shuffle | 3 | 4
{C952497... | ...5:45.2100000 | ...5:48.4933333 | 3.283 | 10 | 1.406 | 0 | 6.449514753891 | Return | 4 | 0
{FD2D17A... | ...8:52.1400000 | ...8:55.4166666 | 3.276 | 10 | 0.547 | 10 | 1.492313471789 | ComputeToControlNode | 5 | 6
{FD2D17A... | ...8:52.1400000 | ...8:55.4166666 | 3.276 | 10 | 0.125 | 10 | 0.003963912451 | Return | 6 | 0
{9FB0A26... | ...9:15.1300000 | ...9:18.6366666 | 3.506 | 10 | 0.203 | 2030 | 2.279484738326 | Shuffle | 7 | 8
{9FB0A26... | ...9:15.1300000 | ...9:18.6366666 | 3.506 | 10 | 1.922 | 6 | 6.949098395914 | Return | 8 | 0
{1732AB0... | ...9:24.6900000 | ...9:27.4500000 | 2.76 | 10 | 0.172 | 2030 | 2.279484738326 | Shuffle | 9 | 10
{1732AB0... | ...9:24.6900000 | ...9:27.4500000 | 2.76 | 10 | 1.297 | 9 | 6.454460892023 | Return | 10 | 0

Notes:
As can be seen, the volume of data processed and the elapsed time values don't match between the two tables, though they are close. The differences probably result from further steps occurring in the process.

Happy coding!

References:
[1] Josep Aguilar-Saborit, Raghu Ramakrishnan et al. (2020) "POLARIS: The Distributed SQL Engine in Azure Synapse", PVLDB 13(12): 3204-3216, DOI: https://doi.org/10.14778/3415478.3415545

13 January 2021

SQL Server Administration: Undocumented IV (DBCC SQLPERF)

Besides the documented LOGSPACE parameter (see previous post), the DBCC SQLPERF utility has several undocumented parameters which provide statistics about schedulers, threads, spinlocks, IO, network, read-aheads, respectively waits.

Scheduler Statistics

By providing 'umsstats' as parameter, the utility returns the visible UMS schedulers on the system:

-- visible UMS schedulers
DBCC SQLPERF(umsstats)

Output:
Statistic | Value
Node Id | 0
Avg Sched Load | 5
Sched Switches | 6903
Sched Pass | 6721358
IO Comp Passes | 11334
Scheduler ID | 0
    online | 1
    num tasks | 6
    num runnable | 0
    num workers | 9
    active workers | 6
    work queued | 0
    cntxt switches | 7125444
    cntxt switches(idle) | 9898903
    preemptive switches | 2304
Scheduler ID | 1
    online | 1
    num tasks | 6
    num runnable | 0
    num workers | 9
    active workers | 5
    work queued | 0
    cntxt switches | 3370432
    cntxt switches(idle) | 4427991
    preemptive switches | 22729

The fields have the following meaning: 
Statistic | Description
Node Id | 
Avg Sched Load | 
Sched Switches | The number of switches between schedulers
Sched Pass | 
IO Comp Passes | 
Scheduler ID | The scheduler's zero-based ID number
num tasks | The number of tasks associated with the scheduler
num runnable | The number of workers on the runnable list
num workers | The total number of workers associated with the scheduler
idle workers | The number of idle workers
work queued | The number of items waiting to be processed in the work queue
cntxt switches | The number of switches between workers for the scheduler
cntxt switches(idle) | The number of times the idle loop was switched into

The functionality is useful when one suspects that there are issues related to the visible schedulers.

Detailed information about the schedulers can be found also via the sys.dm_os_schedulers DMV.
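
For example, the counters from above map roughly to the following columns (the mapping is my own reading, not an official one):

-- visible online schedulers
SELECT scheduler_id
, current_tasks_count -- ~ num tasks
, runnable_tasks_count -- ~ num runnable
, current_workers_count -- ~ num workers
, active_workers_count -- ~ active workers
, work_queue_count -- ~ work queued
, context_switches_count -- ~ cntxt switches
, idle_switches_count -- ~ cntxt switches(idle)
, preemptive_switches_count -- ~ preemptive switches
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE'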

Threads Statistics

By providing 'threads' as parameter, the utility returns the threads created in the system:

--  thread statistics
DBCC SQLPERF(threads)

Output (just the first records):
Spid | Thread ID | Status | LoginName | IO | CPU | MemUsage
1 | 11228 | background | sa | 0 | 0 | 0
2 | 12572 | background | sa | 0 | 0 | 0
3 | 12576 | background | sa | 0 | 0 | 0
4 | 11884 | background | sa | 0 | 0 | 0
5 | 12964 | background | sa | 0 | 0 | 0
6 | 12960 | background | sa | 0 | 0 | 4
7 | 12968 | background | sa | 0 | 0 | 0

Detailed information about the threads can be found also via the sys.dm_os_threads DMV.
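
For example, here's a sketch which ties the threads back to their sessions via the tasks and workers:

-- threads with their sessions
SELECT TSK.session_id
, THR.os_thread_id
, THR.kernel_time
, THR.usermode_time
, THR.started_by_sqlservr
FROM sys.dm_os_tasks TSK
     JOIN sys.dm_os_workers WRK
       ON TSK.worker_address = WRK.worker_address
     JOIN sys.dm_os_threads THR
       ON WRK.thread_address = THR.thread_address
ORDER BY TSK.session_id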

IO Statistics

By providing 'iostats' as parameter, the utility returns a count of the outstanding reads and writes:

--  IO statistics
DBCC SQLPERF(iostats)

Output:
Statistic | Value
Reads Outstanding | 0
Writes Outstanding | 0

Network Statistics

By providing 'netstats' as parameter, the utility returns network-related statistics:

--  network statistics
DBCC SQLPERF(netstats)

Output:
Statistic | Value
Network Reads | 6976
Network Writes | 9036
Network Bytes Read | 5318957
Network Bytes Written | 2.222512E+07
Command Queue Length | 0
Max Command Queue Length | 0
Worker Threads | 0
Max Worker Threads | 0
Network Threads | 0
Max Network Threads | 0

Read Ahead Statistics

By providing 'rastats' as parameter, the utility returns read-ahead statistics:

--  read ahead statistics
DBCC SQLPERF(rastats)

Output:
Statistic | Value
RA Pages Found in Cache | 0
RA Pages Placed in Cache | 0
RA Physical IO | 0
Used Slots | 0

Spinlock Statistics

By providing 'spinlockstats' as parameter, the utility returns the spinlock statistics, where a spinlock is a lightweight synchronization object used to serialize access to data structures which are typically held for a short period of time:

--  spinlock statistics
DBCC SQLPERF(spinlockstats)

Output:
Spinlock Name | Collisions | Spins | Spins/Collision | Sleep Time (ms) | Backoffs
LOCK_RW_TEST | 0 | 0 | 0 | 0 | 0
LOCK_RW_SECURITY_CACHE | 711 | 210500 | 296.0619 | 0 | 54
LOCK_RW_CMED_HASH_SET | 5 | 1750 | 350 | 0 | 1
LOCK_RW_ABTX_HASH_SET | 0 | 0 | 0 | 0 | 0
LOCK_RW_RBIO_REQ | 0 | 0 | 0 | 0 | 0

Detailed information about the spinlock stats can be found also via the sys.dm_os_spinlock_stats DMV.
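
For example:

-- spinlock statistics via the DMV
SELECT name
, collisions
, spins
, spins_per_collision
, sleep_time
, backoffs
FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC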

Wait Statistics

By providing 'waitstats' as parameter, the utility returns the available wait statistics:

-- wait statistics 
DBCC SQLPERF(waitstats)

Output (only a few records):
Wait Type | Requests | Wait Time | Signal Wait Time
LCK_M_SCH_M | 1 | 8597 | 3
LCK_M_S | 13 | 16895 | 0
PAGEIOLATCH_SH | 1789 | 38568 | 25
PAGEIOLATCH_UP | 118 | 101 | 0
PAGEIOLATCH_EX | 736 | 10490 | 16

Detailed information about the wait stats can be found also via sys.dm_os_wait_stats DMV.
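
For example, here's a simple query which also derives the resource wait time:

-- wait statistics via the DMV
SELECT wait_type
, waiting_tasks_count
, wait_time_ms
, signal_wait_time_ms
, wait_time_ms - signal_wait_time_ms resource_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE waiting_tasks_count > 0
ORDER BY wait_time_ms DESC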

Notes:
As Microsoft warns, the undocumented features shouldn't be used in production environments as they can be deprecated in future versions. The documented DMVs should be used instead, when available.
All objects mentioned above require VIEW SERVER STATE permissions.
The DBCC SQLPERF utility also allows resetting the wait, spinlock, respectively the latch statistics by providing the following parameters (see the SQL Docs for more information):

-- resetting the wait statistics
DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR)

-- resetting the spinlock statistics
DBCC SQLPERF('sys.dm_os_spinlock_stats', CLEAR)

-- resetting the latch statistics
DBCC SQLPERF('sys.dm_os_latch_stats', CLEAR)

DBCC PERFMON provides in a single call the IO, network, read-ahead, spinlock and wait statistics in distinct result sets:

-- performance statistics
DBCC PERFMON

06 July 2020

SQL Server Administration: Undocumented III (SQL Server CPU Utilization via the Ring Buffer)

Introduction

If no proper monitoring solution of the SQL Server and the hosting server is in place to review the CPU utilization, one can use the Scheduler Monitor buffer provided by the undocumented sys.dm_os_ring_buffers dynamic management view (DMV). Introduced with SQL Server 2005, the DMV provides a significant amount of diagnostic memory information in XML form via several buffers: Resource Monitor, Out-of-Memory, Memory Broker, Buffer Pool, respectively Scheduler Monitor [2]. A ring buffer is a recorded response to a notification [1].

The view changed between the various versions of SQL Server, while further ring buffers related to Always On availability groups were made available as well (see [5]).

Warning:
According to Microsoft (see [4]), sys.dm_os_ring_buffers is provided for informational purposes only, and future compatibility beyond SQL Server 2019 is not guaranteed!

Querying the Scheduler Monitor Buffer

Within the Scheduler Monitor buffer, the DMV stores a history of about 4 hours of uptime with minute-by-minute data points (in total 256 entries) containing the CPU utilization of the SQL Server process, of the other processes, respectively the system idle time, as percentages. It thus allows identifying the peaks in CPU utilization, and therewith the intervals to focus on for further troubleshooting. As the data are stored within an XML structure, the values can be queried via the XQuery syntax as follows:

-- cpu utilization for SQL Server and other applications
DECLARE @ts_now bigint = (SELECT cpu_ticks/(cpu_ticks/ms_ticks)
        FROM sys.dm_os_sys_info); 

SELECT DAT.record_id
, DAT.EventTime
, DAT.SQLProcessUtilization 
, DAT.SystemIdle 
, 100 - (DAT.SystemIdle + DAT.SQLProcessUtilization) OtherUtilization
FROM ( 
	SELECT record.value('(./Record/@id)[1]', 'int') record_id
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') SystemIdle 
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') SQLProcessUtilization
	, EventTime 
	FROM ( 
		SELECT DATEADD(ms, -1 * (@ts_now - [timestamp]), GETDATE()) EventTime
		, [timestamp]
		, CONVERT(xml, record) AS [record] 
		FROM sys.dm_os_ring_buffers 
		WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR' 
		  AND record LIKE N'%<SystemHealth>%') AS x 
	) AS DAT
ORDER BY DAT.record_id DESC;

If the SQL Server is not busy at all, the SQL Server utilization may tend toward 0%, while the system idle time toward 90%. (It's the case of my SQL Server lab.)

CPU Utilization for my home SQL Server lab

Notes:
If the server was restarted within the last 4 hours, then the data points will show a gap between two readings corresponding to the downtime interval.
The query is supposed to run also on Linux machines, though the SystemIdle value will be 0. One can still consider the SQL and non-SQL CPU utilization.

Storing the History

The above query can be run on a regular basis (e.g. every 3-4 hours) via an SSIS package which pushes the data into a table for historical purposes. Because a continuous history of the readings is needed, it's better if the gap between runs is smaller than the 4 hours. No matter the approach used, it's better to check for overlaps when storing the data:

-- dropping the table
-- DROP TABLE IF EXISTS dbo.T_RingBufferReadings 

-- reinitializing the history
-- TRUNCATE TABLE dbo.T_RingBufferReadings

-- creating the table
CREATE TABLE dbo.T_RingBufferReadings (
  Id bigint IDENTITY (1,1) NOT NULL
, RecordId bigint 
, EventTime datetime2(3) NOT NULL
, SQLProcessUtilization int NOT NULL
, SystemIdle int NOT NULL
, OtherUtilization int NOT NULL
)


-- reviewing the data
SELECT *
FROM dbo.T_RingBufferReadings 
ORDER BY EventTime DESC
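
Because the buffer keeps only about 4 hours of history, it can be useful to check the stored readings for gaps. Here's a minimal sketch based on the LAG window function; a gap shows up where two consecutive readings are more than a couple of minutes apart:

-- checking for gaps between consecutive readings
SELECT *
FROM (
	SELECT EventTime
	, LAG(EventTime) OVER (ORDER BY EventTime) PrevEventTime
	FROM dbo.T_RingBufferReadings
	) DAT
WHERE DATEDIFF(minute, PrevEventTime, EventTime) > 2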

If there are many records, one can also create an index to improve the performance; the index can include the reading points as well:

-- creating a unique index with an include 
CREATE UNIQUE NONCLUSTERED INDEX [UI_T_RingBufferReadings_EventTime] ON dbo.T_RingBufferReadings
(
	EventTime ASC,
    RecordId ASC
) INCLUDE (SQLProcessUtilization, SystemIdle, OtherUtilization)
GO

The above query based on the DMV becomes:

-- cpu utilization by SQL Server and other applications
DECLARE @ts_now bigint = (SELECT cpu_ticks/(cpu_ticks/ms_ticks)
        FROM sys.dm_os_sys_info); 

INSERT INTO dbo.T_RingBufferReadings
SELECT record_id
, DAT.EventTime
, DAT.SQLProcessUtilization 
, DAT.SystemIdle 
, 100 - (DAT.SystemIdle + DAT.SQLProcessUtilization) OtherUtilization
FROM ( 
	SELECT record.value('(./Record/@id)[1]', 'int') record_id
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') SystemIdle 
	, record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') SQLProcessUtilization
	, EventTime 
	FROM ( 
		SELECT DATEADD(ms, -1 * (@ts_now - [timestamp]), GETDATE()) EventTime
		, [timestamp]
		, CONVERT(xml, record) AS [record] 
		FROM sys.dm_os_ring_buffers 
		WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR' 
		  AND record LIKE N'%<SystemHealth>%') AS x 
	) AS DAT
	LEFT JOIN dbo.T_RingBufferReadings RBR
	  ON DAT.record_id = RBR.Recordid 
WHERE RBR.Recordid IS NULL
ORDER BY DAT.record_id DESC;

Note:
A ServerName column can be added to the table if the values from several SQL Servers need to be stored (see the sketch below). The LEFT JOIN then has to consider the newly added column as well.
Either of the two queries can be used to display the data points within a chart via SSRS, Power BI or any reporting tool available. 
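
Here's a minimal sketch for the ServerName extension (the column and constraint names are my own choice):

-- adding a ServerName column defaulting to the current server
ALTER TABLE dbo.T_RingBufferReadings
  ADD ServerName nvarchar(128) NOT NULL
  CONSTRAINT DF_T_RingBufferReadings_ServerName DEFAULT @@SERVERNAME

-- the LEFT JOIN from above becomes:
-- LEFT JOIN dbo.T_RingBufferReadings RBR
--   ON DAT.record_id = RBR.RecordId
--  AND RBR.ServerName = @@SERVERNAME

The INSERT statement then needs an explicit column list, the ServerName being either provided explicitly or left to the default.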

Happy coding!

References:
[1] Grant Fritchey (2014) SQL Server Query Performance Tuning: Troubleshoot and Optimize Query Performance in SQL Server 2014, 4th Ed.
[2] Sunil Agarwal et al (2005), Troubleshooting Performance Problems in SQL Server 2005, Source: TShootPerfProbs.docx
[3] Sunil Agarwal et al (2008), Troubleshooting Performance Problems in SQL Server 2008, Source: TShootPerfProbs2008.docx
[4] Microsoft SQL Docs (2018) Related Dynamic Management Views, Source
[5] Microsoft SQL Docs (2017) Use ring buffers to obtain health information about Always On availability groups, Source

25 June 2020

SQL Server Administration: Undocumented I (Reading the SQL Server Log)

Starting with SQL Server 2005, Microsoft made available the sys.xp_readerrorlog system stored procedure, which allows reading the SQL Server as well as the SQL Server Agent logs one by one. Since I learned about it, I prefer using it to search for errors in the log instead of the Management Studio, because it allows more flexibility and often seems to be much faster than the Log File Viewer.

The stored procedure exposes the following parameters:
p1: numeric value identifying the index of the log file: 0 = current, 1 = archive #1, 2 = archive #2, ...
p2: log file type: 1 or NULL = SQL Server log, 2 = SQL Agent log
p3: the string to search for (e.g. Error)
p4: a second string to search for (allows refining the results)
p5: start time of the events to be considered
p6: end time of the events to be considered
p7: sorting order for results: N'asc' = ascending, N'desc' = descending

The only required parameter is the first, thus the stored procedure can be called as follows:

-- reading the current SQL Server log (to be run individually)
EXEC xp_readerrorlog 0
EXEC xp_readerrorlog 0, 1
EXEC xp_readerrorlog 0, 1, N'Login failed' -- filters for failed logins 
EXEC xp_readerrorlog 0, 1, N'DBCC CHECKDB' -- filters for database consistency checks 
EXEC xp_readerrorlog 0, 1, N'Error' -- filters the Errors
EXEC xp_readerrorlog 0, 1, N'Error', N'Severity: 14' -- filters the Errors with severity 14
EXEC xp_readerrorlog 0, 1, N'', N'', '20200601', NULL, NULL -- start date
EXEC xp_readerrorlog 0, 1, N'', N'', '20200601', '20200607', NULL -- start date & end date 
EXEC xp_readerrorlog 0, 1, N'', N'', NULL, NULL, N'asc' -- ascending sorting

-- reading the last SQL Server log
EXEC xp_readerrorlog 1, 1


-- reading the current SQL Agent log
EXEC xp_readerrorlog 0, 2

Especially when filtering for certain information, the output can be easily displayed in an SSRS report. Besides errors, I'm typically looking for failed logins or the results of consistency checks.

The number of log files available depends on how the SQL Server Logs were configured.
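
The available log files can be listed via xp_enumerrorlogs, another undocumented stored procedure, which returns the archive number, the date and the size of each log file (assuming it takes the same log type parameter as xp_readerrorlog, where 2 targets the SQL Server Agent logs):

-- listing the available SQL Server logs
EXEC sys.xp_enumerrorlogs

-- listing the available SQL Server Agent logs
EXEC sys.xp_enumerrorlogs 2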

Even if the stored procedure provides the needed information fast, an error and its corresponding description are shown in separate records with the same timestamp. To process the records one can use a temporary table matching the output's definition. To facilitate the selection of consecutive records a primary key can be added as well. One can dump some or all of the log files into the table and perform various searches or analyses.

-- dropping the table
--DROP TABLE IF EXISTS #tbl 

-- creating a temporary table
CREATE TABLE #tbl (
  id int IDENTITY(1,1) NOT NULL
, LogDate datetime2(3)
, ProcessInfo nvarchar(50)
, LogText varchar(max))    
  
-- reading the current log into it
INSERT INTO #tbl        
EXEC sys.xp_readerrorlog 0, 1

-- reading the last archive log into it
INSERT INTO #tbl        
EXEC sys.xp_readerrorlog 1, 1

-- retrieving the errors together with their descriptions 
SELECT A.LogDate
, A.ProcessInfo
, A.LogText
, B.LogText
FROM #tbl A
     JOIN #tbl B
    ON A.Id = B.Id - 1
WHERE A.LogText LIKE '%Error:%'

Here's the output of the last query (it will look different on your server):


The temporary table can be used for further analyses (e.g. displaying the first, respectively last occurrence of the error together with a count):

-- first/last occurence of an error 
SELECT Min(A.LogDate) FirstOccurence
, Max(A.LogDate) LastOccurence
, count(*) NoRecords 
, A.LogText
, B.LogText
FROM #tbl A
     JOIN #tbl B
    ON A.Id = B.Id - 1
WHERE A.LogText LIKE '%Error:%'
GROUP BY A.LogText
, B.LogText


Warnings:
Do not forget to drop the temporary table when finished!
The code is provided only for exemplification purposes. You can use the above code at your own risk!
Be careful when the number of records in the logs is too big, because reading all the records can create temporary performance issues!
Undocumented features can be deprecated in future versions of SQL Server, therefore they should be used with caution in long-term solutions.

Happy coding!

29 October 2018

SQL Server Troubleshooting: Searching the Error Log

    Searching for a needle in a haystack is an achievable task, though it may turn out to be daunting. The same can be said about searching for a piece of information in the SQL Server error log. Fortunately, there is xp_readerrorlog, an undocumented (extended) stored procedure, which helps in the process. The stored procedure makes available the content of the error log and provides basic search capabilities via a small set of parameters. For example, it can be used to search for errors, warnings, failed backups, consistency checks, failed logins, database instant file initialization events, and so on. It helps identify whether an event occurred and the time at which it occurred.

   The following are the parameters available with the stored procedure:

Parameter | Name | Type | Description
1 | FileToRead | int | 0 = current, 1 or 2, 3, … n = archive number
2 | LogType | int | 1 = SQL Server error log, 2 = SQL Agent log
3 | String1 | varchar(255) | the string to match the logs on
4 | String2 | varchar(255) | a second string to match in combination with String1 (AND)
5 | StartDate | datetime | beginning date to look from
6 | EndDate | datetime | ending date to look up to
7 | ResultsOrder | | ASC or DESC sorting


Note:
If the SQL Server Agent hasn’t been active, then there will be no Agent log and the call to the stored procedure will return an error.

   Here are a few examples of using the stored procedure:

-- listing the content of the current SQL Server error log
EXEC xp_readerrorlog 0, 1

-- listing the content of the second SQL Server error log
EXEC xp_readerrorlog 1, 1

-- listing the content of the current SQL Server Agent log
EXEC xp_readerrorlog 0, 2

-- searching for errors 
EXEC xp_readerrorlog 0, 1, N'error'

-- searching for errors that have to do with consistency checks
EXEC xp_readerrorlog 0, 1, N'error', N'CHECKDB'

-- searching for failed backups
EXEC xp_readerrorlog 0, 1, N'failed', N'backups'

-- searching for warnings 
EXEC xp_readerrorlog 0, 1, N'warning'

-- searching who killed a session
EXEC xp_readerrorlog 0, 1, N'kill'

-- searching for I/O information
EXEC xp_readerrorlog 0, 1, N'I/O'

-- searching for consistency checks 
EXEC xp_readerrorlog 0, 1, N'CHECKDB'

-- searching for consistency checks performed via DBCC
EXEC xp_readerrorlog 0, 1, N'DBCC CHECKDB'

-- searching for failed logins  
EXEC xp_readerrorlog 0, 1, N'Login failed'

-- searching for entries marked as [INFO]
EXEC xp_readerrorlog 0, 1, N'[INFO]'

-- searching for shutdowns 
EXEC xp_readerrorlog 0, 1, N'shutdown'

-- searching for a database instant file initialization event  
EXEC xp_readerrorlog 0, 1, N'database instant file initialization'

   If the error log is too big, it's important to narrow the search to a given time interval:

-- searching for errors starting with a given date 
DECLARE @StartDate as Date = DateAdd(d, -1, GetDate())
EXEC xp_readerrorlog 0, 1, N'error', N'', @StartDate

-- searching for errors within a time interval 
DECLARE @StartDate as Date = DateAdd(d, -14, GetDate())
DECLARE @EndDate as Date = DateAdd(d, -7, GetDate())
EXEC xp_readerrorlog 0, 1, N'', N'', @StartDate, @EndDate, N'desc' 

   The output can be dumped into a table, especially when a detailed analysis of the error log is needed. It might be interesting to check how often an error message occurred, like in the below example. One can thus take advantage of more complex pattern matching.

-- creating the error log table 
CREATE TABLE dbo.ErrorLogMessages (
    LogDate datetime2(0) 
  , ProcessInfo nvarchar(255)
  , [Text] nvarchar(max))

-- loading the errors 
INSERT INTO dbo.ErrorLogMessages
EXEC xp_readerrorlog 0, 1

-- checking the results 
SELECT *
FROM dbo.ErrorLogMessages

-- checking messages frequency 
SELECT [Text]
, count(*) NoOccurrences
, Min(LogDate) FirstOccurrence
FROM dbo.ErrorLogMessages
GROUP BY [Text]
HAVING count(*)>1
ORDER BY NoOccurrences DESC

-- getting the errors and their information 
SELECT *
FROM (
 SELECT *
 , Lead([Text], 1) OVER (PARTITION BY LogDate, ProcessInfo ORDER BY LogDate) NextMessage
 FROM dbo.ErrorLogMessages
 ) DAT
WHERE [Text] LIKE '%error:%[0-9]%'

-- cleaning up 
--DROP TABLE IF EXISTS dbo.ErrorLogMessages 

   For those who don’t have admin permissions, it is necessary to explicitly grant execute permissions on the xp_readerrorlog stored procedure:

-- giving explicit permissions to account
GRANT EXECUTE ON xp_readerrorlog TO [<account_name>]

   Personally, I’ve been using the stored procedure mainly to check whether error messages were logged for a given time interval and whether the consistency checks ran without problems. Occasionally, I used it to check for failed logins or session terminations (aka kills).

Notes:
Microsoft warns that undocumented objects might change in future releases. Fortunately, xp_readerrorlog made it from SQL Server 2005 all the way to SQL Server 2017, so it might make it further…
The above code was tested also on SQL Server 2017.

Happy coding!
