Showing posts with label statistics. Show all posts
Showing posts with label statistics. Show all posts

26 February 2024

R Language: Data Summaries without Using a DataFrame

Coming back to the R language after several years and trying to remember some basic functions proved to be a bit challenging, even if the syntax is quite simple. Therefore, I considered putting together a few calls as refresher based on Youden-Beale data. To run the below code you'll need to install the R language and RStudio.

In case you don't have the package installed, run the next two lines:

install.packages("ACSWR") #install the Youden-Beale Experiment package
library(ACSWR)	#load the library
 
str(yb)		#display datasets' structure

  'data.frame': 8 obs. of 2 variables:
$ Preparation_1: int  31  20  18  17  9  8 10  7
$ Preparation_2: int  18  17  14  11 10 7   5  6

yb		#display the dataset

Preparation_1 Preparation_2
1          31                  18
2          20                  17
3          18                  14
4          17                  11
5            9                  10
6            8                   7
7          10                   5
8            7                   6

summary(yb) 	#display the summary for whole dataset

Preparation_1     Preparation_2
Min. : 7.00          Min. : 5.00
1st Qu.: 8.75       1st Qu.: 6.75
Median :13.50     Median :10.50
Mean :15.00        Mean :11.00
3rd Qu.:18.50      3rd Qu.:14.75
Max. :31.00         Max. :18.00

summary(yb$Preparation_1)	#display the summary for first column

Min. 1st Qu. Median   Mean   3rd Qu.   Max.
7.00      8.75     13.50   15.00     18.50    31.00

summary(yb$Preparation_2)	#display the summary for second column

Min. 1st Qu. Median    Mean   3rd Qu.  Max.
5.00     6.75      10.50    11.00     14.75   18.00

min(yb)	#display the minimum value for the whole dataset

[1] 5

min(yb$Preparation_1)	#display the mininun of first column

[1] 7

min(yb$Preparation_2)	#display the minimum of second column

[1] 5

sum(yb)	#display the sum of all values

[1] 208

sum(yb$Preparation_1)	#display the sum of first column

[1] 120

sum(yb$Preparation_2)	#display the sum of second column

[1] 88

#display the percentiles 
quantile(yb$Preparation_1,seq(0,1,.25))

0%    25%   50%   75%   100%
7.00  8.75  13.50  18.50  31.00

#display the percentiles 
quantile(yb$Preparation_2,seq(0,1,.25))

0%   25%   50%   75%   100%
5.00  6.75 10.50  14.75   18.00

#display the percentiles 
quantile(yb$Preparation_2,seq(0,1,.25))

0%  10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
7.0    7.7     8.4    9.1     9.8  13.5   17.2  17.9  19.2   23.3   31.0

quantile(yb$Preparation_2,seq(0,1,.1))

0%   10%   20%  30%   40% 50%  60% 70%  80%  90% 100%
5.0     5.7     6.4      7.3     9.4 10.5   11.6 13.7  15.8   17.3  18.0

length(yb) 	#display the number of items 
ncol(yb) 	#display the number of columns

[1] 2

sort(yb$Preparation_1) #display the sorted values ascendingly 

[1] 7 8 9 10 17 18 20 31

sort(yb$Preparation_1, decreasing = TRUE)

[1] 31 20 18 17 10 9 8 7

#display a vertical poxplot
boxplot(yb, notch=FALSE)
title("A: Vertical Boxplot for Youden-Beale Data")

#display an horizontal poxplot
boxplot(yb, horizontal = TRUE)
title("B: Horizontal Boxplot for Youden-Beale Data")


 
plot(yb) #scatter diagram
title("Scatter diagram")

lsfit(yb$Preparation_1, yb$Preparation_2)$coefficients #list square fit coefficients 

Intercept         X 
2.8269231 0.5448718 
 
lsfit(yb$Preparation_1, yb$Preparation_2)$residuals #list square fit residuals

[1] -1.7179487  3.2756410  1.3653846 -1.0897436  2.2692308 -0.1858974
[7] -3.2756410 -0.6410256

  Happy coding!

22 August 2023

Book Review: Laurent Bossavit's The Leprechauns of Software Engineering (2015)




Software Engineering should be the "establishment and use of sound engineering principles to obtain economically software that is reliable and works on real machines efficiently" [2]. Working for more than 20 years in the field I feel sometimes that its foundation is a strange mix of sound and questionable ideas that take the form of methodologies, principles, standards, myths, folklore, statistics and other similar concepts that form its backbone.

I tend to look with critical eyes at the important numbers advanced in research and pseudo-scientific papers especially when they’re related to my job, this because I know that statistics are seldom what they appear to be - there are accidental and sometimes even intended errors made to support the facts. Unfortunately, the missing row data and often the information about the methodologies used in collecting and processing the respective data make numbers and/or graphics' understanding more challenging, not to mention the considerable amount of effort and time spent to uncover the evidence trail.
Fortunately, there are other professionals who went further down the path of bibliographical references and shared their findings in blogs, papers, books and other media content. It’s also the case of Laurent Bossavit, who in his book, "The Leprechauns of Software Engineering" (2015), looks behind some of the numbers that over time become part of the leprechaunish folklore of IT professionals, puts them into the historical context and provides in appendix the evidence trails for the reader to validate his findings. Over several chapters the author focuses mainly on the cost of defects, Boehm’s cone of uncertainty, the differences in productivity amount individual programmers (aka 10x claim), respectively the relation between poor requirements and defects.

His most important finding is that the references used in most of the researched sources advancing the above numbers were secondary, while the actual sources provide no direct information of empirical data or the methodology for its collection. The way the numbers are advanced and used makes one question the validity of the measurements performed, respectively the character of the mistakes the authors made. Many of the cited papers hardly match the academic requirements of other scientific fields, being a mix of false claims, improperly conducted research and citations.

Secondly, he argues that the small sample sizes used as basis for the experiments, the small population formed usually of students, respectively the way numbers were mixed without any reliable scientific character makes him (and the reader as well) question even more how the experiments were performed in the respective papers. With this, it is more likely that a bigger number of research based on these sources should raise further concerns. The reader can thus ask himself/herself how deep the domino effect goes inside of the Software Engineering field.

In author’s opinion Software Engineering as social process "needs to be studied with tools that borrow as much from the social and cognitive sciences as they do from the mathematical theories of computation". How much is possible to extend the theories and models of the respective fields is an open topic. The bottom line, the field of Software Engineering needs better and scientific empirical experiments that are based on commonly agreed definitions, data collection and processing techniques, respectively higher standards for research publications. Without this, we’ll continue to compare apples with peaches and mix them in calculations so we can get some stories that support our leprechaunish theories.

Overall, the book is a good read for software engineers as well as for other IT professionals. Even if it barely scratched the surface of software myths and folklore, there’s enough material for the readers who want to dive deeper.

Previous Post <<||>> Next Post

References:
[1] Laurent Bossavit (2015) "The Leprechauns of Software Engineering"
[2] Friedrich Bauer (1972) "Software Engineering", Information Processing

22 February 2023

SQL Reloaded: Automatic Statistics Creation & Dropping for CETAS based on CSV File Format in Serverless SQL Pool

Introduction

The serverless SQL pool query optimizer uses statistics to calculate and compare the cost of various query plans, and then choose the plan with the lowest cost. Automatic creation of statistics is turned on for parquet file format, though for CSV file format statistics will be automatically created only when OPENROWSET is used. This means that when creating CETAS based on CSV the statistics need to be created manually. 

This would be one more reason for holding the files in the Data Lake as parquet files. On the other side there are also many files already available in CSV format, respectively technoloqies that allows exporting data only/still as CSV. Moreover, transforming the files as parquet is not always technically feasible.

Using OPENROWSET could also help, though does it make sense to use a different mechanismus for the CSV file format? In some scenarios will do. I prefer to have a unitary design, when possible. Moreover, even if some columns are not needed, they can still be useful for certain scenarios (e.g. troubleshooting, reevaluating their use, etc.). 

There are files, especially the ones coming from ERPs (Enterprise Resource Planning) or similar systems, which have even a few hundred columns (on average between 50 and 100 columns). Manually creating  the statistics for the respective tables will cost lot of time and effort. To automate the process there are mainly three choices:
(1) Creating statistics for all the columns for a given set of tables (e.g. for a given schema).
(2) Finding a way to automatically identify the columns which are actually used.
(3) Storing the list of tables and columns on which statistics should be build (however the list needs to be maintained manually). 

Fortunately, (1) can be solved relatively easy, based on the available table metadata, however it's not the best solution, as lot of statistics will be unnecessarily created. (2) is possible under certain architectures or additional effort. (3) takes time, though it's also an approachable solution.

What do we need?

For building the solution, we need table and statistics metadata, and the good news is that the old SQL Server queries still work. To minimize code's repetition, it makes sense to encapsulate the logic in views. For table metadata one can use the sys.objects DMV as is more general (one can replace sys.objects with sys.tables to focus only on tables):

-- drop the view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vAdminObjectColumns

-- create view
CREATE OR ALTER VIEW dbo.vObjectColumns
AS 
-- object-based column metadata
SELECT sch.name + '.' + obj.name two_part_name
, sch.Name schema_name
, obj.name object_name
, col.name column_name
, obj.type
, CASE 
	WHEN col.is_ansi_padded = 1 and LEFT(udt.name , 1) = 'n' THEN col.max_length/2
	ELSE col.max_length
  END max_length
, col.precision 
, col.scale
, col.is_nullable 
, col.is_identity
, col.object_id
, col.column_id
, udt.name as data_type
, col.collation_name
, ROW_NUMBER() OVER(PARTITION BY col.object_id ORDER BY col.column_id) ranking FROM sys.columns col JOIN sys.types udt on col.user_type_id= udt.user_type_id JOIN sys.objects obj ON col.object_id = obj.object_id JOIN sys.schemas as sch on sch.schema_id = obj.schema_id -- testing the view SELECT obc.* FROM dbo.vObjectColumns obc WHERE obc.object_name LIKE '<table name>%' AND obc.schema_name = 'CRM' AND obc.type = 'U' ORDER BY obc.two_part_name , Ranking

The view can be used also as basis for getting the defined stats:

-- drop the view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vAdminObjectStats

-- create view 
CREATE OR ALTER VIEW dbo.vObjectStats
AS
-- object-based column statistics
SELECT obc.two_part_name + '.' + QuoteName(stt.name) three_part_name
, obc.two_part_name
, obc.schema_name
, obc.object_name
, obc.column_name
, stt.name stats_name
, STATS_DATE(stt.[object_id], stt.stats_id) AS last_updated
, stt.auto_created
, stt.user_created
, stt.no_recompute
, stt.has_filter 
, stt.filter_definition
, stt.is_temporary 
, stt.is_incremental 
, stt.auto_drop 
, stt.stats_generation_method_desc
, stt.[object_id]
, obc.type 
, stt.stats_id
, stc.stats_column_id
, stc.column_id
FROM dbo.vObjectColumns obc
     LEFT JOIN sys.stats_columns stc 
	   ON stc.object_id = obc.object_id
	  AND stc.column_id = obc.column_id 
          LEFT JOIN sys.stats stt
            ON stc.[object_id] = stt.[object_id] 
           AND stc.stats_id = stt.stats_id

-- testing the view 
SELECT * FROM dbo.vObjectStats obs WHERE (obs.auto_created = 1 OR obs.user_created = 1) AND obs.type = 'U' AND obs.object_name = '<table name>' ORDER BY obs.two_part_name , obs.column_id

Now we have a basis for the next step. However, before using the stored procedure define below, one should use the last query and check whether statistics were defined before on a table. Use for testing also a table for which you know that statistics are available.

Create Statistics

The code below is based on a similar stored procedure available in the Microsoft documentation (see [1]). It uses a table's column metadata, stores them in a temporary table and then looks through each record, create the DDL script and runs it:

-- drop procedure (for cleaning)
--DROP PROCEDURE dbo.pCreateStatistics

-- create stored procedure
CREATE OR ALTER PROCEDURE dbo.pCreateStatistics
(   @schema_name nvarchar(50)
,   @table_name nvarchar(100)
)
AS
-- creates statistics for serverless SQL pool
BEGIN
	DECLARE @query as nvarchar(1000) = ''
	DECLARE @index int = 1, @nr_records int = 0

	-- drop temporary table if it exists 
	DROP TABLE IF EXISTS #stats_ddl;

	-- create temporary table 
	CREATE TABLE #stats_ddl( 
	  schema_name nvarchar(50)
	, table_name nvarchar(128)
	, column_name nvarchar(128)
	, ranking int
	);

	-- fill table
	INSERT INTO #stats_ddl
	SELECT obc.schema_name
	, obc.object_name
	, obc.column_name 
	, ROW_NUMBER() OVER(ORDER BY obc.schema_name, obc.object_name) ranking
	FROM dbo.vObjectColumns obc
	WHERE obc.type = 'U' -- tables
	  AND IsNull(@schema_name, obc.schema_name) = obc.schema_name 
	  AND IsNull(@table_name, obc.object_name) = obc.object_name

	SET @nr_records = (SELECT COUNT(*) FROM #stats_ddl)

	WHILE @index <= @nr_records
	BEGIN
		SET @query = (SELECT 'CREATE STATISTICS '+ QUOTENAME('stat_' + schema_name + '_' + table_name + '_' + column_name) + ' ON '+ QUOTENAME(schema_name) + '.' + QUOTENAME(table_name) + '(' + QUOTENAME(column_name) + ')' 
			   FROM #stats_ddl ddl
			   WHERE ranking = @index);

		BEGIN TRY
		        -- execute ddl
			EXEC sp_executesql @query;
		END TRY
		BEGIN CATCH
			SELECT 'create failed for ' + @query;
		END CATCH

		SET @index+=1;
	END

	DROP TABLE #stats_ddl;
END


-- test stored procedure (various scenario)
EXEC dbo.pCreateStatistics '<schema name>', '<table name>' -- based on schema & table
EXEC dbo.pCreateStatistics '<schema name>', NULL -- based on a schema
EXEC dbo.pCreateStatistics NULL, '<table name>' -- based on a table

Notes:
IMPORTANT!!! I recommend testing the stored procedure in a test environment first for a few tables and not for a whole schema. If there are too many tables, this will take time.

Please note that rerunning the stored procedure without deleting previously the statitics on the tables in scope will make the procedure raise failures for each column (behavior by design), though the error messages can be surpressed by commenting the code, if needed. One can introduce further validation, e.g. considering only the columns which don't have a statistic define on them.

Further Steps?

What can we do to improve the code? It would be great if we could find a way to identify the columns which are used in the queries. It is possible to retrieve the queries run in serverless SQL pool, however identifying the tables and columns from there or a similar source is not a straightforward solution. 

The design of views based on the external tables can help in the process! I prefer to build on top of the external tables a first level of views (aka "base views") that include only the fields in use (needed by the business) ordered and "grouped" together based on their importance or certain characteristics. The views are based solely on the external table and thus contain no joins. They can include conversions of data types, translations of codes into meaningful values, and quite seldom filters on the data. However, for traceability the name of the columns don't change! This means that if view's name is easily identifiable based on external table's name, we could check view's columns against the ones of the external table and create statistics only for the respective columns. Using a unique prefix (e.g. "v") to derive views' name from tables' name would do the trick.

To do that, we need to create a view that reflects the dependencies between objects (we'll be interested only in external tables vs views dependencies):

-- drop view (for cleaning)
-- DROP VIEW IF EXISTS dbo.vObjectsReferenced

-- create view
CREATE OR ALTER VIEW dbo.vObjectsReferenced
AS 
-- retrieving the objects referenced 
SELECT QuoteName(sch.name) + '.' + QuoteName(obj.name) AS two_part_name 
, obj.object_id 
, obj.schema_id 
, sch.name schema_name 
, obj.name object_name 
, obj.type
, QuoteName(scr.name) + '.'+ QuoteName(sed.referenced_entity_name) AS ref_two_part_name 
, obr.object_id ref_object_id
, obj.schema_id ref_schema_id 
, scr.name ref_schema_name 
, obr.name ref_object_name 
, obr.type ref_type
FROM sys.sql_expression_dependencies sed 
     JOIN sys.objects obj
       ON obj.object_id = sed.referencing_id 
	      JOIN sys.schemas as sch
	        ON obj.schema_id = sch.schema_id 
	 JOIN sys.objects obr
	   ON sed.referenced_id = obr.object_id
	      JOIN sys.schemas as scr
	        ON obr.schema_id = scr.schema_id

-- testing the view
SELECT top 10 *
FROM dbo.vObjectsReferenced
WHERE ref_type = 'U'

With this, the query used above to fill the table becomes:

-- fill table query with column selection 
SELECT obc.schema_name
, obc.object_name
, obc.column_name 
, ROW_NUMBER() OVER(ORDER BY obc.schema_name, obc.object_name) ranking
FROM dbo.vObjectColumns obc
WHERE obc.type = 'U' -- tables
	AND IsNull(@schema_name, obc.schema_name) = obc.schema_name 
	AND IsNull(@table_name, obc.object_name) = obc.object_name
	AND EXISTS ( -- select only columns referenced in views
	    SELECT * 
		FROM dbo.vObjectsReferenced obr 
		    JOIN dbo.vAdminObjectColumns obt
			ON obr.object_id = obt.object_id 
		WHERE obt.type = 'V' -- view
		AND obr.object_name  =  'v' + obr.ref_object_name
		AND obc.object_id = obr.ref_object_id
		AND obc.column_name = obt.column_name);

This change will reduce the number of statistics created on average by 50-80%. Of course, there will be also cases in which further statistics need to be added manually. One can use this as input for an analysis of the columns used and store the metadata in a file, do changes to it and base on it statistics' creation. 

Drop Statistics

Dropping the indexes resumes to using the dbo.vObjectStats view created above for the schema and/or table provided as parameter. The logic is similar to statistics' creation:

-- drop stored procedure (for cleaning)
-- DROP PROCEDURE IF EXISTS dbo.pDropStatistics

-- create procedure
CREATE OR ALTER PROCEDURE dbo.pDropStatistics
(   @schema_name nvarchar(50)
,   @table_name nvarchar(128)
)
AS
-- drop statistics for a schema and/or external table in serverless SQL pool
BEGIN

	DECLARE @query as nvarchar(1000) = ''
	DECLARE @index int = 1, @nr_records int = 0

	-- drop temporary table if it exists 
	DROP TABLE IF EXISTS #stats_ddl;

	-- create temporary table 
	CREATE TABLE #stats_ddl( 
	 three_part_name nvarchar(128)
	, ranking int
	);

	-- fill table
	INSERT INTO #stats_ddl
	SELECT obs.three_part_name
	, ROW_NUMBER() OVER(ORDER BY obs.three_part_name) ranking
	FROM dbo.vObjectStats obs
	WHERE obs.type = 'U' -- tables
	  AND IsNull(@schema_name, obs.schema_name) = obs.schema_name 
	  AND IsNull(@table_name, obs.object_name) = obs.object_name

	SET @nr_records = (SELECT COUNT(*) FROM #stats_ddl)

	WHILE @index <= @nr_records
	BEGIN
   
		SET @query = (SELECT 'DROP STATISTICS ' + ddl.three_part_name
			   FROM #stats_ddl ddl
			   WHERE ranking = @index);

		BEGIN TRY
		        -- execute ddl
			EXEC sp_executesql @query;
		END TRY
		BEGIN CATCH
			SELECT 'drop failed for ' + @query;
		END CATCH

		SET @index+=1;
	END

	DROP TABLE #stats_ddl;
END

Note:
IMPORTANT!!!
I recommend testing the stored procedure in a test environment first for a few tables and not for a whole schema. If there are too many tables, this will take time.

Closing Thoughts

The solution for statistics' creation is not perfect, though it's a start! It would have been great if such a feature would be provided by Microsoft, and probably they will, given the importance of statistics of identifying an optimal plan. It would be intersting to understand how much statistics help in a distributed environment and what's the volume of data processed for this purpose. 

Please let me know if you found other workarounds for statistics's automation.

Happy coding!

References:
[1] Microsoft Learn (2022) Statistics in Synapse SQL (link)

13 January 2021

SQL Server Administration: Undocumented IV (DBCC SQLPERF)

Besides the documented LOGSPACE parameter (see previous post), DBCC SQLPERF utility has several undocumented parameters which allow providing statistics about schedulers, threads, spinlocks, IO, network, read-aheads, respectively waits.  

Scheduler Statistics

By providing 'umsstats' as parameter, the utility returns as result the visible UMS schedulers on the system:

-- visible UMS schedulers
DBCC SQLPERF(umsstats)

Output:
StatisticValue
Node Id0
Avg Sched Load5
Sched Switches6903
Sched Pass6721358
IO Comp Passes11334
Scheduler ID0
    online1
    num tasks6
    num runnable0
    num workers9
    active workers6
    work queued0
    cntxt switches7125444
    cntxt switches(idle)9898903
    preemptive switches2304
Scheduler ID1
    online1
    num tasks6
    num runnable0
    num workers9
    active workers5
    work queued0
    cntxt switches3370432
    cntxt switches(idle)4427991
    preemptive switches22729

The fields have the following meaning: 
StatisticDescription
Node Id
Avg Sched Load
Sched SwitchesThe number of switches between schedulers
Sched Pass
IO Comp Passes
Scheduler IDThe scheduler's zero-based ID number
num tasksThe number of tasks associated with the scheduler
num runnableThe number of workers on the runnable list
num workersThe total number of workers associated with the scheduler
idle workersThe number of idle workers
work queuedThe number of items waiting to be processed in the work queue
cntxt switchesThe number of switches between workers for the scheduler
cntxt switches(idle)The number of times the idle loop was switched into

The functionality is useful when one suspects that there are issues related to the visible schedulers.

Detailed information about the schedulers can be found also via the sys.dm_os_schedulers DMV.

Threads Statistics

By providing 'threads' as parameter, the utility returns as result the threads created in the system:

--  thread statistics
DBCC SQLPERF(threads)

Output (just the first records):
SpidThread IDStatusLoginNameIOCPUMemUsage
111228backgroundsa000
212572backgroundsa000
312576backgroundsa000
411884backgroundsa000
512964backgroundsa000
612960backgroundsa004
712968backgroundsa000

Detailed information about the schedulers can be found also via the sys.dm_os_threads DMV.

IO Statistics

By providing 'iostats' as parameter, the utility returns as result a count of the outstanding reads, respectively writes:

--  IO statistics
DBCC SQLPERF(iostats)

Output:
StatisticValue
Reads Outstanding0
Writes Outstanding0

Network Statistics

By providing 'netstats' as parameter, the utility returns as result network-related statistics:

--  network statistics
DBCC SQLPERF(netstats)

Output:
StatisticValue
Network Reads6976
Network Writes9036
Network Bytes Read5318957
Network Bytes Written2,222512E+07
Command Queue Length0
Max Command Queue Length0
Worker Threads0
Max Worker Threads0
Network Threads0
Max Network Threads0

Read Ahead Statistics

By providing 'rastats' as parameter, the utility returns read-ahead statistics:

--  read ahead statistics
DBCC SQLPERF(rastats)

Output
StatisticValue
RA Pages Found in Cache0
RA Pages Placed in Cache0
RA Physical IO0
Used Slots0

Spinlock Statistics

By providing 'spinlockstats' as parameter, the utility returns the spinlock statistics, where a spinlock is a a lightweight synchronization object used to serialize access to data structures which are typically held for a short period of time:

--  spinlock statistics
DBCC SQLPERF(spinlockstats)

Output:
Spinlock NameCollisionsSpinsSpins/CollisionSleep Time (ms)Backoffs
LOCK_RW_TEST00000
LOCK_RW_SECURITY_CACHE711210500296,0619054
LOCK_RW_CMED_HASH_SET5175035001
LOCK_RW_ABTX_HASH_SET00000
LOCK_RW_RBIO_REQ00000

Detailed information about the spinlock stats can be found also via the sys.dm_os_spinlock_stats DMV.

Wait Statistics

By providing 'waitstats' as parameter, the utility returns the available wait statistics:

-- wait statistics 
DBCC SQLPERF(waitstats)

Output (only a few records):
Wait TypeRequestsWait TimeSignal Wait Time
LCK_M_SCH_M185973
LCK_M_S13168950
PAGEIOLATCH_SH17893856825
PAGEIOLATCH_UP1181010
PAGEIOLATCH_EX7361049016

Detailed information about the wait stats can be found also via sys.dm_os_wait_stats DMV.

Notes:
As Microsoft warns, the undocumented features shouldn't be used into production environments as they will be deprecated in future versions. Instead should be used the documented DMVs, when available. 
All objects mentioned above require VIEW SERVER STATE permissions.
The DBCC SQLPERF utility allows resetting the latch, spinlock, respectively the wait statistics by providing the following parameters (see the SQL Docs for more information): 

-- resetting the latch statistics
DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR)

-- resetting the spinlock statistics
DBCC SQLPERF ('sys.dm_os_spinlock_stats', CLEAR);  

-- resetting the wait statistics
DBCC SQLPERF('sys.dm_os_latch_stats', CLEAR)

DBCC PERFMON provides in a single call the IO, network, read ahead, spinlocks, respectively the wait statistics in distinct resultsets:

-- performance statistics
DBCC PERFMON

03 January 2020

Data Management: Data Literacy (Part I: A Second Language)

Data Management

At the Gartner Data & Analytics Summit that took place in 2018 in Grapevine, Texas, it was reiterated the importance of data literacy for taking advantage of the emergence of data analytics, artificial intelligence (AI) and machine learning (ML) technologies. Gartner expected then that by 2020, 80% of organizations will initiate deliberate competency development in the field of data literacy [1] – or how they put it – learning to ‘speak data’ as a ‘second language’.

Data literacy is typically defined as the ability to read, work with, analyze, and argue with data. Sure, these form the blocks of data literacy, though what I’m missing from this definition is the ability to understand the data, even if understanding should be the outcome of reading, and the ability to put data into the context of business problems, even if the analyzes of data could involve this later aspect too.

Understanding has several aspects: understanding the data structures available within an organization, understanding the problems with data (including quality, governance, privacy and security), respectively understanding how the data are linked to the business processes. These aspects go beyond the simple ability included in the above definition, which from my perspective doesn’t include the particularities of an organization (data structure, data quality and processes) – the business component. This is reflected in one of the problems often met in the BI/data analytics industry – the solutions developed by the various service providers don’t reflect organizations’ needs, one of the causes being the inability to understand the business on segments or holistically.  

Putting data into context means being able to use the respective data in answering stringent business problems. A business problem needs to be first correctly defined and this requires a deep understanding of the business. Then one needs to identify the data that could help finding the answers to the problem, respectively of building one or more models that would allow elaborating further theories and performing further simulations. This is an ongoing process in which the models built are further enhanced, when possible, or replaced by better ones.

Probably the comparison with a second language is only partially true. One can learn a second language and argue in the respective language, though it doesn’t mean that the argumentations will be correct or constructive as long the person can’t do the same in the native language. Moreover, one can have such abilities in the native or a secondary language, but not be able do the same in what concerns the data, as different skillsets are involved. This aspect can make quite a difference in a business scenario. One must be able also to philosophize, think critically, as well to understand the forms of communication and their rules in respect to data.

To philosophize means being able to understand the causality and further relations existing within the business and think critically about them. Being able to communicate means more than being able to argue – it means being able to use effectively the communication tools – communication channels, as well the methods of representing data, information and knowledge. In extremis one might even go beyond the basic statistical tools, stepping thus in what statistical literacy is about. In fact, the difference between the two types of literacy became thinner, the difference residing in the accent put on their specific aspects.

These are the areas which probably many professionals lack. Data literacy should be the aim, however this takes time and is a continuous iterative process that can take years to reach maturity. It’s important for organizations to start addressing these aspects, progress in small increments and learn from the experience accumulated.

Previous Post <<||>> Next Post

References:
[1] Gartner (2018) How data and analytics leaders learn to master information as a second language, by Christy Pettey (link
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.