17 May 2020

💎SQL Reloaded: Query Patterns in SQL Server (Part III: Aggregate Functions)

As their names denote, aggregate functions are used to summarize information across the whole dataset or within groups defined by one or more attributes. The following queries exemplify the use of aggregates based on the tables created in a previous post:

-- aggregates over the whole table 
SELECT COUNT(*) NoRecords 
, MIN(StartDate) StartDate
, MAX(EndDate) EndDate
, DateDiff(d, MIN(StartDate), MAX(EndDate)) [DiffDays]
FROM dbo.T_Allocations A

-- aggregates based on several attributes 
 SELECT A.CourseId
 , A.StudentId 
 , COUNT(*) NoRecords 
 , MIN(StartDate) StartDate
 , MAX(EndDate) EndDate
 , DateDiff(d, MIN(StartDate), MAX(EndDate)) [DiffDays]
 FROM dbo.T_Allocations A
GROUP BY A.CourseId
 , A.StudentId 
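
As a small extension (a hedged sketch), the groups built via GROUP BY can be filtered with the HAVING clause, which is used further below for identifying duplicates:

-- aggregates with groups filtered via HAVING (sketch)
 SELECT A.CourseId
 , COUNT(*) NoRecords 
 FROM dbo.T_Allocations A
 GROUP BY A.CourseId
 HAVING COUNT(*) > 1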

Correlated subqueries can be of help either inside a SELECT clause or within a CROSS APPLY, respectively an OUTER APPLY: 

-- correlated subquery within a SELECT 
 SELECT *
 , (SELECT COUNT(*) NoRecords 
     FROM dbo.T_Allocations A
     WHERE C.CourseId = A.CourseId
     GROUP BY A.CourseId) NoRecords 
 FROM dbo.T_Courses C


 -- CROSS APPLY (correlated subquery)
 SELECT *
 , A.NoRecords
 , A.StartDate
 FROM dbo.T_Courses C
      CROSS APPLY (
         SELECT A.CourseId
         , COUNT(*) NoRecords 
         , MIN(StartDate) StartDate
         FROM dbo.T_Allocations A
         WHERE C.CourseId = A.CourseId
         GROUP BY A.CourseId
     ) A
 ORDER BY C.CourseName

-- OUTER APPLY (correlated subquery)
 SELECT *
 , A.NoRecords
 , A.StartDate
 FROM dbo.T_Courses C
      OUTER APPLY (
         SELECT A.CourseId
         , COUNT(*) NoRecords 
         , MIN(StartDate) StartDate
         FROM dbo.T_Allocations A
         WHERE C.CourseId = A.CourseId
         GROUP BY A.CourseId
     ) A
 ORDER BY C.CourseName

Aggregates are useful in a series of techniques like the listing of duplicates:

-- identifying duplicates via an aggregate 
SELECT A.*
FROM dbo.T_Allocations A
     JOIN (
  SELECT A.CourseId
  , A.StudentId 
  , COUNT(*) NoRecords 
  FROM dbo.T_Allocations A
  GROUP BY A.CourseId
  , A.StudentId 
  HAVING COUNT(*)>1
  ) DUP
   ON A.CourseId = DUP.CourseId 
  AND A.StudentId = DUP.Studentid
ORDER BY A.CourseId
, A.StudentId 

A similar technique can be used to remove the duplicates from the base table:

-- removing the duplicates except the last record  
 DELETE dbo.T_Allocations 
 FROM (
 SELECT A.StudentId 
 , A.CourseId
 , A.StartDate
 , A.EndDate
 , Max(A.AllocationId) AllocationId
 FROM dbo.T_Allocations A
 GROUP BY A.StudentId 
 , A.CourseId
 , A.StartDate
 , A.EndDate
 HAVING count(*)>1
  ) A
WHERE dbo.T_Allocations.CourseId = A.CourseId
  AND dbo.T_Allocations.StudentId = A.StudentId
  AND dbo.T_Allocations.StartDate = A.StartDate
  AND dbo.T_Allocations.EndDate = A.EndDate
  AND dbo.T_Allocations.AllocationId <> A.AllocationId
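
An equivalent approach, included here as a hedged sketch of an alternative not covered above, uses a ROW_NUMBER-based common table expression to keep only the record with the highest AllocationId per group:

-- removing the duplicates except the last record via ROW_NUMBER (sketch)
 WITH Ranked AS (
     SELECT AllocationId
     , ROW_NUMBER() OVER (
         PARTITION BY StudentId, CourseId, StartDate, EndDate
         ORDER BY AllocationId DESC) Ranking -- last record per group gets 1
     FROM dbo.T_Allocations
 )
 DELETE FROM Ranked
 WHERE Ranking > 1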

When the number of values is fixed and unique within a grouping, one can use aggregate functions to display the values within a matrix:

-- Matrix display via aggregates 
SELECT S.StudentId 
, S.StudentName 
, CASE WHEN A.Course1 = 1 THEN 'x' ELSE '-' END Course1
, CASE WHEN A.Course2 = 1 THEN 'x' ELSE '-' END Course2
, CASE WHEN A.Course3 = 1 THEN 'x' ELSE '-' END Course3
, CASE WHEN A.Course4 = 1 THEN 'x' ELSE '-' END Course4
, CASE WHEN A.Course5 = 1 THEN 'x' ELSE '-' END Course5
FROM dbo.T_Students S
     LEFT JOIN (
  SELECT A.StudentId 
  , Max(CASE WHEN A.CourseId = 1 THEN 1 ELSE 0 END) Course1
  , Max(CASE WHEN A.CourseId = 2 THEN 1 ELSE 0 END) Course2
  , Max(CASE WHEN A.CourseId = 3 THEN 1 ELSE 0 END) Course3
  , Max(CASE WHEN A.CourseId = 4 THEN 1 ELSE 0 END) Course4
  , Max(CASE WHEN A.CourseId = 5 THEN 1 ELSE 0 END) Course5
  FROM dbo.T_Allocations A
  GROUP BY A.StudentId
    ) A
   ON S.StudentId = A.StudentId 
ORDER BY S.StudentName 
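
The same matrix can also be sketched via the PIVOT operator; below is a minimal version without the join to the Students table, in which the counts per CourseId are spread into columns:

-- matrix display via PIVOT (sketch)
 SELECT P.StudentId
 , CASE WHEN P.[1] > 0 THEN 'x' ELSE '-' END Course1
 , CASE WHEN P.[2] > 0 THEN 'x' ELSE '-' END Course2
 , CASE WHEN P.[3] > 0 THEN 'x' ELSE '-' END Course3
 , CASE WHEN P.[4] > 0 THEN 'x' ELSE '-' END Course4
 , CASE WHEN P.[5] > 0 THEN 'x' ELSE '-' END Course5
 FROM (
     SELECT A.StudentId
     , A.CourseId
     , A.AllocationId
     FROM dbo.T_Allocations A
    ) A
    PIVOT (COUNT(AllocationId) FOR CourseId IN ([1],[2],[3],[4],[5])) P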

16 May 2020

💎SQL Reloaded: Query Patterns in SQL Server (Part II: Joins and Subqueries)

The basis for being able to manipulate data via SQL scripting is a good knowledge of joins and subqueries, used as fits the purpose and the data model. The following scripts are based on the tables created in a previous post.

-- INNER JOIN
SELECT *
FROM dbo.T_Courses C
     JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId
 ORDER BY C.CourseName
 , S.StudentName 

-- INNER JOIN (deprecated syntax)
 SELECT *
 FROM dbo.T_Courses C
 , dbo.T_Students S
 WHERE C.CourseId = S.CourseId
 ORDER BY C.CourseName
 , S.StudentName 

-- LEFT JOIN (courses without students)
 SELECT *
 FROM dbo.T_Courses C
      LEFT JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId 
 WHERE S.CourseId IS NULL
 ORDER BY C.CourseName
 , S.StudentName 

-- RIGHT JOIN (students without a current course)
 SELECT *
 FROM dbo.T_Courses C
      RIGHT JOIN dbo.T_Students S
       ON C.CourseId = S.CourseId 
 WHERE C.CourseId IS NULL
 ORDER BY C.CourseName
 , S.StudentName 

-- FULL OUTER JOIN
 SELECT *
 FROM dbo.T_Courses C
      FULL OUTER JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId 
 --WHERE C.CourseId IS NULL
 --WHERE S.CourseId IS NULL
 --WHERE (C.CourseId IS NULL OR S.StudentId IS NULL)
 --WHERE (C.CourseId IS NULL AND S.StudentId IS NULL)
 ORDER BY C.CourseName
 , S.StudentName 

The IN, NOT IN, EXISTS and NOT EXISTS operators allow using (correlated) subqueries, their use being indicated when no actual data are needed from the tables involved in the subqueries:

-- EXISTS (correlated subquery)
SELECT *
FROM dbo.T_Courses C
WHERE EXISTS (SELECT StudentId 
        FROM dbo.T_Students S
        WHERE C.CourseId = S.CourseId)
ORDER BY C.CourseName

-- NOT EXISTS (correlated subquery)
 SELECT *
 FROM dbo.T_Courses C
 WHERE NOT EXISTS (SELECT StudentId 
        FROM dbo.T_Students S
        WHERE C.CourseId = S.CourseId)
 ORDER BY C.CourseName

-- IN (subquery)
 SELECT *
 FROM dbo.T_Courses C
 WHERE CourseId IN (SELECT CourseId 
        FROM dbo.T_Students S)
 ORDER BY C.CourseName
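
A hedged sketch for the NOT IN operator mentioned above, with one caution: because CourseId is nullable in T_Students, the NULLs must be filtered out in the subquery, otherwise NOT IN returns no rows at all when the subquery yields a NULL:

-- NOT IN (subquery)
 SELECT *
 FROM dbo.T_Courses C
 WHERE CourseId NOT IN (SELECT CourseId 
        FROM dbo.T_Students S
        WHERE CourseId IS NOT NULL)
 ORDER BY C.CourseName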

Joining multiple tables is done using the same principles as above:

-- joins with more tables 
SELECT A.CourseId 
, C.CourseName 
, A.StudentId 
, S.StudentName 
 , A.StartDate 
 , A.EndDate 
FROM dbo.T_Allocations A
      JOIN dbo.T_Courses C
        ON A.CourseId = C.CourseId 
      JOIN dbo.T_Students S
        ON A.StudentId = S.StudentId 
 ORDER BY C.CourseName 
 , S.StudentName 

One can obtain the same result via correlated subqueries (a technique often met among Oracle developers). For readability reasons I avoid writing such queries, unless there’s a special purpose to do so.

-- correlated subquery for individual values
SELECT A.CourseId 
, (SELECT C.CourseName 
    FROM dbo.T_Courses C
    WHERE A.CourseId = C.CourseId) CourseName
 , A.StudentId 
 , (SELECT S.StudentName 
    FROM dbo.T_Students S
    WHERE A.StudentId = S.StudentId) StudentName
 , A.StartDate 
 , A.EndDate 
FROM dbo.T_Allocations A
 ORDER BY CourseName 
 , StudentName 

When displaying values within a SELECT via correlated subqueries, some developers feel the need to use the MAX or MIN functions to make sure that only one value is returned. For data analysis this may be acceptable; however, if the data model imposes it, then a redesign of the solution is likely necessary.

-- correlated subquery for individual values
 SELECT A.CourseId 
 , (SELECT Max(C.CourseName)
    FROM dbo.T_Courses C
    WHERE A.CourseId = C.CourseId) CourseName
 , A.StudentId 
 , (SELECT Max(S.StudentName)
    FROM dbo.T_Students S
    WHERE A.StudentId = S.StudentId) StudentName
 , A.StartDate 
 , A.EndDate 
 FROM dbo.T_Allocations A
 ORDER BY CourseName 
 , StudentName 

Another technique that is not recommended is retrieving one or more attributes from the same table, with the same conditions, via individual correlated subqueries, as the same condition gets evaluated once per attribute. The use of aggregate functions is more appropriate, however mainly for numerical or date values; a sketch of an alternative is shown below.
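
As a hedged sketch of the alternative, a single correlated OUTER APPLY retrieves all the needed attributes at once, instead of one subquery per attribute:

-- several attributes via a single correlated OUTER APPLY (sketch)
 SELECT A.CourseId 
 , C.CourseName
 , A.StudentId 
 , A.StartDate 
 , A.EndDate 
 FROM dbo.T_Allocations A
      OUTER APPLY (
         SELECT C.CourseName
         FROM dbo.T_Courses C
         WHERE A.CourseId = C.CourseId
     ) C
 ORDER BY C.CourseName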

💎SQL Reloaded: Query Patterns in SQL Server (Inserts)

Independently of whether one is a database developer, an administrator, or any other professional in the data area, an important part of the job is being able to manipulate data on the fly. Since the availability of DTS (Data Transformation Services), replaced with SQL Server 2005 by SSIS (SQL Server Integration Services), the use of packages became one of the simplest and fastest ways of importing data into a database. It still is, however, when dealing with small volumes of data, it’s easier to import them via a few SQL scripts, like in the examples below.
The old-fashioned way of loading data is by first defining the target table based on the source’s definition:

-- Course data 
--DROP TABLE IF EXISTS dbo.T_Courses 

CREATE TABLE [dbo].[T_Courses](
     [CourseId] [int] NOT NULL,
     [CourseName] [varchar](50) NOT NULL
 ) ON [PRIMARY]

-- old-fashioned individual inserts (SQL Server 2000)
INSERT INTO dbo.T_Courses
VALUES (1, 'Course 1')

INSERT INTO dbo.T_Courses(CourseId, CourseName)
VALUES (2, 'Course 2')

-- inserting multiple rows at once via a row constructor (SQL Server 2008)
INSERT INTO dbo.T_Courses
--OUTPUT inserted.CourseId, inserted.CourseName -- optionally returns the inserted rows
VALUES (3, 'Course 3')
 , (4, 'Course 4')
 , (5, 'Course 5')
 , (6, 'Course 6')
 , (7, 'Course 7')

 SELECT *
 FROM dbo.T_Courses

Sometimes the definition of the table is not important. Starting with SQL Server 2008, the VALUES row constructor can be used as a derived table, which allows creating the table on the fly, and thus the inserts can be simplified as follows:

-- Course data 
--DROP TABLE IF EXISTS dbo.T_Courses 

SELECT *
INTO dbo.T_Courses
FROM (
  VALUES (1, 'Course 1')
 , (2, 'Course 2')
 , (3, 'Course 3')
 , (4, 'Course 4')
 , (5, 'Course 5')
 , (6, 'Course 6')
 , (7, 'Course 7')
 ) C (CourseId, CourseName)


 SELECT *
 FROM dbo.T_Courses

Now, to demonstrate further techniques, it is useful to define besides the Courses also a Students, respectively an Allocations table, the latter storing the relation between a Course and a Student. As can be seen, the Students table contains a reference to the Courses, showing the course in which the student is currently enrolled. The model is only for demonstration purposes:

-- Student data with current courses
-- DROP TABLE IF EXISTS dbo.T_Students

SELECT StudentId
 , StudentName
 , CourseId
 , Cast(StartDate as Date) StartDate
 , Cast(EndDate as Date) EndDate
 INTO dbo.T_Students
 FROM (
  VALUES (1, 'Student 1', 1, '20170201', '20170228')
 , (2, 'Student 2', 1, '20170201', '20170228')
 , (3, 'Student 3', 2, '20170201', '20170228')
 , (4, 'Student 4', 3, '20170201', '20170228')
 , (5, 'Student 5', NULL, NULL, NULL)
 , (6, 'Student 6', NULL, NULL, NULL)
 , (7, 'Student 7', NULL, NULL, NULL)
 ) S (StudentId, StudentName, CourseId, StartDate, EndDate)


 SELECT *
 FROM dbo.T_Students

-- Course allocations 
-- DROP TABLE IF EXISTS dbo.T_Allocations

SELECT CourseId
 , StudentId
 , Cast(StartDate as Date) StartDate
 , Cast(EndDate as Date) EndDate
 , Hours
 , Mark 
 INTO dbo.T_Allocations
 FROM (
  VALUES  (1, 1, '20170101', '20170131', 4, 1)
 , (1, 2, '20170101', '20170131', 5, 2)
 , (1, 3, '20170101', '20170131', 12, 3)
 , (2, 1, '20170201', '20170228', 23, 1)
 , (2, 3, '20170201', '20170228', 12, 3)
 , (2, 4, '20170201', '20170228', 21, 1)
 , (1, 5, '20170201', '20170228', 12, 1)
 , (4, 1, '20170101', '20170131', 5, 2)
 , (4, 1, '20170101', '20170131', 4, 2)
 ) A(CourseId, StudentId, StartDate, EndDate, Hours, Mark)

ALTER TABLE dbo.T_Allocations 
ADD AllocationId int IDENTITY(1,1) 

SELECT *
FROM dbo.T_Allocations

With this, the data model is ready for use. By the initial design, the courses in which the Students are currently enrolled were not supposed to be in the Allocations table. The following query repairs this:

-- insert data from query
INSERT INTO dbo.T_Allocations (CourseId, StudentId, StartDate, EndDate, Hours, Mark)
SELECT C.CourseId
, C.StudentId
, C.StartDate
 , C.EndDate
 , 0 Hours
 , 0 Mark -- Mark was likely created NOT NULL by the SELECT ... INTO above, so a default value is provided
 FROM dbo.T_Students C
      LEFT JOIN dbo.T_Allocations A
        ON C.CourseId = A.CourseId
       AND C.StudentId = A.Studentid 
 WHERE A.CourseId IS NULL
   AND C.CourseId IS NOT NULL

If more data need to be appended, one can modify the corresponding query from above as follows:

-- insert data from values with check against existing data 
INSERT INTO dbo.T_Allocations
SELECT DAT.CourseId
 , DAT.StudentId
 , Cast(DAT.StartDate as Date) StartDate
 , Cast(DAT.EndDate as Date) EndDate
 , DAT.Hours
 , DAT.Mark 
 FROM (
  VALUES  (5, 5, '20170101', '20170131', 4, 2)
 , (5, 5, '20170101', '20170131', 5, 3)
 , (5, 5, '20170101', '20170131', 12, 4)
 , (4, 1, '20170101', '20170131', 4, 2) --already exists 
 ) DAT(CourseId, StudentId, StartDate, EndDate, Hours, Mark)
   LEFT JOIN dbo.T_Allocations A
     ON DAT.CourseId = A.CourseId
    AND DAT.StudentId = A.Studentid 
 WHERE A.CourseId IS NULL

Starting with the 2008 version, the MERGE statement was introduced in SQL Server, which allows combining within the same statement the behavior for INSERTs, UPDATEs or DELETEs. The above insert based on a MERGE becomes:

-- insert data via a merge
 MERGE INTO dbo.T_Allocations A USING
 (
     SELECT DAT.CourseId
     , DAT.StudentId
     , Cast(DAT.StartDate as Date) StartDate
     , Cast(DAT.EndDate as Date) EndDate
     , DAT.Hours
     , DAT.Mark 
     FROM (
      VALUES  (5, 5, '20170101', '20170131', 4, 2)
     , (5, 5, '20170101', '20170131', 5, 3)
     , (5, 5, '20170101', '20170131', 12, 4)
     , (4, 1, '20170101', '20170131', 4, 2) --already exists 
     ) DAT(CourseId, StudentId, StartDate, EndDate, Hours, Mark)
 ) DAT
   ON DAT.CourseId = A.CourseId
  AND DAT.StudentId = A.Studentid 
 WHEN NOT MATCHED THEN 
 INSERT (CourseId, StudentId, StartDate, EndDate, Hours, Mark)
 VALUES (DAT.CourseId, DAT.StudentId, DAT.StartDate, DAT.EndDate, DAT.Hours, DAT.Mark);
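
As a side note, the OUTPUT clause commented out in one of the earlier inserts works with MERGE as well. Below is a minimal sketch with a hypothetical new allocation (Student 6 on Course 5, with invented values); the clause lists the action taken for each source row:

-- insert via a merge with OUTPUT (sketch)
 MERGE INTO dbo.T_Allocations A USING
 (
     SELECT 5 CourseId
     , 6 StudentId
     , Cast('20170101' as Date) StartDate
     , Cast('20170131' as Date) EndDate
     , 4 Hours
     , 2 Mark
 ) DAT
   ON DAT.CourseId = A.CourseId
  AND DAT.StudentId = A.StudentId
 WHEN NOT MATCHED THEN 
 INSERT (CourseId, StudentId, StartDate, EndDate, Hours, Mark)
 VALUES (DAT.CourseId, DAT.StudentId, DAT.StartDate, DAT.EndDate, DAT.Hours, DAT.Mark)
 OUTPUT $action, inserted.AllocationId, inserted.CourseId, inserted.StudentId;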

These are the types of scripts I typically use to insert data into a database, independently of the tables’ size. For more, check Microsoft’s documentation on INSERT. The preparation of the data to be imported can be easily done in Excel with a few macros.

01 February 2020

#️⃣☯Software Engineering: Concept Documents (The Good, the Bad and the Ugly)

Software Engineering

A concept document (simply a concept) is a document that describes at a high level the set of necessary steps and their implications in order to achieve a desired result, typically making the object of a project. In other words, it describes how something can be done or achieved, respectively how a problem can be solved.

The Good: The main aim of the document is to present all the important aspects and to assure that the idea is worthy of consideration, that the steps considered provide a good basis for further work, respectively to provide a good understanding for the various parties involved. Therefore, concepts are used as a basis for the sign-off, respectively for the implementation of software and hardware solutions.

A concept provides information about the context, design, architecture, security, usage, purpose and/or objectives of the future solution, together with the set of assumptions, constraints and implications. A concept is not necessarily a recipe, because it attempts to provide a solution for a given problem or situation. Even if it bears many similarities in content and structure, a concept is also not a strategy, because a strategy offers an interpretation of the problem, nor a business case, because the latter focuses mainly on the financial aspects.

A concept thus proves to be a good basis for implementing the described solution, being often an important enabler. On the other hand, a written concept is not always necessary, even if the conceptualization must exist in the implementers’ heads.

The Bad: For these reasons, projects often require the elaboration of a concept before further work can be attempted. To write such a document one needs to understand the problem or situation and be capable of sketching a solution in which the various steps or components fit together like the pieces of a puzzle. The problem is that the more complex the problem to be solved, the fuzzier the view and understanding of the various pieces becomes, respectively, the more challenging it becomes to fit the pieces together. In certain situations, it becomes almost impossible for a single person to understand and handle all the pieces. Solving the puzzle becomes a collective effort in which the complexity is broken into manageable parts, to the detriment of other aspects.

Writing a concept is a time-consuming task. The more accuracy and detail are needed, the longer it takes to write and review the document, time that’s usually stolen from other project phases, especially when the phases are considered sequential. Writing a concept that covers only 80% of the facts takes about 20% of the effort needed for a ‘perfect’ concept, while covering the remaining 20% of the facts takes the other 80% of the effort, as the latter involves multiple iterations. In extremis, aiming for perfection will make one start the implementation late or not at all. It’s an incomprehensible pedantry with an important impact on a project’s timeline, in the hope of a quality increase that is sometimes even illusory.

The Ugly: The concept-based approach is brought to the extreme in ERP implementations, where a concept needs to be written for each process or business area, often carrying fancy names – solution design document, technical design document, business process document, etc. Independently of how it is called, the purpose is to describe how the solution is implemented. The problem is that the conceptualization phase tends to take much longer than planned, given the dependencies between the various business areas in terms of functionality and activities. The complexity can become overwhelming, with an important impact on the project’s budget, time and quality.

31 January 2020

💫🧮☯ERP: Microsoft Dynamics 365 (The Good, the Bad and the Ugly)

ERP Implementation

The Good: The shift made by Microsoft in porting their Dynamics AX ERP solution to a web-based application (aka D365) hosted in the Microsoft cloud offered them a boost on the ERP market. The integration with the Office and BI stack, as well as Microsoft’s One Version strategy of pushing new features almost on a monthly basis and of having customers at a maximum of 2 releases from the current version, makes D365 a solution to consider for small to big organizations across business sectors and geographies.

The Bad: Currently the monthly release cycle seems to be a challenge for customers and service providers alike. Even if the changes in existing functionality are minor, and the functionality is thoroughly tested before releases, the customers still need to test the releases in several systems, especially to assure that the customizations and integrations still work. This can prove to be quite a challenge, in which automatic or semiautomatic tools can help when adequately used. Even then, a considerable effort needs to be invested by the parties involved.

The burden is bigger for the service providers that build their own solutions for D365, as they need to assure in advance that the applications still work after each release. From the customers’ perspective, the more such applications they use, the higher the risks of delays in adopting a release or, in extremis, of having to look for similar solutions. In theory, with good planning and by following best practices the risks are small, though that’s just the theory speaking.

If in the past 2-3 instances were enough to support the ERP during and post implementation, currently the requirements for the cloud-based solution have more than doubled, an organization ending up renting 5-7 D365 instances for the same purpose. Moreover, even if the split between the main blocks (Finance, Supply Chain, Retail and Talent), plus the various Customer Engagement packages, provides some flexibility when they are combined, this leads to a considerable price increase. Further costs are related to the gaps existing in the available functionality. More likely Microsoft will attempt closing some of the gaps, however until then the customers are forced to opt for existing solutions or have the functionality built. Microsoft claims that their cloud-based ERP solution provides lower ownership costs, however, looking at the prices, it’s questionable whether D365 is affordable for small and average organizations. To put it bluntly – think how many socks (aka products) one needs to sell just to cover the implementation, licensing and infrastructure costs!

One important decision taken by Microsoft was to not allow direct access to the D365 production database, a decision that limits an organization’s choices and flexibility in addressing reporting requirements. Of course, the existing BI infrastructure can still be leveraged with a few workarounds, though the flexibility is lost, while further challenges are involved.

The Ugly: ERP implementations based on D365 are no exception to the general trend – given their complexity, they are predisposed to fail in achieving the set objectives, and this despite Microsoft’s attempts at creating methodologies, tools and strong communities to support service providers and customers in such projects. The reasons for failure reside with customers and service providers alike, the chains of implications forming a complex network of causalities with multiple levels of reinforcement. When the negative reinforcements break the balance, a tipping point can result where things start to go wrong – escalations, finger-pointing, teams’ restructuring, litigations, etc. In extremis, even if the project reaches the finish line, the costs can easily reach an overrun of 50-150% of the initial estimation, and that’s a lot to bear.

30 January 2020

💼☯Project Management: Methodologies (The Good, the Bad and the Ugly)

Mismanagement

The Good: Nowadays there are several Project Management (PM) methodologies to choose from to address a project’s specifics and, when adapted and applied accordingly, a methodology can enable projects to be run and brought under control.

The Bad: Even if the theoretical basis of PM methodologies has been proven and perfected over the years, projects continue to fail at a disturbing rate. Of course, the reasons behind their failure are multiple, though often they are rooted in how PM methodologies are taught, understood and implemented.

Just as a theoretical course in cooking won’t make one a good cook, a theoretical course in PM won’t make one a good Project Manager or a knowledgeable team member in applying the learned methodology. Surprisingly, the expectation is exactly that – the team member got a training and is good to go. Moreover, people believe that managing a software project is like coordinating the building of a small treehouse. To some degree there are many similarities, though the challenges typically lie in the details, and these details often escape a standard course.

To bridge the gap between theory and practice, the learner needs time to grow into the role, to learn the dos and don’ts and, most important, to learn how to use the tools at hand efficiently. The methodology is itself a tool that makes use of further tools in its processes – project plans, work breakdown structures, checklists, charters, reports, records, etc. These can be learned only through practice, hopefully with some help (aka mentoring) from a person experienced in the respective methodology, be it the Project Manager, a trainer or another team member. Just as one can’t be thrown into the water and expected to swim across the English Channel, one can’t do that with a newbie either.

There’s a natural fallacy to think that we’ve understood more than we have. We can observe our understanding’s limits when we are confronted with the complexities involved in handling PM activities. A second fallacy is not believing other people’s warnings against using a tool or performing an activity in a certain way. A newbie’s mind sometimes has the predisposition of a child to touch a hot stove even if warned against it. It’s part of the learning process, though some persist in such behavior without learning much. What’s even more dangerous is a newbie pretending to be an expert, and this almost always ends badly.

The Ugly appears when the bad is brought to the extreme, when methodologies are misused for the wrong purposes to the degree that they destroy anything in their way. Of course, a pool can be dug using a spoon, but does it make sense to do that? Just because a tool can be used for something doesn’t mean it should be, as long as there are better tools for the same. It seems a pretty logical thing, though the contrary happens more often than we’d like. It starts with the preconception that one should use the tool one knows best, ignoring in the process the fit-for-purpose condition. What’s even more deplorable is breaking down a project to fit a methodology while ignoring the technical and logistical aspects.

Any tool can lead to damage when used excessively, in the wrong places, at the wrong point in time or by the wrong person. Like the instruments in an orchestra, when an instrument plays the wrong note, it clashes with the rest. When more instruments play wrongly, the piece becomes unrecognizable. It’s the role of the conductor to make the players play the right notes at the right time.

03 January 2020

🗄️Data Management: Data Literacy (Part I: A Second Language)

Data Management

At the Gartner Data & Analytics Summit that took place in 2018 in Grapevine, Texas, the importance of data literacy was reiterated for taking advantage of the emerging data analytics, artificial intelligence (AI) and machine learning (ML) technologies. Gartner expected then that by 2020, 80% of organizations would initiate deliberate competency development in the field of data literacy [1] – or, as they put it, learning to ‘speak data’ as a ‘second language’.

Data literacy is typically defined as the ability to read, work with, analyze, and argue with data. Sure, these form the building blocks of data literacy, though what I’m missing from this definition is the ability to understand the data, even if understanding should be the outcome of reading, and the ability to put data into the context of business problems, even if the analysis of data could involve this latter aspect too.

Understanding has several aspects: understanding the data structures available within an organization, understanding the problems with data (including quality, governance, privacy and security), respectively understanding how the data are linked to the business processes. These aspects go beyond the simple abilities included in the above definition, which from my perspective doesn’t cover the particularities of an organization (data structures, data quality and processes) – the business component. This is reflected in one of the problems often met in the BI/data analytics industry – the solutions developed by the various service providers don’t reflect the organizations’ needs, one of the causes being the inability to understand the business in segments or holistically.

Putting data into context means being able to use the respective data in answering pressing business problems. A business problem needs first to be correctly defined, and this requires a deep understanding of the business. Then one needs to identify the data that could help find the answers to the problem, respectively to build one or more models that allow elaborating further theories and performing further simulations. This is an ongoing process in which the models built are further enhanced, when possible, or replaced by better ones.

Probably the comparison with a second language is only partially true. One can learn a second language and argue in it, though it doesn’t mean that the argumentation will be correct or constructive as long as the person can’t do the same in the native language. Moreover, one can have such abilities in the native or a secondary language, but not be able to do the same with data, as different skillsets are involved. This aspect can make quite a difference in a business scenario. One must also be able to philosophize and think critically, as well as to understand the forms of communication and their rules in respect to data.

To philosophize means being able to understand the causality and further relations existing within the business and to think critically about them. Being able to communicate means more than being able to argue – it means being able to use the communication tools effectively – the communication channels, as well as the methods of representing data, information and knowledge. In extremis, one might even go beyond the basic statistical tools, stepping thus into what statistical literacy is about. In fact, the difference between the two types of literacy has become thinner, residing mainly in the emphasis put on their specific aspects.

These are areas in which many professionals are probably lacking. Data literacy should be the aim, however this takes time and is a continuous iterative process that can take years to reach maturity. It’s important for organizations to start addressing these aspects, progress in small increments and learn from the experience accumulated.

References:
[1] Gartner (2018) How data and analytics leaders learn to master information as a second language, by Christy Pettey (link)