16 May 2020

Query Patterns in SQL Server: Joins and Subqueries

The basis for manipulating data via SQL scripting is a good knowledge of joins and subqueries, applied as fits the purpose and the data model. The following scripts are based on the tables created in a previous post.

-- INNER JOIN
SELECT *
FROM dbo.T_Courses C
     JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId
 ORDER BY C.CourseName
 , S.StudentName 

-- INNER JOIN (deprecated comma syntax)
 SELECT *
 FROM dbo.T_Courses C
 , dbo.T_Students S
 WHERE C.CourseId = S.CourseId
 ORDER BY C.CourseName
 , S.StudentName 

-- LEFT JOIN
 SELECT *
 FROM dbo.T_Courses C
      LEFT JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId 
 WHERE S.CourseId IS NULL
 ORDER BY C.CourseName
 , S.StudentName 

-- RIGHT JOIN
 SELECT *
 FROM dbo.T_Courses C
      RIGHT JOIN dbo.T_Students S
       ON C.CourseId = S.CourseId 
 WHERE C.CourseId IS NULL
 ORDER BY C.CourseName
 , S.StudentName 

-- FULL OUTER JOIN
 SELECT *
 FROM dbo.T_Courses C
      FULL OUTER JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId 
 --WHERE C.CourseId IS NULL
 --WHERE S.CourseId IS NULL
 --WHERE (C.CourseId IS NULL OR S.StudentId IS NULL)
 --WHERE (C.CourseId IS NULL AND S.StudentId IS NULL)
 ORDER BY C.CourseName
 , S.StudentName 

IN, NOT IN, EXISTS and NOT EXISTS allow using (correlated) subqueries, their use being indicated when no actual data are needed from the tables involved in the subqueries:

-- EXISTS (correlated subquery)
SELECT *
FROM dbo.T_Courses C
WHERE EXISTS (SELECT StudentId 
        FROM dbo.T_Students S
        WHERE C.CourseId = S.CourseId)
ORDER BY C.CourseName

-- NOT EXISTS (correlated subquery)
 SELECT *
 FROM dbo.T_Courses C
 WHERE NOT EXISTS (SELECT StudentId 
        FROM dbo.T_Students S
        WHERE C.CourseId = S.CourseId)
 ORDER BY C.CourseName

-- IN (subquery)
 SELECT *
 FROM dbo.T_Courses C
 WHERE CourseId IN (SELECT CourseId 
        FROM dbo.T_Students S)
 ORDER BY C.CourseName
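
Since NOT IN was mentioned above but not illustrated, a minimal sketch based on the same tables follows; note that NOT IN returns no rows if the subquery yields NULL values (as T_Students.CourseId can here), hence the filter on NULLs:

-- NOT IN (subquery), excluding NULLs returned by the subquery
 SELECT *
 FROM dbo.T_Courses C
 WHERE CourseId NOT IN (SELECT CourseId 
        FROM dbo.T_Students S
        WHERE S.CourseId IS NOT NULL)
 ORDER BY C.CourseName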

Joining multiple tables follows the same principles as above:

-- joins with more tables 
SELECT A.CourseId 
, C.CourseName 
, A.StudentId 
, S.StudentName 
 , A.StartDate 
 , A.EndDate 
FROM dbo.T_Allocations A
      JOIN dbo.T_Courses C
        ON A.CourseId = C.CourseId 
      JOIN dbo.T_Students S
        ON A.StudentId = S.StudentId 
 ORDER BY C.CourseName 
 , S.StudentName 

One can obtain the same result via correlated subqueries (a technique often met among Oracle developers). For readability reasons I avoid writing such queries, unless there’s a special purpose for doing so.

-- correlated subquery for individual values
SELECT A.CourseId 
, (SELECT C.CourseName 
    FROM dbo.T_Courses C
    WHERE A.CourseId = C.CourseId) CourseName
 , A.StudentId 
 , (SELECT S.StudentName 
    FROM dbo.T_Students S
    WHERE A.StudentId = S.StudentId) StudentName
 , A.StartDate 
 , A.EndDate 
FROM dbo.T_Allocations A
 ORDER BY CourseName 
 , StudentName 

When returning values within a SELECT via correlated subqueries, some developers feel the need to use the MAX or MIN functions to make sure only one value is returned. For data analysis this may be acceptable; however, if the data model imposes it, then a redesign of the solution is more likely necessary.

-- correlated subquery for individual values
 SELECT A.CourseId 
 , (SELECT Max(C.CourseName)
    FROM dbo.T_Courses C
    WHERE A.CourseId = C.CourseId) CourseName
 , A.StudentId 
 , (SELECT Max(S.StudentName)
    FROM dbo.T_Students S
    WHERE A.StudentId = S.StudentId) StudentName
 , A.StartDate 
 , A.EndDate 
 FROM dbo.T_Allocations A
 ORDER BY CourseName 
 , StudentName 

Another technique that is not recommended is returning several attributes from the same table, under the same conditions, via individual correlated subqueries. Using aggregate functions in such subqueries is, however, more appropriate for numerical or date values, as sketched below.
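
As an illustration of the latter point, a minimal sketch based on the tables above (the aggregate and aliases are my choice, not from the original post):

-- correlated subquery with an aggregate over a date value
SELECT S.StudentId 
, S.StudentName 
, (SELECT Max(A.EndDate)
    FROM dbo.T_Allocations A
    WHERE A.StudentId = S.StudentId) LastEndDate
FROM dbo.T_Students S
ORDER BY S.StudentName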

Query Patterns in SQL Server: Inserts

Independently of whether one is a database developer, an administrator, or any other professional in the data area, being able to manipulate data on the fly is an important part of the job. Since the availability of DTS (Data Transformation Services), replaced in SQL Server 2005 by SSIS (SQL Server Integration Services), the use of packages became one of the simplest and fastest ways of importing data into a database. It still is; however, when dealing with small volumes of data, it’s easier to import the data via a few SQL scripts, as in the examples below.
The old-fashioned way of loading data is to define first the target table based on the source’s definition:

-- Course data 
--DROP TABLE IF EXISTS dbo.T_Courses 

CREATE TABLE [dbo].[T_Courses](
     [CourseId] [int] NOT NULL,
     [CourseName] [varchar](50) NOT NULL
 ) ON [PRIMARY]

-- old-fashioned individual inserts (SQL Server 2000)
INSERT INTO dbo.T_Courses
VALUES (1, 'Course 1')

INSERT INTO dbo.T_Courses(CourseId, CourseName)
VALUES (2, 'Course 2')

-- inserting multiple rows to a table at once (SQL Server 2005)
INSERT INTO dbo.T_Courses
--OUTPUT inserted.CourseId, inserted.CourseName -- 
VALUES (3, 'Course 3')
 , (4, 'Course 4')
 , (5, 'Course 5')
 , (6, 'Course 6')
 , (7, 'Course 7')

 SELECT *
 FROM dbo.T_Courses

Sometimes the definition of the table is not important. Starting with SQL Server 2008 it became possible to create the table on the fly, and thus the inserts can be simplified as follows:

-- Course data 
--DROP TABLE IF EXISTS dbo.T_Courses 

SELECT *
INTO dbo.T_Courses
FROM (
  VALUES (1, 'Course 1')
 , (2, 'Course 2')
 , (3, 'Course 3')
 , (4, 'Course 4')
 , (5, 'Course 5')
 , (6, 'Course 6')
 , (7, 'Course 7')
 ) C (CourseId, CourseName)


 SELECT *
 FROM dbo.T_Courses

Now, to demonstrate further techniques, it is useful to define besides the Courses table also a Students and an Allocations table, the latter storing the relation between a Course and a Student. As can be seen, the Students table contains a reference to the Courses table, showing the course in which the student is currently enrolled. The model is only for demonstration purposes:

-- Student data with current courses
-- DROP TABLE IF EXISTS dbo.T_Students

SELECT StudentId
 , StudentName
 , CourseId
 , Cast(StartDate as Date) StartDate
 , Cast(EndDate as Date) EndDate
 INTO dbo.T_Students
 FROM (
  VALUES (1, 'Student 1', 1, '20170201', '20170228')
 , (2, 'Student 2', 1, '20170201', '20170228')
 , (3, 'Student 3', 2, '20170201', '20170228')
 , (4, 'Student 4', 3, '20170201', '20170228')
 , (5, 'Student 5', NULL, NULL, NULL)
 , (6, 'Student 6', NULL, NULL, NULL)
 , (7, 'Student 7', NULL, NULL, NULL)
 ) S (StudentId, StudentName, CourseId, StartDate, EndDate)


 SELECT *
 FROM dbo.T_Students

-- Course allocations 
-- DROP TABLE IF EXISTS dbo.T_Allocations

SELECT CourseId
 , StudentId
 , Cast(StartDate as Date) StartDate
 , Cast(EndDate as Date) EndDate
 , Hours
 , Mark 
 INTO dbo.T_Allocations
 FROM (
  VALUES  (1, 1, '20170101', '20170131', 4, 1)
 , (1, 2, '20170101', '20170131', 5, 2)
 , (1, 3, '20170101', '20170131', 12, 3)
 , (2, 1, '20170201', '20170128', 23, 1)
 , (2, 3, '20170201', '20170128', 12, 3)
 , (2, 4, '20170201', '20170128', 21, 1)
 , (1, 5, '20170201', '20170128', 12, 1)
 , (4, 1, '20170101', '20170131', 5, 2)
 , (4, 1, '20170101', '20170131', 4, 2)
 ) A(CourseId, StudentId, StartDate, EndDate, Hours, Mark)

ALTER TABLE dbo.T_Allocations 
ADD AllocationId int IDENTITY(1,1) 

SELECT *
FROM dbo.T_Allocations

With this, the data model is ready for use. By the initial design, the courses in which the students are currently enrolled were not supposed to be in the Allocations table. The following query repairs this:

-- insert data from query
INSERT INTO dbo.T_Allocations (CourseId, StudentId, StartDate, EndDate, Hours)
SELECT C.CourseId
, C.StudentId
, C.StartDate
 , C.EndDate
 , 0 Hours
 FROM dbo.T_Students C
      LEFT JOIN dbo.T_Allocations A
        ON C.CourseId = A.CourseId
       AND C.StudentId = A.Studentid 
 WHERE A.CourseId IS NULL
   AND C.CourseId IS NOT NULL

If more data needs to be appended, one can modify the corresponding query from above as follows:

-- insert data from values with check against existing data 
INSERT INTO dbo.T_Allocations (CourseId, StudentId, StartDate, EndDate, Hours, Mark)
SELECT DAT.CourseId
 , DAT.StudentId
 , Cast(DAT.StartDate as Date) StartDate
 , Cast(DAT.EndDate as Date) EndDate
 , DAT.Hours
 , DAT.Mark 
 FROM (
  VALUES  (5, 5, '20170101', '20170131', 4, 2)
 , (5, 5, '20170101', '20170131', 5, 3)
 , (5, 5, '20170101', '20170131', 12, 4)
 , (4, 1, '20170101', '20170131', 4, 2) --already exists 
 ) DAT(CourseId, StudentId, StartDate, EndDate, Hours, Mark)
   LEFT JOIN dbo.T_Allocations A
     ON DAT.CourseId = A.CourseId
    AND DAT.StudentId = A.Studentid 
 WHERE A.CourseId IS NULL

Starting with the 2008 version, the MERGE statement was introduced in SQL Server, which allows adding further behavior for INSERTs, UPDATEs or DELETEs within the same statement. The above insert expressed as a MERGE becomes:

-- insert data via a merge
 MERGE INTO dbo.T_Allocations A USING
 (
     SELECT DAT.CourseId
     , DAT.StudentId
     , Cast(DAT.StartDate as Date) StartDate
     , Cast(DAT.EndDate as Date) EndDate
     , DAT.Hours
     , DAT.Mark 
     FROM (
      VALUES  (5, 5, '20170101', '20170131', 4, 2)
     , (5, 5, '20170101', '20170131', 5, 3)
     , (5, 5, '20170101', '20170131', 12, 4)
     , (4, 1, '20170101', '20170131', 4, 2) --already exists 
     ) DAT(CourseId, StudentId, StartDate, EndDate, Hours, Mark)
 ) DAT
   ON DAT.CourseId = A.CourseId
  AND DAT.StudentId = A.Studentid 
 WHEN NOT MATCHED THEN 
 INSERT (CourseId, StudentId, StartDate, EndDate, Hours, Mark)
 VALUES (DAT.CourseId, DAT.StudentId, DAT.StartDate, DAT.EndDate, DAT.Hours, DAT.Mark);

These are the types of scripts I typically use to insert data into a database, independently of the tables’ size. For more, check Microsoft’s documentation on INSERT. The preparation of the data to be imported can easily be done in Excel with a few macros.

01 February 2020

#️⃣☯Software Engineering: Concept Documents (The Good, the Bad and the Ugly)

Software Engineering

A concept document (or simply a concept) is a document that describes at a high level the set of necessary steps and their implications needed to achieve a desired result, typically the object of a project. In other words, it describes how something can be done or achieved, respectively how a problem can be solved.

The Good: The main aim of the document is to present all the important aspects and to assure that the idea is worthy of consideration, that the steps considered provide a good basis for further work, respectively to provide a good understanding for the various parties involved. Therefore, concepts are used as a basis for the sign-off, respectively for the implementation of software and hardware solutions.

A concept provides information about the context, design, architecture, security, usage, purpose and/or objectives of the future solution, together with the set of assumptions, constraints and implications. A concept is not necessarily a recipe, because it attempts to provide a solution for a given problem or situation that needs one. Even if it bears many similarities in content and structure, a concept is also not a strategy, because a strategy offers an interpretation of the problem, and not a business case either, because the latter focuses mainly on the financial aspects.

A concept thus proves to be a good basis for implementing the described solution, often being an important enabler. On the other side, a written concept is not always necessary, even if the conceptualization must exist in the implementers’ heads.

The Bad: For these reasons, projects often require the elaboration of a concept before further work can be attempted. To write such a document, one needs to understand the problem/situation and be capable of sketching a solution in which the various steps or components fit together like the pieces of a puzzle. The problem is that the more complex the problem to be solved, the fuzzier the view and understanding of the various pieces becomes, respectively the more challenging it becomes to fit the pieces together. In certain situations, it becomes almost impossible for a single person to understand and handle all the pieces. Solving the puzzle becomes a collective effort in which the complexity is broken into manageable parts, to the detriment of other aspects.

Writing a concept is a time-consuming task. The more accuracy and detail are needed, the longer it takes to write and review the document, time that’s usually stolen from other project phases, especially when the phases are considered as sequential. It takes about 20% of the total effort needed for a ‘perfect’ concept to write one that covers only 80% of the facts, while the remaining 20% of the facts require 80% of the effort, as the latter involve multiple iterations. In extremis, aiming for perfection will make one start the implementation late or not start at all. It’s a hard-to-understand pedantry with an important impact on projects’ timeline and quality, in the hope of a quality increase that is sometimes even illusory.

The Ugly: The concept-based approach is brought to the extreme in ERP implementations, where a concept needs to be written for each process or business area, often carrying fancy names – solution design document, technical design document, business process document, etc. Independently of what it is called, the purpose is to describe how the solution is implemented. The problem is that the conceptualization phase tends to take much longer than planned, given the dependencies between the various business areas in terms of functionality and activities. The complexity can become overwhelming, with an important impact on the project’s budget, time and quality.

31 January 2020

💫🧮☯ERP: Microsoft Dynamics 365 (The Good, the Bad and the Ugly)

ERP Implementation

The Good: The shift made by Microsoft in porting their Dynamics AX ERP solution to a web-based application (aka D365) hosted in the Microsoft cloud gave them a boost on the ERP market. The integration with the Office and BI stack, as well as Microsoft’s One Version strategy of pushing new features almost on a monthly basis and of keeping customers at most 2 releases behind the current version, makes D365 a solution to consider for small to big organizations spanning business sectors and geographies.

The Bad: Currently the monthly release cycle seems to be a challenge for customers and service providers alike. Even if the changes to existing functionality are minor and the functionality is thoroughly tested before releases, the customers still need to test the releases in several systems, especially to assure that the customizations and integrations still work. This can prove to be quite a challenge, in which automatic or semiautomatic tools can help when adequately used. Even then, a considerable effort needs to be covered by the parties involved.

The burden is bigger for the service providers that build their own solutions for D365, as they need to assure in advance that the applications still work after each release. From the customers’ perspective, the more such applications they use, the higher the risk of delays in adopting a release or, in extremis, of having to look for similar solutions. In theory, with good planning and by following best practices, the risks are small, though that’s just the theory speaking.

If in the past 2-3 instances were enough to support the ERP during and after the implementation, currently the requirements for the cloud-based solution have more than doubled, an organization ending up renting 5-7 D365 instances for the same purpose. Moreover, even if the split between the main blocks (Finance, Supply Chain, Retail and Talent), plus the various Customer Engagement packages, provides some flexibility, combining them leads to a considerable price increase. Further costs are related to the gaps existing in the available functionality. Most likely Microsoft will attempt to close some of the gaps; however, until then the customers are forced to opt for existing solutions or have the functionality built. Microsoft claims that their cloud-based ERP solution provides lower ownership costs; however, looking at the prices, it’s questionable whether D365 is affordable for small and average organizations. To put it bluntly – think how many socks (aka products) one needs to sell just to cover the implementation, licensing and infrastructure costs!

One important decision taken by Microsoft was not to allow direct access to the D365 production database, a decision that limits an organization’s choices and flexibility in addressing reporting requirements. Of course, the existing BI infrastructure can still be leveraged with a few workarounds, though the flexibility is lost, while further challenges are involved.

The Ugly: ERP implementations based on D365 are no exception to the general trend – given their complexity they are predisposed to fail in achieving the set objectives, and this despite Microsoft’s attempts at creating methodologies, tools and strong communities to support the service providers and customers in such projects. The reasons for failure reside with the customers and service providers alike, the chains of implications forming a complex network of causalities with multiple levels of reinforcement. When the negative reinforcements break the balance, a tipping point can result where things start to go wrong – escalations, finger-pointing, teams’ restructuring, litigations, etc. In extremis, even if the project reaches the finish line, the costs can easily reach an overrun of 50-150% of the initial estimation, and that’s a lot to bear.

30 January 2020

💼☯Project Management: Methodologies (The Good, the Bad and the Ugly)

Mismanagement

The Good: Nowadays there are several Project Management (PM) methodologies to choose from to address a project’s specifics and, when adapted and applied accordingly, a methodology can enable projects to be run and brought under control.

The Bad: Even if the theoretical basis of PM methodologies has been proved and perfected over the years, projects continue to fail at a disturbing rate. Of course, the reasons behind their failure are multiple, though often the failure reasons are rooted in how PM methodologies are taught, understood and implemented.

Just as a theoretical course in cooking won’t make one a good cook, a theoretical course in PM won’t make one a good Project Manager or a team member knowledgeable in applying the learned methodology. Surprisingly, the expectation is exactly that – the team member got a training and is good to go. Moreover, people believe that managing a software project is like coordinating the building of a small treehouse. To some degree there are many similarities, though the challenges typically lie in the details, and these details often escape a standard course.

To bridge the gap between theory and practice, the learner needs time to grow into the role, to learn the dos and don’ts, and, most importantly, to learn how to use the tools at hand efficiently. The methodology is itself a tool making use of further tools in its processes – project plans, work breakdown structures, checklists, charters, reports, records, etc. These can be learned only through practice, hopefully with some help (aka mentoring) from a person experienced in the respective methodology, be it the Project Manager, a trainer or another team member. Just as one can’t be thrown into the water and expected to swim across the Channel, you can’t do that with a newbie.

There’s a natural fallacy of thinking that we’ve understood more than we have. We can observe the limits of our understanding when confronted with the complexities involved in handling PM activities. A second fallacy is not believing other people’s warnings against using a tool or performing an activity in a certain way. A newbie’s mind sometimes has the predisposition of a child to touch a hot stove even when warned against it. It’s part of the learning process, though some persist in such behavior without learning much. What’s even more dangerous is a newbie pretending to be an expert, and this almost always ends badly.

The Ugly appears when the bad is brought to the extreme, when methodologies are misused for the wrong purposes to the degree that they destroy anything in their way. Of course, a pool can be dug using a spoon, but does it make sense to do that? Just because a tool can be used for something doesn’t mean it should be, as long as there are better tools for the same purpose. It seems a pretty logical thing, though the contrary happens more often than we’d like. It starts with the preconception that one should use the tool one knows best, ignoring in the process the fit-for-purpose condition. What’s even more deplorable is breaking down a project to fit a methodology while ignoring the technical and logistical aspects.

Any tool can lead to damage when used excessively, in the wrong places, at the wrong point in time or by the wrong person. Like the instruments in an orchestra, when an instrument plays the wrong note, it clashes with the rest. When more instruments play wrongly, the piece becomes unrecognizable. It’s the role of the conductor to make the players hit the right notes at the right time.

03 January 2020

🗄️Data Management: Data Literacy (Part I: A Second Language)

Data Management

At the Gartner Data & Analytics Summit that took place in 2018 in Grapevine, Texas, the importance of data literacy for taking advantage of the emergence of data analytics, artificial intelligence (AI) and machine learning (ML) technologies was reiterated. Gartner expected then that by 2020, 80% of organizations would initiate deliberate competency development in the field of data literacy [1] – or, as they put it, learning to ‘speak data’ as a ‘second language’.

Data literacy is typically defined as the ability to read, work with, analyze, and argue with data. Sure, these form the building blocks of data literacy, though what I’m missing from this definition is the ability to understand the data, even if understanding should be the outcome of reading, and the ability to put data into the context of business problems, even if the analysis of data could involve this latter aspect too.

Understanding has several aspects: understanding the data structures available within an organization, understanding the problems with data (including quality, governance, privacy and security), respectively understanding how the data are linked to the business processes. These aspects go beyond the simple abilities included in the above definition, which from my perspective doesn’t include the particularities of an organization (data structure, data quality and processes) – the business component. This is reflected in one of the problems often met in the BI/data analytics industry – the solutions developed by the various service providers don’t reflect organizations’ needs, one of the causes being the inability to understand the business by segment or holistically.

Putting data into context means being able to use the respective data in answering pressing business problems. A business problem needs to be correctly defined first, and this requires a deep understanding of the business. Then one needs to identify the data that could help find the answers to the problem, respectively build one or more models that would allow elaborating further theories and performing further simulations. This is an ongoing process in which the models built are further enhanced, when possible, or replaced by better ones.

Probably the comparison with a second language is only partially true. One can learn a second language and argue in that language, though it doesn’t mean the argumentation will be correct or constructive as long as the person can’t do the same in the native language. Moreover, one can have such abilities in the native or a secondary language, but not be able to do the same with data, as different skillsets are involved. This aspect can make quite a difference in a business scenario. One must also be able to philosophize and think critically, as well as understand the forms of communication and their rules with respect to data.

To philosophize means being able to understand the causality and further relations existing within the business and to think critically about them. Being able to communicate means more than being able to argue – it means being able to use the communication tools effectively: the communication channels, as well as the methods of representing data, information and knowledge. In extremis one might even go beyond the basic statistical tools, stepping thus into what statistical literacy is about. In fact, the difference between the two types of literacy has become thinner, residing in the accent put on their specific aspects.

These are the areas in which many professionals are probably lacking. Data literacy should be the aim; however, this takes time and is a continuous iterative process that can take years to reach maturity. It’s important for organizations to start addressing these aspects, progress in small increments and learn from the accumulated experience.


References:
[1] Gartner (2018) How data and analytics leaders learn to master information as a second language, by Christy Pettey (link

25 December 2019

#️⃣Software Engineering: Mea Culpa (Part II: The Beginnings)

Software Engineering
Software Engineering Series

I started programming at 14-15 years old with logical schemas made on paper, based mainly on simple mathematical algorithms like solving second-degree equations, finding prime or special numbers, and other simple tricks from the mathematical world available to a student at that age. It was challenging to learn programming based only on schemas, though, looking back, I think it was the best learning basis a programmer could have, because it taught me to think logically and it was also a good exercise, as one was forced to validate the outputs mentally or on paper.

Then I moved on to learning Basic and later Pascal on old-generation Spectrum computers, essentially a keyboard with 64K of memory and an improvised monitor. It felt almost like a holiday when one had the chance to work 45 minutes or so on an IBM computer with just 640K of memory. It was also a motivation to stay long after hours to write a few more lines of code. Even if it made no big difference in terms of speed, the simple idea of using a more advanced computer was a big deal.

The jump from logical schemas to actual programming was huge, as we moved from static formulas to exploratory methods like finding the roots of higher-degree equations using approximation methods, working with permutations and a few other combinatoric tools, interpolation methods, and so on. Once I got my own 64K Spectrum keyboard, a new world opened, as I had more time to play with 2- and 3-dimensional figures, location problems and so on. It was probably the time I got the most interesting exposure to things not found in the curricula.

Further on, during the university years I moved to Fortran, back to Pascal and dBASE, and later to C and C++, the focus remaining on mathematical and sorting algorithms, working with matrices, and so on. I have to admit that there was a big difference between the students who came from 2-3 hours of Informatics per week (like I did) and the ones coming from lyceums specialized in Informatics, especially during years in which learning materials were almost nonexistent. In the end all went well.

Jumping through so many programming languages, some quite old even for those times, allowed me to acquire different perspectives, yet it sometimes felt like a waste of time, especially when one was limited to using the campus computers, and that only during lab hours. That was the reality of those times. Fortunately, the university years went faster than they came. Almost one year after graduation, with a little help, some effort and benevolence, I managed to land a job as a web developer, jumping from an interlude with Java to ASP, JavaScript, HTML, ColdFusion, ActionScript, SQL, XML and a few other programming languages ‘en vogue’ during the 2000s.

Somewhere between graduation and my first job, my life changed when I was able to buy my own PC (a Pentium). It was the best investment I could make, mainly because it allowed me to be independent of what I was doing at work. It allowed me to learn the basics of OOP based on Visual Basic and occasionally on Visual C++ and C#. Most of the meaningful learning happened after work, from the few books available, full of mistakes and other challenges.

That was my beginning. It is not my intent to brag about how much I learned or how many programming languages I went through - knowledge is relative anyway - but to differentiate between the realities of then and today, as a bridge over time.


22 December 2019

SQL Server New Features: Using the R Language in SQL Server 2016 (Hello World & Working with Data Frames)

One of the most interesting features coming with SQL Server 2016 is the possibility to run external scripts written in the R language (with Python following in SQL Server 2017), thus taking advantage of the numerical and statistical packages coming with the respective languages. The next examples are based on the R language.

As the R scripts are considered external scripts, one first needs to enable the 'external scripts enabled' configuration option by using the following script (a server restart is required):

-- enable external scripts 
sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;

To make sure that the functionality works as expected, it makes sense to attempt first a "hello world" example:

-- hello world script
EXECUTE sp_execute_external_script 
        @language = N'R',  
        @script = N'print("Hello world")'



The R language comes with a few predefined datasets, and for the following examples I’ll be using the mtcars dataset. Because the dataset contains several columns, I will use only the first three (mtcars[1:3]), which can be previewed directly in the R console.

Usually it’s useful to look first at the structure of the dataset, for example by running the str(mtcars) command in the R console.

To return the dataset from R one can use the following call to the sp_execute_external_script stored procedure:

  -- returning the first 3 columns 
 EXEC sp_execute_external_script  
       @language = N'R'  
     , @script = N'cars <- mtcars[1:3];'
     , @input_data_1 = N''  
     , @output_data_1_name = N'cars'
     WITH RESULT SETS (("mpg" float not null 
         , "cyl" float not null 
         , "disp" float not null 
       ));  


As can be seen, besides the script, one needs to define the variable in which the returned data frame is stored, as well as the definition of the resulting result set. Unfortunately, this script doesn’t return the rows’ names. To do that I had to use a small trick, combining the rows’ names with the initial data frame. (Please let me know if you have another way of achieving the same.)

The script becomes:

 -- returning the first 3 columns including rows' name
 EXEC   sp_execute_external_script  
       @language = N'R'  
     , @script = N'cars <- data.frame(rownames(mtcars), mtcars[1:3]);'
     , @input_data_1 = N''  
     , @output_data_1_name = N'cars'
     WITH RESULT SETS (("Car" varchar(100)
      , "mpg" float not null 
      , "cyl" float not null 
      , "disp" float not null 
       ));  



To reuse the script, it can be included in a stored procedure, similarly to the examples provided in the Microsoft documentation for the sp_execute_external_script stored procedure, as sketched below.
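
A minimal sketch wrapping the last script (the procedure name is my choice, not from the documentation):

-- encapsulating the external script into a stored procedure
CREATE PROCEDURE dbo.pGetCars
AS
BEGIN
    EXEC sp_execute_external_script  
          @language = N'R'  
        , @script = N'cars <- data.frame(rownames(mtcars), mtcars[1:3]);'
        , @input_data_1 = N''  
        , @output_data_1_name = N'cars'
        WITH RESULT SETS (("Car" varchar(100)
         , "mpg" float not null 
         , "cyl" float not null 
         , "disp" float not null));
END

-- calling the procedure
EXEC dbo.pGetCars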

A few pointers:
1. It’s useful to test your scripts first in the R console.
2. If 'external scripts enabled' was not enabled, then the following error message will appear:
Msg 39023, Level 16, State 1, Procedure sp_execute_external_script, Line 1 [Batch Start Line 22]
'sp_execute_external_script' is disabled on this instance of SQL Server. Use sp_configure 'external scripts enabled' to enable it.

3. It might be necessary to start the “SQL Server Launchpad” service manually; a hint in this direction comes from the following error message:
Msg 39011, Level 16, State 1, Line 24
SQL Server was unable to communicate with the LaunchPad service. Please verify the configuration of the service.
 
4. Once the examples are tested, it might be recommended to disable the 'external scripts enabled' configuration option as long as it is not needed anymore (see the snippet after this list).
5. Check the various ways to analyse the mtcars dataset using the R language: https://rpubs.com/BillB/217355
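
For point 4, a minimal snippet mirroring the enabling script from above:

-- disable external scripts 
sp_configure 'external scripts enabled', 0;
RECONFIGURE WITH OVERRIDE;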

📊R Language: Drawing Function Plots (Part I - Basic Curves)

Besides the fact that it is free, one of the advantages of the R language is that it allows drawing function plots with simple commands. All one needs is a software package like Microsoft R Open and one is set to go.

For example, to draw the function f(x) = x^3-3*x+2 all one needs to do is to type the following command into the console:

curve(x^3-3*x+2, -3,3)



One can display a second function in the same chart by adding a second command and repeating the process further as needed:

curve(x^2-3*x+2, add=TRUE, col="blue")
curve(x^4-3*x+2, add=TRUE, col="red")
curve(x^5-3*x+2, add=TRUE, col="green")


As one can see a pattern emerges already …

One could easily display each plot in its own section by defining an array on the display device, thus allowing one to focus on special characteristics of the functions:

par(mfrow=c(2,2))
curve(x^3-3*x+2, -3,3)
curve(x^2-3*x+2, -3,3, col="blue")
curve(x^4-3*x+2, -3,3, col="red")
curve(x^5-3*x+2, -3,3, col="green")



One can be creative and display the functions using a loop:

par(mfrow=c(2,2))
for(n in c(2:5)){
curve(x^n-3*x+2, -3,3)
}


Similarly, one can plot trigonometric functions:

par(mfrow=c(2,3))
curve(sin(x),-pi,pi) 
curve(cos(x),-pi,pi) 
curve(tan(x),-pi,pi) 
curve(1/tan(x),-pi,pi) 
curve(asin(x),-1,1)
curve(acos(x),-1,1)


The possibilities are endless. For complex functions, one can wrap the expression in an R function:

myFunction<- function(x) sin(cos(x)*exp(-x/2))
curve(myFunction, -15, 15, n=1000)


For SQL Server developers, what’s even greater is the possibility of using Management Studio for the same, though that’s a topic for another post.

 
Happy coding!

