16 May 2020

Query Patterns in SQL Server: Joins and Subqueries

The basis for manipulating data via SQL scripting is a good knowledge of joins and subqueries, applied as fits the purpose and the data model. The following scripts are based on the tables created in a previous post.

-- INNER JOIN
SELECT *
FROM dbo.T_Courses C
     JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId
 ORDER BY C.CourseName
 , S.StudentName 

-- INNER JOIN (deprecated comma syntax)
 SELECT *
 FROM dbo.T_Courses C
 , dbo.T_Students S
 WHERE C.CourseId = S.CourseId
 ORDER BY C.CourseName
 , S.StudentName 

-- LEFT JOIN
 SELECT *
 FROM dbo.T_Courses C
      LEFT JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId 
 WHERE S.CourseId IS NULL
 ORDER BY C.CourseName
 , S.StudentName 

-- RIGHT JOIN
 SELECT *
 FROM dbo.T_Courses C
      RIGHT JOIN dbo.T_Students S
       ON C.CourseId = S.CourseId 
 WHERE C.CourseId IS NULL
 ORDER BY C.CourseName
 , S.StudentName 

-- FULL OUTER JOIN
 SELECT *
 FROM dbo.T_Courses C
      FULL OUTER JOIN dbo.T_Students S
        ON C.CourseId = S.CourseId 
 --WHERE C.CourseId IS NULL
 --WHERE S.CourseId IS NULL
 --WHERE (C.CourseId IS NULL OR S.StudentId IS NULL)
 --WHERE (C.CourseId IS NULL AND S.StudentId IS NULL)
 ORDER BY C.CourseName
 , S.StudentName 

IN, NOT IN, EXISTS and NOT EXISTS allow using (correlated) subqueries, their use being indicated when no actual data are needed from the tables involved in the subqueries:

-- EXISTS (correlated subquery)
SELECT *
FROM dbo.T_Courses C
WHERE EXISTS (SELECT StudentId 
        FROM dbo.T_Students S
        WHERE C.CourseId = S.CourseId)
ORDER BY C.CourseName

-- NOT EXISTS (correlated subquery)
 SELECT *
 FROM dbo.T_Courses C
 WHERE NOT EXISTS (SELECT StudentId 
        FROM dbo.T_Students S
        WHERE C.CourseId = S.CourseId)
 ORDER BY C.CourseName

-- IN (subquery)
 SELECT *
 FROM dbo.T_Courses C
 WHERE CourseId IN (SELECT CourseId 
        FROM dbo.T_Students S)
 ORDER BY C.CourseName
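
Since NOT IN was mentioned above but not illustrated, a minimal sketch based on the same tables follows; note that NOT IN returns no rows if the subquery yields NULL values (as T_Students.CourseId can here), hence the filter on NULLs:

-- NOT IN (subquery), excluding NULLs returned by the subquery
 SELECT *
 FROM dbo.T_Courses C
 WHERE CourseId NOT IN (SELECT CourseId 
        FROM dbo.T_Students S
        WHERE S.CourseId IS NOT NULL)
 ORDER BY C.CourseName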

Joining multiple tables follows the same principles as above:

-- joins with more tables 
SELECT A.CourseId 
, C.CourseName 
, A.StudentId 
, S.StudentName 
 , A.StartDate 
 , A.EndDate 
FROM dbo.T_Allocations A
      JOIN dbo.T_Courses C
        ON A.CourseId = C.CourseId 
      JOIN dbo.T_Students S
        ON A.StudentId = S.StudentId 
 ORDER BY C.CourseName 
 , S.StudentName 

One can obtain the same result via correlated subqueries (a technique often met among Oracle developers). For readability reasons I avoid writing such queries, unless there’s a special purpose for doing so.

-- correlated subquery for individual values
SELECT A.CourseId 
, (SELECT C.CourseName 
    FROM dbo.T_Courses C
    WHERE A.CourseId = C.CourseId) CourseName
 , A.StudentId 
 , (SELECT S.StudentName 
    FROM dbo.T_Students S
    WHERE A.StudentId = S.StudentId) StudentName
 , A.StartDate 
 , A.EndDate 
FROM dbo.T_Allocations A
 ORDER BY CourseName 
 , StudentName 

When returning values within a SELECT via correlated subqueries, some developers feel the need to use the MAX or MIN functions to make sure only one value is returned. For data analysis this may be acceptable; however, if the data model imposes it, then a redesign of the solution is more likely necessary.

-- correlated subquery for individual values
 SELECT A.CourseId 
 , (SELECT Max(C.CourseName)
    FROM dbo.T_Courses C
    WHERE A.CourseId = C.CourseId) CourseName
 , A.StudentId 
 , (SELECT Max(S.StudentName)
    FROM dbo.T_Students S
    WHERE A.StudentId = S.StudentId) StudentName
 , A.StartDate 
 , A.EndDate 
 FROM dbo.T_Allocations A
 ORDER BY CourseName 
 , StudentName 

Another technique that is not recommended is returning several attributes from the same table, under the same conditions, via individual correlated subqueries. Using aggregate functions in such subqueries is, however, more appropriate for numerical or date values, as sketched below.
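
As an illustration of the latter point, a minimal sketch based on the tables above (the aggregate and aliases are my choice, not from the original post):

-- correlated subquery with an aggregate over a date value
SELECT S.StudentId 
, S.StudentName 
, (SELECT Max(A.EndDate)
    FROM dbo.T_Allocations A
    WHERE A.StudentId = S.StudentId) LastEndDate
FROM dbo.T_Students S
ORDER BY S.StudentName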

Query Patterns in SQL Server: Inserts

Independently of whether one is a database developer, an administrator, or any other professional in the data area, being able to manipulate data on the fly is an important part of the job. Since the availability of DTS (Data Transformation Services), replaced in SQL Server 2005 by SSIS (SQL Server Integration Services), the use of packages became one of the simplest and fastest ways of importing data into a database. It still is; however, when dealing with small volumes of data, it’s easier to import the data via a few SQL scripts, as in the examples below.
The old-fashioned way of loading data is to define first the target table based on the source’s definition:

-- Course data 
--DROP TABLE IF EXISTS dbo.T_Courses 

CREATE TABLE [dbo].[T_Courses](
     [CourseId] [int] NOT NULL,
     [CourseName] [varchar](50) NOT NULL
 ) ON [PRIMARY]

-- old-fashioned individual inserts (SQL Server 2000)
INSERT INTO dbo.T_Courses
VALUES (1, 'Course 1')

INSERT INTO dbo.T_Courses(CourseId, CourseName)
VALUES (2, 'Course 2')

-- inserting multiple rows to a table at once (SQL Server 2005)
INSERT INTO dbo.T_Courses
--OUTPUT inserted.CourseId, inserted.CourseName -- 
VALUES (3, 'Course 3')
 , (4, 'Course 4')
 , (5, 'Course 5')
 , (6, 'Course 6')
 , (7, 'Course 7')

 SELECT *
 FROM dbo.T_Courses

Sometimes the definition of the table is not important. Starting with SQL Server 2008 it became possible to create the table on the fly, and thus the inserts can be simplified as follows:

-- Course data 
--DROP TABLE IF EXISTS dbo.T_Courses 

SELECT *
INTO dbo.T_Courses
FROM (
  VALUES (1, 'Course 1')
 , (2, 'Course 2')
 , (3, 'Course 3')
 , (4, 'Course 4')
 , (5, 'Course 5')
 , (6, 'Course 6')
 , (7, 'Course 7')
 ) C (CourseId, CourseName)


 SELECT *
 FROM dbo.T_Courses

Now, to demonstrate further techniques, it is useful to define besides the Courses table also a Students and an Allocations table, the latter storing the relation between a Course and a Student. As can be seen, the Students table contains a reference to the Courses table, showing the course in which the student is currently enrolled. The model is only for demonstration purposes:

-- Student data with current courses
-- DROP TABLE IF EXISTS dbo.T_Students

SELECT StudentId
 , StudentName
 , CourseId
 , Cast(StartDate as Date) StartDate
 , Cast(EndDate as Date) EndDate
 INTO dbo.T_Students
 FROM (
  VALUES (1, 'Student 1', 1, '20170201', '20170228')
 , (2, 'Student 2', 1, '20170201', '20170228')
 , (3, 'Student 3', 2, '20170201', '20170228')
 , (4, 'Student 4', 3, '20170201', '20170228')
 , (5, 'Student 5', NULL, NULL, NULL)
 , (6, 'Student 6', NULL, NULL, NULL)
 , (7, 'Student 7', NULL, NULL, NULL)
 ) S (StudentId, StudentName, CourseId, StartDate, EndDate)


 SELECT *
 FROM dbo.T_Students

-- Course allocations 
-- DROP TABLE IF EXISTS dbo.T_Allocations

SELECT CourseId
 , StudentId
 , Cast(StartDate as Date) StartDate
 , Cast(EndDate as Date) EndDate
 , Hours
 , Mark 
 INTO dbo.T_Allocations
 FROM (
  VALUES  (1, 1, '20170101', '20170131', 4, 1)
 , (1, 2, '20170101', '20170131', 5, 2)
 , (1, 3, '20170101', '20170131', 12, 3)
 , (2, 1, '20170201', '20170128', 23, 1)
 , (2, 3, '20170201', '20170128', 12, 3)
 , (2, 4, '20170201', '20170128', 21, 1)
 , (1, 5, '20170201', '20170128', 12, 1)
 , (4, 1, '20170101', '20170131', 5, 2)
 , (4, 1, '20170101', '20170131', 4, 2)
 ) A(CourseId, StudentId, StartDate, EndDate, Hours, Mark)

ALTER TABLE dbo.T_Allocations 
ADD AllocationId int IDENTITY(1,1) 

SELECT *
FROM dbo.T_Allocations

With this, the data model is ready for use. By the initial design, the courses in which the students are currently enrolled were not supposed to be in the Allocations table. The following query repairs this:

-- insert data from query
INSERT INTO dbo.T_Allocations (CourseId, StudentId, StartDate, EndDate, Hours)
SELECT C.CourseId
, C.StudentId
, C.StartDate
 , C.EndDate
 , 0 Hours
 FROM dbo.T_Students C
      LEFT JOIN dbo.T_Allocations A
        ON C.CourseId = A.CourseId
       AND C.StudentId = A.Studentid 
 WHERE A.CourseId IS NULL
   AND C.CourseId IS NOT NULL

If more data needs to be appended, one can modify the corresponding query from above as follows:

-- insert data from values with check against existing data 
INSERT INTO dbo.T_Allocations (CourseId, StudentId, StartDate, EndDate, Hours, Mark)
SELECT DAT.CourseId
 , DAT.StudentId
 , Cast(DAT.StartDate as Date) StartDate
 , Cast(DAT.EndDate as Date) EndDate
 , DAT.Hours
 , DAT.Mark 
 FROM (
  VALUES  (5, 5, '20170101', '20170131', 4, 2)
 , (5, 5, '20170101', '20170131', 5, 3)
 , (5, 5, '20170101', '20170131', 12, 4)
 , (4, 1, '20170101', '20170131', 4, 2) --already exists 
 ) DAT(CourseId, StudentId, StartDate, EndDate, Hours, Mark)
   LEFT JOIN dbo.T_Allocations A
     ON DAT.CourseId = A.CourseId
    AND DAT.StudentId = A.Studentid 
 WHERE A.CourseId IS NULL

Starting with the 2008 version, the MERGE statement was introduced in SQL Server, which allows adding further behavior for INSERTs, UPDATEs or DELETEs within the same statement. The above insert expressed as a MERGE becomes:

-- insert data via a merge
 MERGE INTO dbo.T_Allocations A USING
 (
     SELECT DAT.CourseId
     , DAT.StudentId
     , Cast(DAT.StartDate as Date) StartDate
     , Cast(DAT.EndDate as Date) EndDate
     , DAT.Hours
     , DAT.Mark 
     FROM (
      VALUES  (5, 5, '20170101', '20170131', 4, 2)
     , (5, 5, '20170101', '20170131', 5, 3)
     , (5, 5, '20170101', '20170131', 12, 4)
     , (4, 1, '20170101', '20170131', 4, 2) --already exists 
     ) DAT(CourseId, StudentId, StartDate, EndDate, Hours, Mark)
 ) DAT
   ON DAT.CourseId = A.CourseId
  AND DAT.StudentId = A.Studentid 
 WHEN NOT MATCHED THEN 
 INSERT (CourseId, StudentId, StartDate, EndDate, Hours, Mark)
 VALUES (DAT.CourseId, DAT.StudentId, DAT.StartDate, DAT.EndDate, DAT.Hours, DAT.Mark);

These are the types of scripts I typically use to insert data into a database, independently of the tables’ size. For more, check Microsoft’s documentation on INSERT. The preparation of the data to be imported can easily be done in Excel with a few macros.

01 February 2020

#️⃣☯Software Engineering: Concept Documents (The Good, the Bad and the Ugly)

Software Engineering

A concept document (or simply a concept) is a document that describes at a high level the set of necessary steps and their implications needed to achieve a desired result, typically the object of a project. In other words, it describes how something can be done or achieved, respectively how a problem can be solved.

The Good: The main aim of the document is to present all the important aspects and to assure that the idea is worthy of consideration, that the steps considered provide a good basis for further work, respectively to provide a good understanding for the various parties involved. Therefore, concepts are used as a basis for the sign-off, respectively for the implementation of software and hardware solutions.

A concept provides information about the context, design, architecture, security, usage, purpose and/or objectives of the future solution, together with the set of assumptions, constraints and implications. A concept is not necessarily a recipe, because it attempts to provide a solution for a given problem or situation that needs one. Even if it bears many similarities in content and structure, a concept is also not a strategy, because a strategy offers an interpretation of the problem, and not a business case either, because the latter focuses mainly on the financial aspects.

A concept thus proves to be a good basis for implementing the described solution, often being an important enabler. On the other side, a written concept is not always necessary, even if the conceptualization must exist in the implementers’ heads.

The Bad: For these reasons, projects often require the elaboration of a concept before further work can be attempted. To write such a document, one needs to understand the problem/situation and be capable of sketching a solution in which the various steps or components fit together like the pieces of a puzzle. The problem is that the more complex the problem to be solved, the fuzzier the view and understanding of the various pieces becomes, respectively the more challenging it becomes to fit the pieces together. In certain situations, it becomes almost impossible for a single person to understand and handle all the pieces. Solving the puzzle becomes a collective effort in which the complexity is broken into manageable parts, to the detriment of other aspects.

Writing a concept is a time-consuming task. The more accuracy and detail are needed, the longer it takes to write and review the document, time that’s usually stolen from other project phases, especially when the phases are considered as sequential. It takes about 20% of the total effort needed for a ‘perfect’ concept to write one that covers only 80% of the facts, while the remaining 20% of the facts require 80% of the effort, as the latter involve multiple iterations. In extremis, aiming for perfection will make one start the implementation late or not start at all. It’s a hard-to-understand pedantry with an important impact on projects’ timeline and quality, in the hope of a quality increase that is sometimes even illusory.

The Ugly: The concept-based approach is brought to the extreme in ERP implementations, where a concept needs to be written for each process or business area, often carrying fancy names – solution design document, technical design document, business process document, etc. Independently of what it is called, the purpose is to describe how the solution is implemented. The problem is that the conceptualization phase tends to take much longer than planned, given the dependencies between the various business areas in terms of functionality and activities. The complexity can become overwhelming, with an important impact on the project’s budget, time and quality.

31 January 2020

💫🧮☯ERP: Microsoft Dynamics 365 (The Good, the Bad and the Ugly)

ERP Implementation

The Good: The shift made by Microsoft in porting their Dynamics AX ERP solution to a web-based application (aka D365) hosted in the Microsoft cloud gave them a boost on the ERP market. The integration with the Office and BI stack, as well as Microsoft’s One Version strategy of pushing new features almost on a monthly basis and of keeping customers at most 2 releases behind the current version, makes D365 a solution to consider for small to big organizations spanning business sectors and geographies.

The Bad: Currently the monthly release cycle seems to be a challenge for customers and service providers alike. Even if the changes to existing functionality are minor and the functionality is thoroughly tested before releases, the customers still need to test the releases in several systems, especially to assure that the customizations and integrations still work. This can prove to be quite a challenge, in which automatic or semiautomatic tools can help when adequately used. Even then, a considerable effort needs to be covered by the parties involved.

The burden is bigger for the service providers that build their own solutions for D365, as they need to assure in advance that the applications still work after each release. From the customers’ perspective, the more such applications they use, the higher the risk of delays in adopting a release or, in extremis, of having to look for similar solutions. In theory, with good planning and by following best practices, the risks are small, though that’s just the theory speaking.

If in the past 2-3 instances were enough to support the ERP during and after the implementation, currently the requirements for the cloud-based solution have more than doubled, an organization ending up renting 5-7 D365 instances for the same purpose. Moreover, even if the split between the main blocks (Finance, Supply Chain, Retail and Talent), plus the various Customer Engagement packages, provides some flexibility, combining them leads to a considerable price increase. Further costs are related to the gaps existing in the available functionality. Most likely Microsoft will attempt to close some of the gaps; however, until then the customers are forced to opt for existing solutions or have the functionality built. Microsoft claims that their cloud-based ERP solution provides lower ownership costs; however, looking at the prices, it’s questionable whether D365 is affordable for small and average organizations. To put it bluntly – think how many socks (aka products) one needs to sell just to cover the implementation, licensing and infrastructure costs!

One important decision taken by Microsoft was not to allow direct access to the D365 production database, a decision that limits an organization’s choices and flexibility in addressing reporting requirements. Of course, the existing BI infrastructure can still be leveraged with a few workarounds, though the flexibility is lost, while further challenges are involved.

The Ugly: ERP implementations based on D365 are no exception to the general trend – given their complexity they are predisposed to fail in achieving the set objectives, and this despite Microsoft’s attempts at creating methodologies, tools and strong communities to support the service providers and customers in such projects. The reasons for failure reside with the customers and service providers alike, the chains of implications forming a complex network of causalities with multiple levels of reinforcement. When the negative reinforcements break the balance, a tipping point can result where things start to go wrong – escalations, finger-pointing, teams’ restructuring, litigations, etc. In extremis, even if the project reaches the finish line, the costs can easily reach an overrun of 50-150% of the initial estimation, and that’s a lot to bear.

30 January 2020

💼☯Project Management: Methodologies (The Good, the Bad and the Ugly)

Mismanagement

The Good: Nowadays there are several Project Management (PM) methodologies to choose from to address a project’s specifics and, when adapted and applied accordingly, a methodology can enable projects to be run and brought under control.

The Bad: Even if the theoretical basis of PM methodologies has been proved and perfected over the years, projects continue to fail at a disturbing rate. Of course, the reasons behind their failure are multiple, though often the failure reasons are rooted in how PM methodologies are taught, understood and implemented.

Just as a theoretical course in cooking won’t make one a good cook, a theoretical course in PM won’t make one a good Project Manager or a team member knowledgeable in applying the learned methodology. Surprisingly, the expectation is exactly that – the team member got a training and is good to go. Moreover, people believe that managing a software project is like coordinating the building of a small treehouse. To some degree there are many similarities, though the challenges typically lie in the details, and these details often escape a standard course.

To bridge the gap between theory and practice, the learner needs time to grow into the role, to learn the dos and don’ts, and, most importantly, to learn how to use the tools at hand efficiently. The methodology is itself a tool making use of further tools in its processes – project plans, work breakdown structures, checklists, charters, reports, records, etc. These can be learned only through practice, hopefully with some help (aka mentoring) from a person experienced in the respective methodology, be it the Project Manager, a trainer or another team member. Just as one can’t be thrown into the water and expected to swim across the Channel, you can’t do that with a newbie.

There’s a natural fallacy of thinking that we’ve understood more than we have. We can observe the limits of our understanding when confronted with the complexities involved in handling PM activities. A second fallacy is not believing other people’s warnings against using a tool or performing an activity in a certain way. A newbie’s mind sometimes has the predisposition of a child to touch a hot stove even when warned against it. It’s part of the learning process, though some persist in such behavior without learning much. What’s even more dangerous is a newbie pretending to be an expert, and this almost always ends badly.

The Ugly appears when the bad is brought to the extreme, when methodologies are misused for the wrong purposes to the degree that they destroy anything in their way. Of course, a pool can be dug using a spoon, but does it make sense to do that? Just because a tool can be used for something doesn’t mean it should be, as long as there are better tools for the same purpose. It seems a pretty logical thing, though the contrary happens more often than we’d like. It starts with the preconception that one should use the tool one knows best, ignoring in the process the fit-for-purpose condition. What’s even more deplorable is breaking down a project to fit a methodology while ignoring the technical and logistical aspects.

Any tool can lead to damage when used excessively, in the wrong places, at the wrong point in time or by the wrong person. Like the instruments in an orchestra, when an instrument plays the wrong note, it clashes with the rest. When more instruments play wrongly, the piece becomes unrecognizable. It’s the role of the conductor to make the players hit the right notes at the right time.

03 January 2020

🗄️Data Management: Data Literacy (Part I: A Second Language)

Data Management

At the Gartner Data & Analytics Summit that took place in 2018 in Grapevine, Texas, the importance of data literacy for taking advantage of the emergence of data analytics, artificial intelligence (AI) and machine learning (ML) technologies was reiterated. Gartner expected then that by 2020, 80% of organizations would initiate deliberate competency development in the field of data literacy [1] – or, as they put it, learning to ‘speak data’ as a ‘second language’.

Data literacy is typically defined as the ability to read, work with, analyze, and argue with data. Sure, these form the building blocks of data literacy, though what I’m missing from this definition is the ability to understand the data, even if understanding should be the outcome of reading, and the ability to put data into the context of business problems, even if the analysis of data could involve this latter aspect too.

Understanding has several aspects: understanding the data structures available within an organization, understanding the problems with data (including quality, governance, privacy and security), respectively understanding how the data are linked to the business processes. These aspects go beyond the simple abilities included in the above definition, which from my perspective doesn’t include the particularities of an organization (data structure, data quality and processes) – the business component. This is reflected in one of the problems often met in the BI/data analytics industry – the solutions developed by the various service providers don’t reflect organizations’ needs, one of the causes being the inability to understand the business by segment or holistically.

Putting data into context means being able to use the respective data in answering pressing business problems. A business problem needs to be correctly defined first, and this requires a deep understanding of the business. Then one needs to identify the data that could help find the answers to the problem, respectively build one or more models that would allow elaborating further theories and performing further simulations. This is an ongoing process in which the models built are further enhanced, when possible, or replaced by better ones.

Probably the comparison with a second language is only partially true. One can learn a second language and argue in that language, though it doesn’t mean the argumentation will be correct or constructive as long as the person can’t do the same in the native language. Moreover, one can have such abilities in the native or a secondary language, but not be able to do the same with data, as different skillsets are involved. This aspect can make quite a difference in a business scenario. One must also be able to philosophize and think critically, as well as understand the forms of communication and their rules with respect to data.

To philosophize means being able to understand the causality and further relations existing within the business and to think critically about them. Being able to communicate means more than being able to argue – it means being able to use the communication tools effectively: the communication channels, as well as the methods of representing data, information and knowledge. In extremis one might even go beyond the basic statistical tools, stepping thus into what statistical literacy is about. In fact, the difference between the two types of literacy has become thinner, residing in the accent put on their specific aspects.

These are the areas in which many professionals are probably lacking. Data literacy should be the aim; however, this takes time and is a continuous iterative process that can take years to reach maturity. It’s important for organizations to start addressing these aspects, progress in small increments and learn from the accumulated experience.


References:
[1] Gartner (2018) How data and analytics leaders learn to master information as a second language, by Christy Pettey (link

25 December 2019

#️⃣Software Engineering: Mea Culpa (Part II: The Beginnings)

Software Engineering
Software Engineering Series

I started programming at 14-15 years old with logical schemas made on paper, based mainly on simple mathematical algorithms like solving second-degree equations, finding prime or special numbers, and other simple tricks from the mathematical world available to a student at that age. It was challenging to learn programming based only on schemas, though, looking back, I think it was the best learning basis a programmer could have, because it taught me to think logically and it was also a good exercise, as one was forced to validate the outputs mentally or on paper.

Then I moved on to learning Basic and later Pascal on old-generation Spectrum computers, essentially a keyboard with 64K of memory and an improvised monitor. It felt almost like a holiday when one had the chance to work 45 minutes or so on an IBM computer with just 640K of memory. It was also a motivation to stay long after hours to write a few more lines of code. Even if it made no big difference in terms of speed, the simple idea of using a more advanced computer was a big deal.

The jump from logical schemas to actual programming was huge, as we moved from static formulas to exploratory methods like finding the roots of higher-degree equations using approximation methods, working with permutations and a few other combinatoric tools, interpolation methods, and so on. Once I got my own 64K Spectrum keyboard, a new world opened, as I had more time to play with 2- and 3-dimensional figures, location problems and so on. It was probably the time I got the most interesting exposure to things not found in the curricula.

Further on, during the university years I moved to Fortran, back to Pascal and dBASE, and later to C and C++, the focus remaining on mathematical and sorting algorithms, working with matrices, and so on. I have to admit that there was a big difference between the students who came from 2-3 hours of Informatics per week (like I did) and the ones coming from lyceums specialized in Informatics, especially during years in which learning materials were almost nonexistent. In the end all went well.

Jumping through so many programming languages, some quite old even for those times, allowed me to acquire different perspectives, yet it sometimes felt like a waste of time, especially when one was limited to using the campus computers, and that only during lab hours. That was the reality of those times. Fortunately, the university years went faster than they came. Almost one year after graduation, with a little help, some effort and benevolence, I managed to land a job as a web developer, jumping from an interlude with Java to ASP, JavaScript, HTML, ColdFusion, ActionScript, SQL, XML and a few other programming languages ‘en vogue’ during the 2000s.

Somewhere between graduation and my first job, my life changed when I was able to buy my own PC (a Pentium). It was the best investment I could make, mainly because it allowed me to be independent of what I was doing at work. It allowed me to learn the basics of OOP based on Visual Basic and occasionally on Visual C++ and C#. Most of the meaningful learning happened after work, from the few books available, full of mistakes and other challenges.

That was my beginning. It is not my intent to brag about how much I learned or how many programming languages I went through - knowledge is relative anyway - but to differentiate between the realities of then and today, as a bridge over time.


22 December 2019

SQL Server New Features: Using the R Language in SQL Server 2016 (Hello World & Working with Data Frames)

One of the most interesting features coming with SQL Server 2016 is the possibility to run external scripts written in the R language (with Python following in SQL Server 2017), thus taking advantage of the numerical and statistical packages coming with the respective languages. The next examples are based on the R language.

As the R scripts are considered external scripts, one first needs to enable the 'external scripts enabled' configuration option by using the following script (a server restart is required):

-- enable external scripts 
sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;

To make sure that the functionality works as expected, it makes sense to attempt first a "hello world" example:

-- hello world script
EXECUTE sp_execute_external_script 
        @language = N'R',  
        @script = N'print("Hello world")'



The R language comes with a few predefined datasets, and for the following examples I’ll be using the mtcars dataset. Because the dataset contains several columns, I will use only the first three (mtcars[1:3]), which can be previewed directly in the R console.

Usually it’s useful to look first at the structure of the dataset, for example by running the str(mtcars) command in the R console.

To return the dataset from R one can use the following call to the sp_execute_external_script stored procedure:

  -- returning the first 3 columns 
 EXEC sp_execute_external_script  
       @language = N'R'  
     , @script = N'cars <- mtcars[1:3];'
     , @input_data_1 = N''  
     , @output_data_1_name = N'cars'
     WITH RESULT SETS (("mpg" float not null 
         , "cyl" float not null 
         , "disp" float not null 
       ));  


As can be seen, besides the script, one needs to define the variable in which the returned data frame is stored, as well as the definition of the resulting result set. Unfortunately, this script doesn’t return the rows’ names. To do that I had to use a small trick, combining the rows’ names with the initial data frame. (Please let me know if you have another way of achieving the same.)

The script becomes:

 -- returning the first 3 columns including rows' name
 EXEC   sp_execute_external_script  
       @language = N'R'  
     , @script = N'cars <- data.frame(rownames(mtcars), mtcars[1:3]);'
     , @input_data_1 = N''  
     , @output_data_1_name = N'cars'
     WITH RESULT SETS (("Car" varchar(100)
      , "mpg" float not null 
      , "cyl" float not null 
      , "disp" float not null 
       ));  



To reuse the script, it can be included in a stored procedure, similarly to the examples provided in the Microsoft documentation for the sp_execute_external_script stored procedure, as sketched below.
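
A minimal sketch wrapping the last script (the procedure name is my choice, not from the documentation):

-- encapsulating the external script into a stored procedure
CREATE PROCEDURE dbo.pGetCars
AS
BEGIN
    EXEC sp_execute_external_script  
          @language = N'R'  
        , @script = N'cars <- data.frame(rownames(mtcars), mtcars[1:3]);'
        , @input_data_1 = N''  
        , @output_data_1_name = N'cars'
        WITH RESULT SETS (("Car" varchar(100)
         , "mpg" float not null 
         , "cyl" float not null 
         , "disp" float not null));
END

-- calling the procedure
EXEC dbo.pGetCars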

A few pointers:
1. It’s useful to test your scripts first in the R console.
2. If 'external scripts enabled' was not enabled, then the following error message will appear:
Msg 39023, Level 16, State 1, Procedure sp_execute_external_script, Line 1 [Batch Start Line 22]
'sp_execute_external_script' is disabled on this instance of SQL Server. Use sp_configure 'external scripts enabled' to enable it.

3. It might be necessary to start the “SQL Server Launchpad” service manually; a hint in this direction comes from the following error message:
Msg 39011, Level 16, State 1, Line 24
SQL Server was unable to communicate with the LaunchPad service. Please verify the configuration of the service.
 
4. Once the examples are tested, it might be recommended to disable the 'external scripts enabled' configuration option as long as it is not needed anymore (see the snippet after this list).
5. Check the various ways to analyse the mtcars dataset using the R language: https://rpubs.com/BillB/217355
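
For point 4, a minimal snippet mirroring the enabling script from above:

-- disable external scripts 
sp_configure 'external scripts enabled', 0;
RECONFIGURE WITH OVERRIDE;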

📊R Language: Drawing Function Plots (Part I - Basic Curves)

Besides the fact that it is free, one of the advantages of the R language is that it allows drawing function plots with simple commands. All one needs is a software package like Microsoft R Open and one is set to go.

For example, to draw the function f(x) = x^3-3*x+2 all one needs to do is to type the following command into the console:

curve(x^3-3*x+2, -3,3)



One can display a second function in the same chart by adding a second command and repeating the process further as needed:

curve(x^2-3*x+2, add=TRUE, col="blue")
curve(x^4-3*x+2, add=TRUE, col="red")
curve(x^5-3*x+2, add=TRUE, col="green")


As one can see a pattern emerges already …

One could easily display each plot in its own section by defining an array on the display device, thus allowing one to focus on special characteristics of the functions:

par(mfrow=c(2,2))
curve(x^3-3*x+2, -3,3)
curve(x^2-3*x+2, -3,3, col="blue")
curve(x^4-3*x+2, -3,3, col="red")
curve(x^5-3*x+2, -3,3, col="green")



One can be creative and display the functions using a loop:

par(mfrow=c(2,2))
for(n in c(2:5)){
curve(x^n-3*x+2, -3,3)
}


Similarly, one can plot trigonometric functions:

par(mfrow=c(2,3))
curve(sin(x),-pi,pi) 
curve(cos(x),-pi,pi) 
curve(tan(x),-pi,pi) 
curve(1/tan(x),-pi,pi) 
curve(asin(x),-1,1)
curve(acos(x),-1,1)


The possibilities are endless. For complex functions, one can wrap the expression in an R function:

myFunction<- function(x) sin(cos(x)*exp(-x/2))
curve(myFunction, -15, 15, n=1000)


For SQL Server developers, what’s even greater is the possibility of using Management Studio for the same, though that’s a topic for another post.

 
Happy coding!

