20 June 2020

SQL Unloaded: When Queries Look Like Ducks

There’s an expression of abductive reasoning, “if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck”, used to identify something indefinite based on certain (definite) characteristics. When one looks at a query written by someone else, one tries to identify such characteristics. However, there are also situations in which simple queries are hard to interpret. For example, each time I see a SELECT DISTINCT in a query I ask myself whether the developer wanted to shortcut a GROUP BY, or just wanted to remove duplicated data from the output. Without looking at the data in detail it’s almost impossible to decide between the two scenarios.

Let’s consider the following simple example based on a set of orders sent over the interface between two systems. Because of poor design, the first position was caught by the interface logic before the full order was entered (the price on it is incorrect):

-- dropping the table
--DROP TABLE IF EXISTS dbo.T_Orders 

-- inserting the records 
SELECT *
INTO dbo.T_Orders 
FROM (VALUES ('P0001', 1,'20200618', '100001','S','black',20, 1.23)
, ('P0001', 1,'20200618', '100001','S','black',20, 1.22)
, ('P0001', 2,'20200618', '100001','S','black',5, 1.23)
, ('P0001', 3,'20200618', '100001','M','black',10, 1.24)
, ('P0002', 1,'20200619', '100001','S','black',5, 1.23)
, ('P0002', 2,'20200619', '100001','S','black',10, 1.22)) DAT (PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price)

-- adding a primary key
ALTER TABLE dbo.T_Orders 
ADD Id int identity(1,1) NOT NULL

-- reviewing the data
SELECT *
FROM dbo.T_Orders 
ORDER BY PurchId
, Pos

The following queries return the same results:

-- simple SELECT DISTINCT 
SELECT DISTINCT PurchId, Pos, ItemId, SizeId, ColorId, Qty
FROM dbo.T_Orders

-- simple GROUP BY
SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty
FROM dbo.T_Orders
GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty

Both queries hide the fact that a duplicate exists. However, a small change in the last query (adding a count) can highlight it, while the first query doesn't allow this flexibility:

-- simple GROUP BY with Count
SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty
, count(*) NoRecords
FROM dbo.T_Orders
GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty
-- HAVING count(*)>1

This small change can make an important difference when one needs to analyze the data, so at least from this perspective it might be worth writing the query as a GROUP BY from the beginning, especially in complex queries.

There are also nested SELECT DISTINCTs that attempt to refine the selection in successive steps. Such curiosities seldom make sense:

-- nested SELECT DISTINCT
SELECT DISTINCT PurchId, Pos, ItemId
FROM (
 SELECT DISTINCT PurchId, Pos, ItemId, SizeId, ColorId, Qty
 FROM dbo.T_Orders
) DAT

To remove the duplicates and perform a sum, one may encounter another curiosity of a query - the inner query removes the duplicates (some write it with a DISTINCT as well), while the sum is done in the outer query:

-- double grouping to remove duplicates
SELECT PurchId, Pos, ItemId, SizeId, ColorId
, SUM(Qty) Total
FROM (--inner query
 SELECT PurchId, Pos, ItemId, SizeId, ColorId, Qty
 FROM dbo.T_Orders
 GROUP BY PurchId, Pos, ItemId, SizeId, ColorId, Qty
) DAT
GROUP BY PurchId, Pos, ItemId, SizeId, ColorId

Even if such queries (miraculously) work in a given case, sooner or later they are predestined to fail. When doing operations on data whose quality is doubtful, one needs to select a strategy for removing the duplicates. The first step is to identify the attributes which make a record unique (e.g. PurchId and Pos); using a sorting criterion, one can then remove the duplicates via a ranking window function like Rank, Dense_Rank or Row_Number (with small exceptions their use is interchangeable). Supposing that the Id attribute determines the order in which the data were added, one can write a query similar to the following:

-- removing duplicates via ranking partition window function 
SELECT *
FROM (
 SELECT PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price
 , RANK() OVER (PARTITION BY PurchId, Pos ORDER BY Id DESC) Ranking 
 FROM dbo.T_Orders
) DAT 
WHERE Ranking=1

The problem is that such queries need to be repeated with each use. A simpler method is to run a regular check for duplicates and mark the duplicates as inactive or push them into a separate table. Deleting the duplicates can work as well, though this approach can hit back like a boomerang if not addressed correctly.
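
If deleting is the chosen strategy, the same ranking logic can be applied directly in a DELETE. The following is only a sketch, assuming as before that the Id column reflects the order in which the records were added and that the most recent record per PurchId and Pos is the one to keep:

-- deleting the older duplicates via a CTE (sketch)
WITH OrderedRecords AS (
 SELECT Id
 , ROW_NUMBER() OVER (PARTITION BY PurchId, Pos ORDER BY Id DESC) Ranking 
 FROM dbo.T_Orders
)
DELETE FROM OrderedRecords
WHERE Ranking > 1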

To identify the duplicates one of the following approaches can be used:

-- duplicates via grouping in inner query
SELECT DAT.*
, DUP.NoRecords 
FROM dbo.T_Orders DAT
     JOIN (-- duplicates
 SELECT PurchId, Pos
 , count(*) NoRecords
 FROM dbo.T_Orders DUP
 GROUP BY PurchId, Pos
 HAVING count(*)>1 
  ) DUP 
   ON DAT.PurchId = DUP.Purchid 
   AND DAT.Pos = DUP.Pos 

-- duplicates via grouping in correlated query 
SELECT DAT.*
, DUP.NoRecords 
FROM dbo.T_Orders DAT
     CROSS APPLY (-- duplicates
 SELECT count(*) NoRecords
 FROM dbo.T_Orders DUP
 WHERE DAT.PurchId = DUP.Purchid 
   AND DAT.Pos = DUP.Pos 
 GROUP BY PurchId, Pos
 HAVING count(*)>1 
  ) DUP 


-- duplicates via count within aggregate window function 
SELECT *
FROM (
 SELECT PurchId, Pos, PurchDate, ItemId, SizeId, ColorId, Qty, Price
 , count(*) OVER (PARTITION BY PurchId, Pos) NoRecords 
 FROM dbo.T_Orders
) DAT 
WHERE NoRecords>1


Happy coding!

16 June 2020

💼Project Management: Project Planning (Part IV: Planning Correctly Misunderstood IV)

Mismanagement

The relatively big number of Project Management (PM) methodologies in use nowadays makes it more and more difficult to understand the world of PM and to make oneself understood, in a context in which terminology is used in explanations that defy logic, in which people stubbornly insist that their understanding is the ultimate truth and that between white and black there are no shades of gray. Among all PM concepts, project planning seems to be the most misunderstood, probably because all the activities revolve around it, while each methodology brings its own planning philosophy. Each methodology comes with its own story, its own imaginative description of what a perfect plan is about.

Independently of the methodology used, there are three levels of planning. At the highest level, the strategic one, the project is put in the context of other strategic activities – other projects and initiatives, as well as business operations, all competing for the same financial and human resources. At this level the goals are identified and the basis for the successful execution of the project is laid, including establishing the ground for and integrating the main aspects of a project – risk, quality and communication. Here it is decided which projects will be considered, in which sequence, and how and when resources will be assigned.

A project plan is typically written and further executed with the tactical horizon in mind – the individual engagement of resources and actions, the actual means to reach the objectives set at the strategic level. It’s the level at which the actual project plan is detailed, where activities are sequenced and prioritized. Here each methodology has its own approach – whether the planning is done per deliverable, work package or any other unit used to partition the activities. It’s the level at which the various teams are coordinated toward specific targets. Thus the manageable unit is the team and not the individual, the deliverables or the work packages and not the individual tasks.

The operational level equates with the execution of a project’s activities. Even if the project manager oversees the activities, it’s the team’s duty to plan the activities with the set deliveries in mind. The project manager doesn’t need to know all the details, though he should be updated in a timely manner on the progress and on the eventual risks and opportunities that arise in each area. This requires continuous coordination at the vertical as well as the horizontal level.

The project manager typically oscillates between the strategic and tactical views of a project, while the operational level comes into view only when operational themes are escalated or further coordination is needed. Even if this delimitation is clear in big projects, in small projects the three levels melt into each other. Therefore the jump from small to big projects and vice versa can create issues when the approach is not tailored to the project’s size and its other characteristics.

Attempting to plan each activity in the project at the lowest level of detail obscures the view, the complexity of the project kicking back sooner or later. Maintaining such a detailed plan can become a waste of time in the long term. In extremis, a resource is used only to update a plan which can easily become obsolete by the time all activities have been reviewed. This doesn’t mean that the project plan doesn’t need to be updated regularly, though the pace can be decided based on each project’s specifics.

Therefore, one of the most important challenges in projects is finding the appropriate level of detail for planning, and there’s no general rule that works for all projects. Typically the choices alternate between work packages and deliverables.

15 June 2020

𖣯Strategic Management: Strategy Design (Part VIII: Quality Acceptance Criteria for Strategies and Concepts)

Strategic Management

Quality acceptance criteria for concept documents in general, and for strategies in particular, are not straightforward for everyone; behind the typical request for completeness hide other criteria like flexibility, robustness, predictability, implementability, specificity, being fact-based, time-boundedness, clarity, comprehensibility or measurability.

Flexible: once the strategy is approved, one must be able to change it as one sees fit, especially to address changes in risks, opportunities, goals and objectives, respectively the identification of new facts. A strategy implies a roadmap on how to arrive from the starting point to the destination. As the intermediary or final destinations change, the strategy must reflect these changes (and it’s useful to document these changes accordingly). This also implies that the strategy must be periodically reviewed, the new facts analyzed accordingly, and a decision made on whether they must become part of the strategy.

Robust: a strategy must handle variability (aka changes) and remain effective (producing the desired/intended results).

Predictable: the strategy needs to embrace the uncertainty and complexity of the world. Even if one can’t predict the future, the strategy must consider the changes foreseen in the industry and technologies. It’s not necessarily about imagining the future, even if this would be ideal, but about considering the current trends in the industry.

Implementable: starting with the goals and ending with the roadmap, the strategy must be realistic and address the organization’s current, respectively future capabilities. If the organization needs to acquire further capabilities, these need to be considered as well.

Specific: the strategy must address the issues, goals and objectives specific to the organization. As long as these are not reflected in it, the strategy is more likely to fail. It is true that many of the issues and goals considered can be met in other organizations, however there are always important aspects that need to be made explicit.

Fact-based: the strategy must be based on facts rooted in internal or market analysis; however, the strategy is not a research paper that treats the various concepts and findings in detail – definitions, summaries of the findings with their implications, and references to further literature are enough, if needed.

Time-bound: in contrast to other concepts, the strategy must specify the timeframe considered for its implementation. Typically a strategy addresses a time interval of 3 to 5 years, though depending on the case the interval may be contracted or dilated to consider business specifics. The strategy can further break down the roadmap per year or biannually.

Complete: the strategy must be complete with respect to the important topics it needs to address. It’s not only about filling out a template with information; the reader must get a good understanding of what the strategy is about. Complete doesn’t mean perfect, but providing a good enough description of the intent.

Clear: especially when there are competing interests, the strategy must describe what is in scope and what was left out. What was left out is as important as what is considered, including the various presumptions. A test of clarity is whether the why, how, who, when and by what means were adequately considered.

Comprehensible: the targeted audience must be able to read and understand the strategy at the appropriate level of detail or scope.

Measurable: the progress of a strategy must be measurable, and there are two aspects to consider. On one side the goals and objectives considered must be measurable by definition (see SMART criteria), while on the other, one must be able to track the progress and various factors related to it (e.g. implementation costs, impact of the changes made, etc.). Therefore, a strategy must include a set of metrics that will allow quantifying the mentioned aspects.

13 June 2020

🧭☯Business Intelligence: Self-Service BI (The Good, the Bad and the Ugly)

Business Intelligence

Self-Service BI (SSBI) is a form of Business Intelligence (BI) in which the users are enabled and empowered to explore and analyze the data, respectively build reports and visualizations on their own, with minimal IT support. 

The Good: Modern SSBI tools like Power BI, Tableau or Qlik Sense provide easy-to-use and rich functionality for data preparation, exploration, discovery, integration, modelling, visualization, and analysis. Moreover, they integrate the advances made in graphics, data storage and processing (e.g. in-memory processing, parallel processing), which allows addressing most data requirements. With just a few drag-and-drops users can display details, aggregate data, and identify trends and correlations in the data. Slice-and-dice or pass-through features allow navigating the data across dimensions and different levels of detail. In addition, the tools can leverage the existing data models available in data warehouses, data marts and other types of data repositories, including the rich set of open data available on the web.

With the right infrastructure, knowledge and skills, users can better understand and harness the business data, using them to address business questions; they can make faster and smarter decisions rooted in data. SSBI offers the potential of increasing the value data have for the organization, while improving the time to value for data products (data models, reports, visualizations). 

The Bad: In the 90s products like MS Excel or Access allowed users to build personal solutions to address gaps existing in processes and reporting. In some cases the personal solutions gained in importance, starting to be used by more users to the degree that they became essential for the business. Thus, these islands of data and knowledge started to become a nightmare for the IT department, as they were supposed to be kept alive and backed up. In addition, issues like data security, inefficient data processing, duplication of data and effort, and different versions of the truth urged the business to consolidate such solutions into standardized solutions. 

Without an adequate strategy and a certain control over the outcomes of the SSBI initiatives, organizations risk reaching the same deplorable state, with SSBI initiatives having the potential to bring more damage than the issues they can solve. Insufficient data quality and integration, unrealistic expectations, communication problems between business and IT, as well as insufficient training and support have the potential of making SSBI’s adoption more difficult.

The investment in adequate SSBI tool(s) might be small compared with the further changes that need to be done within the technical and logistical BI infrastructure. In addition, even if the role of IT is minimized, it doesn’t mean that IT needs to be left out of the picture. IT is still the owner of the IT infrastructure; it still needs to oversee the self-service processes and the flow of data, information and knowledge within the organization. From infrastructure to skillset, there are aspects of SSBI that need to be addressed accordingly. The BI professional can’t be replaced entirely, though the scope of his work may shift to address new types of challenges.

Not understanding that SSBI initiatives are iterative and explorative in nature, and require time to bring value, can put unnecessary pressure on those being part of them. Giving up on SSBI initiatives without attempting to address the issues and steer them in the right direction hinders an organization’s and its employees’ potential to grow, with all the implications deriving from it.

The Ugly: Despite the benefits SSBI can bring, its adoption within organizations remains low. Whether it’s a matter of the business’ confidence in its own forces, or of the inherent technical or logistical challenges, SSBI follows the BI trend of being a promise that seldom reaches its potential.

12 June 2020

🎡SSIS Project: Covid-19 Data

Introduction

I was exploring the Covid-19 data provided by the Johns Hopkins institute and, as usual, I stumbled upon several data issues. Therefore I thought I could share some scripts and ideas in a post. 

The data from the downloaded files cover the timeframe between the 22nd of January and the 10th of June 2020 and reflect the number of confirmed cases, deaths and recoveries per day, country and state. As it seems, the data are updated on a daily basis. Unfortunately, the data are spread over several files (one for each indicator), which makes their consumption more difficult than expected, though the challenges are minor. To begin with, I downloaded the following files from the above link:
time_series_covid19_recovered_global_narrow.csv
time_series_covid19_confirmed_global_narrow.csv
time_series_covid19_deaths_global_narrow.csv

Before attempting anything with the files, it's recommended to look over them to check whether column names are provided and the columns are properly named, how the columns and the rows are delimited, and whether other things can be observed during a first review. I needed, for example, to delete the second line from each file. 
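
If such a line nevertheless slips through and ends up in the table created during the import described below, it could also be removed afterwards with a check on a column expected to hold a specific data type. This is only a sketch, assuming the columns were loaded as text and the stray row is the only one without a valid date:

-- removing a stray header row after the import (sketch)
DELETE 
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE TRY_CAST([Date] as date) IS NULL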

Data Loading 

When starting this kind of project, it's useful to first check the project's feasibility in terms of whether the data are usable. Therefore, I first import the data via the 'Import Data' wizard, deciding later whether it makes sense to build a project for this. Right-click on the database in which you'd like to import the data, then from 'Tasks' choose 'Import Data' to use the wizard:


As data source we will consider the first data file, therefore in the 'Choose a Data Source' step, browse for the file:


The 'Locale' might appear different for you, therefore leave the default value; however, you'll have to make sure that the following values are like in the above screenshot. Look over the 'Columns' section to check whether the formatting was applied correctly. The preview offers a first overview of the data:



It's useful to review the default data types defined by the wizard for each column within the 'Advanced' section. For example, the first two fields could need more than the default of 50 characters. The length or the data type can be modified as needed: 


One can attempt in theory to get all the column definitions right from the beginning, though for the first attempt this is less important. In addition, without knowing the data's definition, there's always the possibility for something to go wrong. Personally, I prefer loading the data as text and doing the needed conversions later, if necessary. 

In the next step one can define the 'Destination', the database where the data will be loaded: 



After defining the source and destination, one needs to define the mapping between the two. One can consider going with the table definition provided by the wizard or modify the table's name directly in the wizard: 



By clicking on 'Edit Mappings' one can review the mappings. As it's a one-to-one import, one can in theory skip this step. However, if there are already data in the table, one can delete the rows from the destination table or append the new ones directly to the destination. The first radio button is selected though, as the table will be created as well: 



With this being done, one can run the package as it is - just click 'Next':



If everything is ok, each step from the package will appear in green, and in the end the number of records appears. 


It may look like many steps, though once one gets used to the wizard, the data can be loaded in less than 5 minutes.

Data Discovery

Once the data are loaded, it's time for data discovery - looking at the structure of the data and trying to understand their meaning. Before further using the data it's important to identify the attributes which uniquely identify a record within the dataset - in this case the combination between State, Country and Date. At least it should be unique; however, the second query returned some duplicates for the country 'Korea', respectively for 'Sint Eustatius and Saba', which seems to be a region within the 'Netherlands'.

-- looking at the data
SELECT top 1000 *
FROM [dbo].[time_series_covid19_recovered_global_narrow]


-- checking for duplicates 
SELECT [Province State]
, [Country Region]
, Date 
, count(*) NoRecords 
FROM [dbo].[time_series_covid19_recovered_global_narrow]
GROUP BY [Province State]
, [Country Region]
, Date
HAVING count(*)>1

-- checking for distinct values  
SELECT [Country Region]
, count(*) NoRecords 
, SUM(TRY_CAST(Value as int)) NoCases
FROM [dbo].[time_series_covid19_recovered_global_narrow]
GROUP BY [Country Region]
HAVING count(*)>1
ORDER BY [Country Region]

The two duplicates are caused by the fact that a comma was used within a country's, respectively a province's name, while the comma is also used as delimiter. (Therefore it's better to use a sign like "|" as delimiter, as the chances are small for the sign to be used anywhere else. Another solution would be to use quotes for alphanumeric values.) Fortunately, the problem can be easily fixed with an update which uses the dbo.CutLeft, respectively dbo.CutRight functions defined in a previous post. Therefore the functions need to be created first within the same database before running the scripts:

-- review the issue
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region]  LIKE '%int Eustatius and Saba%'

-- correct the data
UPDATE [dbo].[time_series_covid19_recovered_global_narrow]
SET [Province State] = [Province State] + ', ' + [Country Region] 
, [Country Region] = [Lat]
, [Lat] = [Long]
, [Long] = [Date]
, [Date] = [Value]
, [Value] = [ISO 3166-1 Alpha 3-Codes]
, [ISO 3166-1 Alpha 3-Codes] = [Region Code]
, [Region Code] = [Sub-region Code]
, [Sub-region Code] = dbo.CutLeft([Intermediate Region Code], ',', 0)
, [Intermediate Region Code] = dbo.CutRight([Intermediate Region Code], ',', 0)
 WHERE [Country Region]  LIKE '%Sint Eustatius and Saba%'

-- review data after correction
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region] LIKE '%int Eustatius and Saba%'

A similar solution is used for the second problem: 

-- review the issue
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region] LIKE '%Korea%'

-- correct the data
UPDATE [dbo].[time_series_covid19_recovered_global_narrow]
SET [Province State] = '' 
, [Country Region] = Replace( [Country Region] + ', ' + [Lat], '"', '')
, [Lat] = [Long]
, [Long] = [Date]
, [Date] = [Value]
, [Value] = [ISO 3166-1 Alpha 3-Codes]
, [ISO 3166-1 Alpha 3-Codes] = [Region Code]
, [Region Code] = [Sub-region Code]
, [Sub-region Code] = dbo.CutLeft([Intermediate Region Code], ',', 0)
, [Intermediate Region Code] = Replace(dbo.CutRight([Intermediate Region Code], ',', 0), '"', '')
WHERE [Country Region] LIKE '%Korea%'

-- review data after correction
SELECT *
FROM [dbo].[time_series_covid19_recovered_global_narrow]
WHERE [Country Region] LIKE '%Korea%'

With this, the data from the first file are ready to use. The data from the other two files can be loaded following the same steps as above. The tables not only have a similar structure, they also have the same issues. One can just replace the name of the tables in the scripts to correct the issues.   

Putting All Together  

As we need the data from all three tables for analysis, we could create a query that joins them together and encapsulate it within a view. The volume of data is negligible, and even without an index the query can perform acceptably. However, as soon as the volume of data increases, it's useful to have only one table for consumption. Independently of the approach considered, the query is similar. As we made sure that the key is unique across all the data, we could write the query as follows:

-- combining the data together 
SELECT [Province State]
, [Country Region]
, TRY_CAST([Lat] as decimal(10,4)) [Lat]
, TRY_CAST([Long] as decimal(10,4)) [Long]
, [Date]
, Sum(Confirmed) Confirmed
, Sum(Death) Death
, Sum(Recovered) Recovered
, [ISO 3166-1 Alpha 3-Codes]
, [Region Code]
, [Sub-region Code]
, [Intermediate Region Code]
INTO [dbo].[time_series_covid19_global_narrow]
FROM (
 SELECT [Province State]
 , [Country Region]
 , [Lat]
 , [Long]
 , [Date]
 , TRY_CAST([Value] as int) Confirmed
 , 0 Death 
 , 0 Recovered
 , [ISO 3166-1 Alpha 3-Codes]
 , [Region Code]
 , [Sub-region Code]
 , [Intermediate Region Code]
 FROM [dbo].[time_series_covid19_confirmed_global_narrow]
 UNION ALL
 SELECT [Province State]
 , [Country Region]
 , [Lat]
 , [Long]
 , [Date]
 , 0 Confirmed
 , TRY_CAST([Value] as int) Death 
 , 0 Recovered
 , [ISO 3166-1 Alpha 3-Codes]
 , [Region Code]
 , [Sub-region Code]
 , [Intermediate Region Code]
 FROM [dbo].[time_series_covid19_deaths_global_narrow]
 UNION ALL
 SELECT [Province State]
 , [Country Region]
 , [Lat]
 , [Long]
 , [Date]
 , 0 Confirmed
 , 0 Death 
 , TRY_CAST([Value] as int) Recovered
 , [ISO 3166-1 Alpha 3-Codes]
 , [Region Code]
 , [Sub-region Code]
 , [Intermediate Region Code]
 FROM [dbo].[time_series_covid19_recovered_global_narrow]
  ) DAT
GROUP BY [Province State]
, [Country Region]
, TRY_CAST([Lat] as decimal(10,4)) 
, TRY_CAST([Long] as decimal(10,4)) 
, [Date]
, [ISO 3166-1 Alpha 3-Codes]
, [Region Code]
, [Sub-region Code]
, [Intermediate Region Code]

-- reviewing the data
SELECT *
FROM [dbo].[time_series_covid19_global_narrow]

-- checking for duplicates 
SELECT [Province State]
, [Country Region]
, Date 
, count(*) NoRecords 
FROM [dbo].[time_series_covid19_global_narrow]
GROUP BY [Province State]
, [Country Region]
, Date
HAVING count(*)>1

If everything went smoothly, then the last query will return no records. As the latitude was given with slightly different values across the files, it was necessary to format and cut the values to 4 decimals. Before using the data a few adjustments are needed. As the data are cumulative, adding up to the previous date, it's useful to calculate the increase between two consecutive days. This can be done via the LAG window function:

-- preparing the data for analysis in a view
CREATE VIEW dbo.v_time_series_covid19
AS
SELECT * 
, LAG(Confirmed,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) PrevConfirmed
, LAG(Death,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) PrevDeath
, LAG(Recovered,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) PrevRecovered
, Confirmed-LAG(Confirmed,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) IncreaseConfirmed
, Death-LAG(Death,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) IncreaseDeath
, Recovered - LAG(Recovered,1,0) OVER (PARTITION BY [Province State] , [Country Region] ORDER BY Date) IncreaseRecovered
FROM [dbo].[time_series_covid19_global_narrow]

With this the data are ready for consumption:

-- sample query
SELECT *
FROM dbo.v_time_series_covid19
WHERE [Country Region]  LIKE '%China%'
  AND [Province State] LIKE '%Hubei%'
ORDER BY DATE

Of course, one can increase the value of this dataset by pulling in further information, like the size of the population, the average density, or any other factors that could have an impact on the propagation of the disease.
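
As a sketch of such an enrichment, assuming a hypothetical dbo.T_CountryPopulation table with a Country and a Population column, one could compute the number of confirmed cases per 100,000 inhabitants as follows:

-- confirmed cases per 100,000 inhabitants (sketch based on a hypothetical population table)
SELECT CVD.[Country Region]
, CVD.[Date]
, SUM(CVD.Confirmed) Confirmed
, CAST(SUM(CVD.Confirmed) * 100000.0 / NULLIF(POP.Population, 0) as decimal(10,2)) ConfirmedPer100k
FROM dbo.v_time_series_covid19 CVD
     JOIN dbo.T_CountryPopulation POP -- hypothetical table (Country, Population)
       ON CVD.[Country Region] = POP.Country
GROUP BY CVD.[Country Region]
, CVD.[Date]
, POP.Population
ORDER BY CVD.[Country Region]
, CVD.[Date]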

Instead of loading the data via the wizard, one can create an SSIS project instead; however, some of the corrections still need to be done manually, unless one includes the corrections in the package logic as well. 

Happy coding!

11 June 2020

🧭🪄☯Business Intelligence: SQL Server Reporting Services (The Good, the Bad and the Ugly)

Business Intelligence

SQL Server Reporting Services (SSRS) is the oldest solution from the modern Microsoft BI stack. Released as an add-on to SQL Server 2000, it allows covering most of an organization's reporting requirements, whether we talk about tables, matrices or crosstab displays, raw data, aggregations, KPIs or visualizations like charts, gauges, sparklines, tree maps or sunbursts.

The Good: Once you have a SQL query based on any standard data source (SQL Server, Oracle, SharePoint, OData, XML, etc.), it can be used in just a few minutes to create a report with the help of a wizard. Sure, adding the needed formatting, parameters, custom code, drilldown and drill-through functionality might take some effort, though in less than an hour you have a running report. The use of templates and a custom branding allows providing a common experience across the enterprise. 

The whole service is available once you have a SQL Server license, which makes SSRS a cost-effective tool. The shallow learning curve and the integration with SharePoint facilitate the development and consumption of reports.

With its pixel-accurate display of data, SSRS is ideal for printing business documents. This was probably one of the reasons why, with Microsoft Dynamics AX 2009, SSRS also became the main reporting platform for the further versions. One can use an AX 2009 class as source for a report, or directly use the base tables, which can increase the reports’ performance at the cost of reengineering the logic from AX 2009. With a few exceptions in the finance area, the reporting logic is easy to build.  

With SQL Server 2016 SSRS got an HTML5 rendering engine, while with SSRS 2017 it supports responsive web design. The integration of the SSRS and Power BI environments has the chance to further extend the value provided by this powerful combination, however it also depends on the direction in which Microsoft will develop this idea.   

The Bad: One of the important downsides of SSRS is that it doesn’t allow custom authentication. Even if some examples exist on the Web, it’s hard to understand Microsoft’s stubbornness in not providing this by design. 

Because SSRS still uses an older MS Office driver, it allows exporting only 65,536 records to Excel, which makes data consumption more complicated. In addition, the pixel-perfect rendering isn’t that perfect - the introduction of empty columns when exporting to Excel adds some unnecessary burden.

In total, the progress made by SSRS between the various releases is small when compared with the changes undergone by SQL Server. Even if the visualization capabilities cover most of the requests, it loses ground when compared with Power BI and similar visualization tools. 

The Ugly: SSRS, as the typical BI developer knows it, is different from the architecture frameworks provided when working with Business Central, respectively Dynamics 365 and CRM. Even if there are perhaps legitimate reasons, Microsoft failed to unite the three architectures into one flexible solution. Almost all the examples available on the Web target CRM, and frankly it’s hard to understand that. It feels like Microsoft wants to sabotage its own product?! What’s hard to understand is that besides SSRS and Power BI, Microsoft has several other reporting tools for Dynamics 365. Building reports for Business Central or Dynamics 365 requires certain skills, while the development time has increased considerably, SSRS thus losing the appeal it previously had and allowing other tools to join the landscape (e.g. electronic documents).

SSRS can’t be smoothly integrated with Office 365 Online, remaining mainly a solution for on-premises architectures. This can become a bottleneck when customers move to the cloud, the BI strategy eventually needing to be rethought as well. 

24 May 2020

🧊🎡☯Data Warehousing: SQL Server Integration Services (The Good, the Bad and the Ugly)

Data Warehousing

Microsoft SQL Server Integration Services (SSIS) is a platform for building (enterprise-level) data integration and data transformation solutions by using a rich set of built-in tasks and transformations, graphical tools for building packages, respectively a catalog for storing the packages. Formerly called Data Transformation Services (DTS), it was introduced with SQL Server 2000 and rebranded as SSIS with SQL Server 2005.

The Good: Since its introduction it was adopted by DBAs and (database) programmers because it allowed the import and export of data on the fly from and to SQL Server, flat files, other relational data sources, in fact any resource exposing an ODBC or OLEDB driver. The extract/load functionality was extended by a basic set of transformations, making DTS the ideal ETL tool for data warehousing and integrations. The data from multiple sources and targets could be processed in parallel or sequentially, the ETL logic being encapsulated in one or more packages that could be run manually or scheduled flexibly via the SQL Server Agent.

With SQL Server 2005 and further versions the SSIS framework was extended to support further data sources, including XML, CAML-based SharePoint lists, OData, Hadoop or Azure Blob storage. It allowed the storage of packages on the local file system or within the built-in catalog.

One could thus develop rich ETL functionality without writing a single line of code. In theory the packages could be run and modified also by non-IT users, which can be a plus in certain scenarios. On the other side, one could build custom packages programmatically from the beginning, and thus extend the available data processing logic as seemed fit, being able to use existing code and whole libraries embedded into the packages or run via DLL calls.

The Bad: Despite the rich functionality, a data pipeline usually has lower performance and is more difficult to troubleshoot compared with the built-in RDBMS functionality for data processing. Most, if not all, transformations can be handled more efficiently via SQL-based queries, as long as the data are available on the same SQL Server instance. In addition, SQL provides better code reuse, maintainability, chances for refactoring and scalability, and the solutions are easier to deploy. Therefore, one common practice consists in using SSIS only for import/export, the further logic being encapsulated into stored procedures and further database objects. This isn’t necessarily bad, on the contrary, though specific expertise is then needed to modify the code.

The Ugly: SSIS is in general suitable for data warehousing and integration solutions whose logic is ideally stable and well-defined. Therefore, SSIS is less suitable for ERP data migrations or similar tasks which, at least at the beginning, have an exploratory nature and an overwhelming complexity, multiple iterations being needed before the requirements are fully identified and understood. In extremis each iteration can involve a redesign, which can prove to be time-consuming. One could in theory attempt to understand all the data first, though this could mean starting the development late in the process, while the data for testing are required much earlier. One can still use SSIS for specific tasks, though implementing a whole solution could imply certain challenges that otherwise could have been avoided.

SSIS is not suitable for real-time complex data integrations which require the processing of a considerable amount of data, when specific architectures like SOA, RESTful calls or other solutions could be more efficient. When not adequately implemented, a data integration can lead to more problems than it solves. The best example is the increase in execution time with the volume of data, which can easily lead to time-outs and the locking of data.
