21 May 2020

💎SQL Reloaded: Splitting a String (Before and After)

Starting with SQL Server 2016, Microsoft introduced the STRING_SPLIT table-valued function, which splits a string into rows of substrings based on a specified separator character (e.g. “:”).


-- splitting a string (SQL Server 2016+)
SELECT *
FROM STRING_SPLIT('100001::M:black:', ':') 

Table-valued functions allowed implementing the same functionality also in earlier versions, either by looping through the string or by using a common table expression. Here’s an implementation based on a loop (modified from a previous version to include an index):


-- split table-valued function with an index
CREATE FUNCTION [dbo].[SplitListWithIndex]( 
  @ListValues varchar(500) 
, @Delimiter char(1)) 
RETURNS @Temp TABLE(
  Ind int
, Value varchar(50)) 
AS 
BEGIN 
 DECLARE @Index int 
 DECLARE @Length int 
 DECLARE @Ind int 

 SET @Index = CharIndex(@Delimiter, @ListValues) 
 SET @Length = Len(@ListValues) - @Index 
 SET @Ind = 1
   
 WHILE @Index > 0 -- while a delimiter was found
 BEGIN 
  INSERT @Temp 
  VALUES(@Ind, Substring(@ListValues, 0, @Index)) 

  SET @ListValues = Substring(@ListValues, @Index+1, @Length) 
  SET @Index = CharIndex(@Delimiter, @ListValues) 
  SET @Length = @Length - @Index 
  SET @Ind = @Ind + 1
 END  
   
 INSERT @Temp 
 VALUES(@Ind, @ListValues) 
RETURN 
END  
GO

The function can be called in code the same way as STRING_SPLIT:

-- splitting a string (SQL Server 2000+)
SELECT *
FROM dbo.SplitListWithIndex('100001::M:black:', ':') 

The two functions are helpful when a column stores delimited values. It’s the case of Dynamics 365, which stores the SKU (Stock Keeping Unit) – the Product Number together with its dimensions (ItemId, ConfigId, InventSizeId, InventColorId, StyleId) – delimited by “:”, like in the above examples. Therefore, to parse the records one could write code like this:


-- parsing delimited values (SQL Server 2008+)
SELECT DAT.ProductNumber
, Max(CASE WHEN LV.Ind = 1 THEN LV.Value END) ItemId
, Max(CASE WHEN LV.Ind = 2 THEN LV.Value END) Configid
, Max(CASE WHEN LV.Ind = 3 THEN LV.Value END) InventSizeId
, Max(CASE WHEN LV.Ind = 4 THEN LV.Value END) InventColorId
, Max(CASE WHEN LV.Ind = 5 THEN LV.Value END) StyleId
FROM ( VALUES ('100001::S:black:')
 , ('100001::M:black:')
 , ('100001::L:black:')
 , ('100001::XL:black:')
 , ('100001::S:white:')
 , ('100001::M:white:')
 , ('100001::L:white:')
 , ('100001::XL:white:')
) DAT (ProductNumber)
CROSS APPLY dbo.SplitListWithIndex(DAT.ProductNumber, ':') LV
GROUP BY DAT.ProductNumber
ORDER BY DAT.ProductNumber

Similar output can be obtained via STRING_SPLIT with the help of the ROW_NUMBER ranking window function, introduced with SQL Server 2005:

-- parsing delimited values (SQL Server 2016+)
SELECT LV.ProductNumber
, Max(CASE WHEN LV.Ind = 1 THEN LV.Value END) ItemId
, Max(CASE WHEN LV.Ind = 2 THEN LV.Value END) Configid
, Max(CASE WHEN LV.Ind = 3 THEN LV.Value END) InventSizeId
, Max(CASE WHEN LV.Ind = 4 THEN LV.Value END) InventColorId
, Max(CASE WHEN LV.Ind = 5 THEN LV.Value END) StyleId
FROM (
 SELECT DAT.ProductNumber
 , XT.VALUE
 , ROW_NUMBER() OVER (PARTITION BY DAT.ProductNumber ORDER BY DAT.ProductNumber) Ind
 FROM ( VALUES ('100001::S:black:')
  , ('100001::M:black:')
  , ('100001::L:black:')
  , ('100001::XL:black:')
  , ('100001::S:white:')
  , ('100001::M:white:')
  , ('100001::L:white:')
  , ('100001::XL:white:')
 ) DAT (ProductNumber)
 CROSS APPLY STRING_SPLIT(DAT.ProductNumber, ':') XT
) LV
GROUP BY LV.ProductNumber
ORDER BY LV.ProductNumber

As can be seen, the introduction of an index into the dbo.SplitListWithIndex function simplified the code, making the use of a ranking window function unnecessary. It would be useful for STRING_SPLIT to provide the same, as this type of processing is pretty common (moreover, STRING_SPLIT doesn’t guarantee the order of its output rows, so an index would also make the parsing deterministic).
Here’s another example, based on the PIVOT clause, also introduced with SQL Server 2005:


-- parsing delimited values (SQL Server 2017+)
SELECT P.ProductNumber
, Cast(Trim([1]) as nvarchar(20)) ItemId
, Cast(Trim([2]) as nvarchar(20)) ConfigId
, Cast(Trim([3]) as nvarchar(20)) InventSizeid
, Cast(Trim([4]) as nvarchar(20)) InventColorId
, Cast(Trim([5]) as nvarchar(20)) StyleId
FROM (
 SELECT DAT.ProductNumber
 , XT.VALUE
 , ROW_NUMBER() OVER (PARTITION BY DAT.ProductNumber ORDER BY DAT.ProductNumber) Ranking
 FROM ( VALUES ('100001::S:black:')
  , ('100001::M:black:')
  , ('100001::L:black:')
  , ('100001::XL:black:')
  , ('100001::S:white:')
  , ('100001::M:white:')
  , ('100001::L:white:')
  , ('100001::XL:white:')
 ) DAT (ProductNumber)
 CROSS APPLY STRING_SPLIT(DAT.ProductNumber, ':') XT
) DAT
PIVOT (MAX(DAT.[VALUE]) FOR DAT.Ranking IN ([1],[2],[3],[4],[5])) P

Happy coding!

📦Data Migrations (DM): In-house Built Solutions (Part III: The Data Preparation Layer)

Data Migration
Data Migrations Series

Once the source data (including the data needed for enrichment) have been made available in the migration database, one can model the source entities by encapsulating the logic in views or table-valued functions (TVFs). This enables maintainability (by providing better visibility over the transformations), reuse (various validations are necessary), performance (by taking advantage of the RDBMS’ native functionality) and flexibility in changing the code. The objects thus created can be used in a new set of similar objects that contain the mapping logic between the source and target data models. One thus attempts to prepare the data as needed by the target system. 
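
For example, a source entity could be encapsulated in a view along the following lines (a minimal sketch, the table and attribute names being purely hypothetical):

-- modelling a source entity (hypothetical example)
CREATE VIEW dbo.vSourceCustomers
AS
SELECT CST.CustomerId
, CST.CustomerName
, CST.VATNumber
, CSG.GroupName CustomerGroup -- dropdown value resolved via a join
FROM dbo.Raw_Customers CST
     LEFT JOIN dbo.Raw_CustomerGroups CSG
       ON CST.GroupId = CSG.GroupId
WHERE CST.IsActive = 1 -- only the records in scope
GO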

As each target entity is modelled, it’s useful to dump the resulting data into tables, which are then used as source for the further logic instead of using the views or TVFs directly. This allows keeping a copy of the data, performing a range of validations and, most importantly, can provide better performance, as indexes can be built on the tables. In addition, one can further manipulate the data in the tables as requested, e.g. by including information that becomes available only later (e.g. attributes from the target system), or for testing or correcting the data without affecting the built logic. For the same reasons, such tables can be used in intermediate steps of the migration, including when modelling the source data entities. However, one should avoid their excessive use, as this can complicate the architecture unnecessarily. 
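
Such a dump could look as follows (a sketch, with hypothetical object names):

-- dumping a modelled entity into a table (hypothetical example)
SELECT PRO.*
INTO dbo.Map_Products
FROM dbo.vTargetProducts PRO

-- index supporting the further logic
CREATE INDEX IX_Map_Products_ProductNumber
ON dbo.Map_Products (ProductNumber)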

The last step in the preparation layer is to prepare queries which select only the attributes needed by the target system and include the additional formatting. It’s generally recommended to detach the formatting from the other transformations, as this approach provides better flexibility in addressing the migration requirements. The data can afterwards be exported manually or via an automated job, the latter approach being recommended especially when the data need to be partitioned. 
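
A sketch of such a final query, with illustrative attribute names and formatting:

-- preparing the data for export (hypothetical example)
SELECT Cast(PRO.ProductNumber as nvarchar(20)) ProductNumber
, Cast(Trim(PRO.ProductName) as nvarchar(60)) ProductName
, Format(PRO.ReleaseDate, 'yyyy-MM-dd') ReleaseDate
, Cast(PRO.StandardCost as numeric(18,2)) StandardCost
FROM dbo.Map_Products PRO
ORDER BY PRO.ProductNumber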

At this stage, after validating the data, they can be imported into the target system via the available mechanisms. In theory, this is where the boundary of the migration logic lies; however, this layer can be extended for data validation purposes. It would be helpful, for example, to assure that the data imported into the target fully reflect the prepared data. It can happen during the import that data are truncated or incorrectly imported (wrong attribute, values changed or incorrectly mapped), or even that whole records are not imported, with impact on data consistency. 

After importing the data into the target system, one can bring the migrated data back into the migration database via ETL packages and build queries which match the data at attribute level. This step may seem redundant, though it’s a way to assure that the migration occurred according to the expectations and thus minimize the later surprises. If not for all entities, this type of import might be the easiest solution for bringing data back into the logic (e.g. when identity values need to be mapped into the logic after migrating an entity).
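
Such a comparison could be sketched as follows, assuming the migrated Products were imported back into a dbo.Imp_Products table (the names are hypothetical):

-- matching prepared vs. migrated data at attribute level (hypothetical example)
SELECT Coalesce(PRE.ProductNumber, IMP.ProductNumber) ProductNumber
, CASE 
    WHEN IMP.ProductNumber IS NULL THEN 'missing in target'
    WHEN PRE.ProductNumber IS NULL THEN 'extra in target'
    ELSE 'attribute difference' END MatchStatus
FROM dbo.Map_Products PRE
     FULL OUTER JOIN dbo.Imp_Products IMP
       ON PRE.ProductNumber = IMP.ProductNumber
WHERE IMP.ProductNumber IS NULL
   OR PRE.ProductNumber IS NULL
   OR IsNull(PRE.ProductName, '') <> IsNull(IMP.ProductName, '')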

It can also be helpful to import the tables for dropdown values (typically parameters), to assure that the values used for parameterizing the system were built into the migration logic as expected. It may sound surprising, but not all systems perform such checks on import, or the checks were disabled for other reasons. 

In data migrations it is recommended to assure the data’s internal consistency by design. Even if the various validations for uniqueness, completeness, consistency, conformity, timeliness, referential integrity or even accuracy might seem redundant or involve extra work, in the long run they pay off, as they allow trapping the issues early in the process.

📦Data Migrations (DM): In-house Built Solutions (Part II: The Import Layer)

Data Migration
Data Migrations Series

A data migration involves the mapping between two data models at (data) entity level, where an entity is a data abstraction modelling a business entity (e.g. Products, Vendors, Customers, Sales Orders, Purchase Orders, etc.). Thus, the Products business entity from the source will be migrated to a similar entity in the target. Ideally, the work would be simplified if the two models provided direct access to the data through entities. Unfortunately, this is seldom the case, the entities being normalized and thus broken into several tables, with important structural differences. 

Therefore, the first step within a DM is identifying the business entities that make up its scope on the source and target side, and providing a mapping between their attributes which defines how the data will flow between source and target. 

In theory, the source entities could be defined directly in the source system with the help of views, if they are not already available. The problem with this approach is that the base data change, which can easily lead to inconsistencies between the various steps of the migration. For example, records are added, deleted or inactivated, or certain attributes are changed, which can easily make troubleshooting and validation a nightmare. 

The easiest way to address this is by assuring that the data change only when actually needed. One thus needs to create a snapshot of the data and work with it. Snapshots can however become costly for the performance of the source database, as they involve an additional maintenance overhead. Another solution is to take the snapshot via a backup, or by copying the data via ETL functionality into another database (aka the migration database). Considering that the data in scope make up only a small subset, a backup is usually costly in terms of storage space and time, and it is not always possible to take a backup when needed. 

An ETL-based solution provides acceptable performance and is flexible enough to address all important types of requirements. The data can be accessed directly from the source (pull mechanism) or, when direct access is an issue, they can be pushed to the migration database (push mechanism) or made available for load at a given location and then imported into the migration database (hybrid mechanism). There’s also the possibility of integrating the migration database when a publisher/subscriber mechanism is in place; however, such solutions raise other types of issues. 

One can import the tables 1:1 from the source for the entities in scope, attempt to model the entities directly within the ETL jobs, or find a solution in between (e.g. import the base tables while also joining the dropdown tables). The latter seems to provide the best approach because it minimizes the number of tables to be imported while still reflecting the data structures from the source. An entity-based import addresses the first but not the second aspect, though depending on the requirements it can work as well. 

In Data Warehousing (DW) there’s the practice of loading the data into staging tables with no constraints on them, and only when the load is complete moving the data into the base tables, which are used as source for the further processing. This approach assures that the data are loaded completely and that the unavailability of the base tables is limited. In contrast to DW solutions, ideally no transformations are performed on the data, as they should reflect the quality characteristics of the source. It’s best to keep the data extraction, respectively the ETL jobs, as simple as possible and to resist building the migration logic into this layer already. 
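
A minimal sketch of such a staged load, with hypothetical database and table names:

-- loading via a staging table (hypothetical example)
TRUNCATE TABLE dbo.Stg_Customers -- staging table without constraints

INSERT INTO dbo.Stg_Customers (CustomerId, CustomerName, GroupId)
SELECT CustomerId, CustomerName, GroupId
FROM SourceDb.dbo.CustTable -- pull mechanism: direct read from the source

-- move the data into the base table only once the load is complete
TRUNCATE TABLE dbo.Raw_Customers

INSERT INTO dbo.Raw_Customers (CustomerId, CustomerName, GroupId)
SELECT CustomerId, CustomerName, GroupId
FROM dbo.Stg_Customers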

💼Project Management: Project Planning (Part III: Planning Correctly Misunderstood III)

Mismanagement

One of the most misunderstood topics in Project Management seems to be the one of planning, and this probably because everyone has a good idea of what it means to plan an activity – we do it daily and most of the time (hopefully) we hit the bull’s-eye (or have the impression we did). You must do this and that, you have that dependency, you must coordinate with a few people, you must first reach that milestone before going further, you do one step at a time, and so on. It’s pretty easy, isn’t it?

From a bird’s eye view, project planning is like planning any other activity, though there are several important differences. The most important one is scale – the number of activities and resources involved, the level of coordination and communication as well as the quality with which they occur, the level of uncertainty and control, respectively manageability. All these create a complexity that is hardly manageable by just one person. 

Another difference is the level of detail needed for the planning and the reachability of the targets. Some believe that the plan needs to be done down to the lowest level of detail, which, even if possible, can prove to be an impediment to planning. A project’s environment shares some important characteristics with a battlefield in terms of the complexity of the interactions, their dynamics and the logistical requirements. Within an army’s structure there are levels of organization that require different mindsets and levels of planning. A general thinks primarily at the strategic level, at which troops and actions are seen as aggregations at the level of abstraction that makes their organization and planning manageable. The strategy is devised however in collaboration with other generals and the upper structures, and, once the strategic goals are defined, the general must devise the tactics together with his immediate subordinates. In theory the project manager must regard the project from the same perspective. This results in three levels of planning – strategic, done with the upper management; tactical, done with the team members; respectively logistical, done within the team. That’s a way of breaking down the complexity and dividing the responsibilities within the project. 

A project’s final destination seems to have the character of a wish list more or less anchored in reality. From a technical point of view the target can be achievable, though in big projects the most important challenges are of organizational nature – being able to allocate and coordinate effectively the resources as needed by the project. The wish-like character is reflected also in the cost, scope, time triangle in respect to the expected quality – at some point one is forced to choose between two of them. On the other side, there’s the tendency to see the targets and milestones as fixed, with little room for deviation. One can easily forget that a strategic plan’s purpose is to set the objectives and identify the challenges and the possible lines of action, while a tactical plan’s objective is to devise the means to reach the objectives. Bringing everything together can easily obscure the view and, in extremis, the plan loses its actuality as soon as it was created (and approved). 

The most confusing aspect is probably a plan’s adherence to a given methodology – dicing a project, and thus a plan, to fit a methodology by blindly following the rules and principles it imposes, instead of fitting the methodology to the project. Besides the fact that methodologies are best practices, but not necessarily good practices for what fits an organization, they tend to be either too general, specifying the what and not the how, or too restrictive (as interpreted). 

20 May 2020

💼Project Management: Project Planning (Part V: Some Thoughts on Planning II)

Mismanagement

A project’s dependency on the resources’ (average) utilization time (UT) and on the quality expectations expressed as a quality factor (QF) doesn’t come as a surprise, as hopefully one is acquainted with the project triangle, which reflects the dependency between scope, cost and time in respect to quality. Even if this dependency is intuitive, it’s difficult to express it in numbers and to study the way it affects the project. That was the purpose of the model built previously.
From the respective model there are a few things to ponder. First, it’s a utopia to plan with 90% UT, unless one is really sure that the resources have enough work to bring the idle time close to zero. A single person can achieve maybe a 90% UT if he works alone on the project, though even then there are phases in which the input or feedback from other people is necessary. The more people involved in the project and the higher the dependency between their activities, the higher the chances that the (average) UT will decrease considerably.
When in addition there’s also a geographical or organizational boundary between team members, the UT will decrease even more. In consequence, in big projects like ERP implementations the team members from customer and vendor side are allocated fully to the project; when this is not possible, then on the vendor side the consultants need to be involved in at least two projects to cover the idle time. Of course, with good planning, communication, and awareness of the work ahead one can try minimizing the idle time, though that’s less likely to happen.
Probably, a better idea would be to plan with 75% or even 60% UT, though the values depend on the team’s experience in handling similar projects. If the team members are involved also in operational activities or other projects, then a 50% UT is more realistic.
Secondly, in the previous post the 80%-20% rule was considered in respect to quality, a rule which applies to the various deliverables, though the rule has a punctual character. Taken on average, the rule is somewhat attenuated. Therefore, the model considered a spread of factors from 1 to 2, with a step of 0,25 for each 5% quality increase. It remains to be proven whether the values are realistic and how much they depend on a project’s characteristics.
On the other side, quality is difficult to quantify, and 100% quality is hypothetical. One discusses in theory about 3 sigma (the equivalent of 93,3% accuracy) or 4 sigma (99,4% accuracy) in respect to the number of errors found in the code, though from there on everything is fuzzy. In software projects each decision has the potential of leading to an error, and there’s a lot of room for interpretation as long as there’s no fixed basis against which to compare the deviations. One needs precise and correct specifications for that.
I think that one should target in a first phase 80% quality (on average) and build further from there, trying to improve the quality iteratively as the project goes on and lessons are learned. In other words, a project plan, a concept or a design document doesn’t need to be perfect from the beginning, but should be good enough to allow working with it. One can detail them as the project progresses, and hopefully their quality will converge to a value that is acceptable for the business.
Thirdly, in case a planning tool was used, one can use the model backwards to roughly check the timeline’s feasibility, dividing the planned effort by the estimated effort and the number of resources involved to identify the implied utilization time.
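
As a rough sketch of such a check, the model’s relation (actual duration = estimated effort * QF / UT / number of resources) can be solved for the utilization time; the values below are only illustrative:

-- implied utilization time for a planned duration (illustrative values)
DECLARE @EstimatedEffort numeric(18,2) = 100 -- estimated effort in days
DECLARE @QF numeric(18,2) = 1.25 -- quality factor (85% quality)
DECLARE @PlannedDuration numeric(18,2) = 60 -- planned duration in days
DECLARE @Resources int = 4 -- number of resources involved

SELECT Cast(@EstimatedEffort*@QF/(@PlannedDuration*@Resources)*100 as numeric(18,1)) [implied UT %]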

Query Patterns in SQL Server: Matrix Representation

In a previous post on Project Management (PM) I used Excel to represent the data in a matrix format. The same output can be obtained with SQL, in a simple query based on a CROSS JOIN between two numeric intervals created via a common table expression (CTE) within a table-valued function. For this purpose I used an old piece of code which lists the values within an interval given a step – the typical loop with a specified step.
I made mainly two changes to the old example – I defined the start, end and step as numeric, and added an index to the output table. Here’s the modified function:

-- SQL Server 2005+: Numeric Interval 
CREATE FUNCTION dbo.fGetNInterval( 
  @Start numeric(18,2)
, @End numeric(18,2)
, @Step numeric(18,2)) 
RETURNS @Interval TABLE( 
  Ind int PRIMARY KEY
, Val numeric(18,2)) 
AS 
BEGIN 
 DECLARE @Sign smallint 
 SET @Sign = IsNull(NullIf(Sign(@End-@Start), 0), 1); 
 WITH CTE(Ind, Val) 
 AS( 
   SELECT Cast(1 as int) Ind
   , Cast(COALESCE(@Start, @End, 0) as numeric(18,2)) Val 
   UNION ALL 
   SELECT cast(Ind+1 as int) Ind
   , Cast(CTE.Val + @Step * @Sign as numeric(18,2)) Val 
   FROM CTE 
   WHERE (@End-CTE.Val)*@Sign-@Step>=0 
 ) 
 INSERT @Interval 
 SELECT Ind
 , Val 
 FROM CTE 
 OPTION(MAXRECURSION 0) 
 RETURN 
END 
GO

As can be seen, the function can be used to step forward as well as backward:

--descending sequence 
SELECT UT.Val, 100/UT.Val
FROM dbo.fGetNInterval(100, 25, 5) UT

-- ascending sequence 
SELECT *
FROM dbo.fGetNInterval(1, 2, 0.25)
Based on these two sequences the matrix query can be written as follows:

-- matrix representation
DECLARE @EE as numeric(18,2) = 100
SELECT UT.Val [UT %]
, Max(CASE WHEN QF.Ind = 1 THEN Cast(QF.Val*@EE*100/UT.Val as numeric(18,1)) END) [80%]
, Max(CASE WHEN QF.Ind = 2 THEN Cast(QF.Val*@EE*100/UT.Val as numeric(18,1)) END) [85%]
, Max(CASE WHEN QF.Ind = 3 THEN Cast(QF.Val*@EE*100/UT.Val as numeric(18,1)) END) [90%]
, Max(CASE WHEN QF.Ind = 4 THEN Cast(QF.Val*@EE*100/UT.Val as numeric(18,1)) END) [95%]
, Max(CASE WHEN QF.Ind = 5 THEN Cast(QF.Val*@EE*100/UT.Val as numeric(18,1)) END) [100%]
FROM dbo.fGetNInterval(100, 25, 5) UT
, dbo.fGetNInterval(1, 2, 0.25) QF
GROUP BY UT.Ind
, UT.Val

Here’s the output:

Note:
The indexes were used to display the values in the needed format. The table-valued function could also be created without the respective indexes; however, this would have complicated the query.

19 May 2020

💼Project Management: Project Planning (Part IV: Some Thoughts on Planning I)

Mismanagement

One of the issues in Project Management (PM) planning is that the planner idealizes a resource and the activities performed by it much like a machine. Unlike machines, whose uptime can approach 100%, a human resource can work at most 90% of the available time (aka utilization time), the remaining 10% being typically associated with interruptions – internal emails and meetings, casual communications, pauses, etc. For resources split between projects or operations the utilization time can be at most 70%, however a realistic value is in general between 40% and 60% on average. What does this mean for a project?
So, if a resource has a volume of work W, the amount of time needed to complete the work would be at best W/UT, where UT is the utilization time of the respective resource. “At best” because in each project there is additional idle time resulting from waste-related activities – waiting for sign-off, for information, for another resource to complete their work, etc.
The utilization time is not the only factor to consider. Depending on the case, the delivered work may reach on average maybe 80% of the expected quality. This applies to documentation and concepts as well as to written code, bug testing and other project activities. To get in the range of 100%, one will more likely need 4 times the effort associated with reaching 80% of the expected quality, though this value depends also on people’s professionalism and on the degree to which the requirements were understood and are achievable. Therefore, the values given here can be regarded as “boundary” values.
Let’s consider a quality factor (QF) which has a value of 1 for 80%, with an increase of 0,25 for each 5% of quality increase. Thus, with an initial effort estimation of 100 days, this is how the resulting effort changes for various UT and QF values:

Considering that a project can target between 60% and 95% UT, and between 80% and 95% quality, for an initial estimation of 100 days the actual project duration can range between 117 and 292 days, the values toward the upper bound being the more realistic ones.
The model is simplistic, as it doesn’t reflect the nonlinear aspect of the factors involved and the dependencies existing between them. It also doesn’t reflect an organization’s maturity in handling projects and the tasks involved. However, it can be used to increase the awareness of how the utilization time and the expected quality can affect a project’s timeline, and to check whether one’s planning is realistic.
For example, at the project’s start one can target a UT of 70% and a quality of 85%, which for 100 days of estimated effort will result in about 178 days of actual effort. Now, dividing the value by the number of resources involved, e.g. 4, it results that the project could be finished in about 44,5 days. This value can then be compared with the actual plan in which the activities are listed.
During the project it would be useful to look at how the UT changed and by how much, to understand the impact the change has on the project. For example, a decrease of 5% in utilization time can delay the project by 2,5 days, which is not much, though for a project of 1000 days we talk already about one month. Similarly, it is helpful to check how much the quality deviated from the expectation, because a decrease in quality by 5% can result in an additional effort of 8 days, which for 1000 days would mean almost 4 months of delay.
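
The calculation behind the above example can be sketched in SQL as follows (using the values from the text):

-- actual effort and duration for a given UT and QF (values from the example)
DECLARE @EstimatedEffort numeric(18,2) = 100 -- initial estimation in days
DECLARE @UT numeric(18,2) = 0.70 -- utilization time (70%)
DECLARE @QF numeric(18,2) = 1.25 -- quality factor for 85% quality
DECLARE @Resources int = 4 -- number of resources involved

SELECT Cast(@EstimatedEffort*@QF/@UT as numeric(18,1)) [actual effort (days)]
, Cast(@EstimatedEffort*@QF/@UT/@Resources as numeric(18,1)) [duration (days)]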

📦Data Migrations (DM): In-house Built Solutions (Part I - An Introduction)

Data Migration
Data Migrations Series

A Data Migration (DM) is the move of all or a subset of the data available from one or more system(s) into other system(s). For ERP systems a DM can be achieved by having a database in between, which allows the import and combination of data from various sources, the further processing of the data, respectively the preparation of the data as per the target system(s)’ needs. 

At a deeper level, one needs to model the entities from the source and target systems at a level that allows easier processing, which comes down to removing and adding data, and making mappings and transformations between the input and output models. In other words, a set of entities is transformed into another set of entities, while at a much lower level attributes from the source system must be mapped to the target system. 

SQL scripting provides enough flexibility in modelling the source or target system(s), in changing the data and logic on the fly, in building additional functionality, in reusing logic or cleaning the data. It works with big volumes of structured data, while for unstructured data one can still combine the power of SQL with ad-hoc or even NoSQL processing. It all sounds easy, doesn’t it? Then where does the complexity people talk about come from?

Knowing SQL and the features of a (relational) database (the how) is only a small part of all that is needed. Building a DM solution implies knowledge of the source data models and of the data behind them (the what), knowing the target (the where), how and when to import the data, why the data must be made available, in what way the data impact the target system(s), respectively who needs to be involved. The more of these aspects are known, the higher the chances for the DM to succeed. 

When there’s a gap in the needed knowledge, the knowledge must be acquired from the business and/or consultants, by investing time in data discovery or similar activities, in experimenting and testing. The more uncertainty, the more iterations are needed to achieve the degree of certainty (aka quality) needed. One needs at least 3-4 iterations to migrate the data – to build the concept, to fully test the migration, respectively to get the sign-off in a test environment and to do the migration into production. 

Another type of complexity derives from the interdependencies existing between the DM and vital activities like system parameterization and data cleaning. A system’s parameterization defines how the system behaves, an important number of the parameterization values being reflected also in the data prepared for the DM. Wrong parameterization values can be trapped in the process during the various iterations; however, poor data quality will be reflected in the system after Go-Live and will haunt the business in the long term, unless addressed correctly in the project. 

All these aspects are reflected to some degree also in the DM’s architecture (e.g. building restart points, including data validation and automatic cleaning) and process (e.g. sequencing of steps). From all the effort involved in a DM, maybe 20-30% needs to be invested in building the solution. These 20-30% can in theory be reduced by using specialized DM tools, though the advantage provided by such tools can be illusory, especially when further limitations are involved. On the other side, such tools may provide functionality that addresses the other 70-80%. 

As usual in IT, there’s a trade-off between advantages and disadvantages. An in-house built DM solution may provide the needed flexibility at the cost of a small time investment. Whether it pays off is something one has to prove by walking this path. 

17 May 2020

Query Patterns in SQL Server: Window Functions

For a long time aggregate functions were the only tool for statistical purposes available in raw SQL scripting. Their limitations became more evident with the introduction of window functions, which allow applying the aggregates over a defined partition. Besides aggregate window functions, the use of ranking, respectively value window functions opened the door to a new set of techniques. Here are a few examples based on the tables defined in a previous post.

Ranking window functions allow ranking a record within the whole dataset or within a partition by providing a sorting key:

-- Ranking window functions (RANK, DENSE_RANK, ROW_NUMBER)
SELECT A.CourseId 
, C.CourseName 
, A.StudentId 
, S.StudentName 
, A.StartDate 
, A.EndDate 
, RANK() OVER (ORDER BY A.Mark) [Rank over whole set]
, DENSE_RANK() OVER (ORDER BY A.Mark) [Dense Rank over whole set]
, ROW_NUMBER() OVER (ORDER BY A.Mark) [Row Number over whole set]
--, RANK() OVER (PARTITION BY A.CourseId ORDER BY A.StartDate) [Rank over Course]
--, RANK() OVER (PARTITION BY A.StudentId ORDER BY A.StartDate) [Rank over Student]
--, DENSE_RANK() OVER (PARTITION BY A.CourseId ORDER BY A.StartDate) [Dense Rank over Course]
--, DENSE_RANK() OVER (PARTITION BY A.StudentId ORDER BY A.StartDate) [Dense Rank over Student]
--, ROW_NUMBER() OVER (PARTITION BY A.CourseId ORDER BY A.StartDate) [Row Number over Course]
--, ROW_NUMBER() OVER (PARTITION BY A.StudentId ORDER BY A.StartDate) [Row Number over Student]
FROM dbo.T_Allocations A
     JOIN dbo.T_Courses C
       ON A.CourseId = C.CourseId 
     JOIN dbo.T_Students S
       ON A.StudentId = S.StudentId 
ORDER BY C.CourseName 
, S.StudentName 

Being able to rank records makes it easier to select the first, respectively the last records from a dataset or a given partition:

-- first course 
SELECT *
, A.StartDate 
FROM dbo.T_Students S
     LEFT JOIN (
		SELECT A.CourseId
		, A.StudentId 
		, A.StartDate 
		, RANK() OVER (PARTITION BY A.StudentId ORDER BY A.StartDate) Ranking 
		FROM dbo.T_Allocations A
    ) A
   ON S.StudentId = A.StudentId 
  AND A.Ranking = 1 
ORDER BY S.StudentName
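
A similar query retrieves, for example, the last course started by each student, by simply reversing the sort order in the ranking:

-- last course 
SELECT S.StudentName
, A.CourseId 
, A.StartDate 
FROM dbo.T_Students S
     LEFT JOIN (
		SELECT A.CourseId
		, A.StudentId 
		, A.StartDate 
		, RANK() OVER (PARTITION BY A.StudentId ORDER BY A.StartDate DESC) Ranking 
		FROM dbo.T_Allocations A
    ) A
   ON S.StudentId = A.StudentId 
  AND A.Ranking = 1 
ORDER BY S.StudentName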

Aggregate window functions behave similarly to their simple counterparts, except that they are applied over a partition. If the partition is left out, the functions apply over the whole dataset.

-- aggregate window functions 
SELECT A.CourseId 
, C.CourseName 
, A.StudentId 
, S.StudentName 
, A.StartDate 
, A.EndDate 
, SUM(A.Mark) OVER (PARTITION BY A.StudentId) [Sum]
, MAX(A.Mark) OVER (PARTITION BY A.StudentId) [Max]
, MIN(A.Mark) OVER (PARTITION BY A.StudentId) [Min]
, AVG(A.Mark) OVER (PARTITION BY A.StudentId) [Avg]
, COUNT(A.Mark) OVER (PARTITION BY A.StudentId) [Count]
FROM dbo.T_Allocations A
     JOIN dbo.T_Courses C
       ON A.CourseId = C.CourseId 
     JOIN dbo.T_Students S
       ON A.StudentId = S.StudentId 
ORDER BY C.CourseName 
, S.StudentName 

When ordering attributes are also specified, one obtains running averages, respectively running sums:

-- Running averages/sums and ranking via Sum and ORDER BY
SELECT A.StudentId 
, A.CourseId
, A.StartDate
, A.EndDate
, A.Mark 
, AVG(Cast(Mark as decimal(4,2))) OVER (PARTITION BY A.StudentId ORDER BY A.StartDate, A.AllocationId) AvgGrade
, SUM(Mark) OVER (PARTITION BY A.StudentId ORDER BY A.StartDate, A.AllocationId) SumGrade
, RANK() OVER (PARTITION BY A.StudentId ORDER BY A.StartDate, A.AllocationId) Ranking
FROM dbo.T_Allocations A
ORDER BY A.StudentId
, A.StartDate
Running averages and sums (aggregate window functions with an ORDER BY in the OVER clause) were introduced only with SQL Server 2012. Previously, one could still calculate them using common table expressions and a ranking function:
-- Running averages/sums and ranking via common table expressions 
WITH CTE AS (
	SELECT A.StudentId 
	, A.CourseId
	, A.StartDate
	, A.EndDate
	, A.Mark 
	, RANK() OVER (PARTITION BY A.StudentId ORDER BY A.StartDate, A.AllocationId) Ranking
	FROM dbo.T_Allocations A
), 
DAT AS (
	SELECT A.StudentId 
	, A.CourseId
	, A.StartDate
	, A.EndDate
	, A.Mark 
	, A.Ranking
	, A.Mark SumMark
	FROM CTE A
	WHERE A.Ranking = 1
	UNION ALL
	SELECT CTE.StudentId 
	, CTE.CourseId
	, CTE.StartDate
	, CTE.EndDate
	, CTE.Mark 
	, CTE.Ranking
	, CTE.Mark + DAT.SumMark SumMark -- running sum
	FROM CTE
		 JOIN DAT
		   ON CTE.StudentID = DAT.StudentId 
		  AND CTE.Ranking-1 = DAT.Ranking 
) 
SELECT DAT.StudentId 
, DAT.CourseId
, DAT.StartDate
, DAT.EndDate
, DAT.Mark 
, cast(DAT.SumMark/DAT.Ranking as decimal(4,2)) AvgMark
, DAT.SumMark
, DAT.Ranking
FROM DAT
ORDER BY DAT.StudentId
, DAT.StartDate
   

Query Patterns in SQL Server: Aggregate Functions

As their name denotes, aggregate functions are used to summarize information across the whole dataset or grouped by one or more attributes. The following queries exemplify the use of aggregates based on the tables created in a previous post:

-- aggregates over the whole table 
SELECT COUNT(*) NoRecords 
, MIN(StartDate) StartDate
, MAX(EndDate) EndDate
, DateDiff(d, MIN(StartDate), MAX(EndDate)) [DiffDays]
FROM dbo.T_Allocations A

-- aggregates based on several attributes 
 SELECT A.CourseId
 , A.StudentId 
 , COUNT(*) NoRecords 
 , MIN(StartDate) StartDate
 , MAX(EndDate) EndDate
 , DateDiff(d, MIN(StartDate), MAX(EndDate)) [DiffDays]
 FROM dbo.T_Allocations A
GROUP BY A.CourseId
 , A.StudentId 

Correlated subqueries can be of help either inside a SELECT or within a CROSS APPLY, respectively an OUTER APPLY: 

-- correlated subquery within a SELECT 
 SELECT *
 , (SELECT COUNT(*) NoRecords 
     FROM dbo.T_Allocations A
     WHERE C.CourseId = A.CourseId
     GROUP BY A.CourseId) NoRecords 
 FROM dbo.T_Courses C


 -- CROSS APPLY (correlated subquery)
 SELECT *
 , A.NoRecords
 , A.StartDate
 FROM dbo.T_Courses C
      CROSS APPLY (
         SELECT A.CourseId
         , COUNT(*) NoRecords 
         , MIN(StartDate) StartDate
         FROM dbo.T_Allocations A
         WHERE C.CourseId = A.CourseId
         GROUP BY A.CourseId
     ) A
 ORDER BY C.CourseName

-- OUTER APPLY (correlated subquery)
 SELECT *
 , A.NoRecords
 , A.StartDate
 FROM dbo.T_Courses C
      OUTER APPLY (
         SELECT A.CourseId
         , COUNT(*) NoRecords 
         , MIN(StartDate) StartDate
         FROM dbo.T_Allocations A
         WHERE C.CourseId = A.CourseId
         GROUP BY A.CourseId
     ) A
 ORDER BY C.CourseName

Aggregates are useful in a series of techniques like the listing of duplicates:

-- identifying duplicates via an aggregate 
SELECT A.*
FROM dbo.T_Allocations A
     JOIN (
  SELECT A.CourseId
  , A.StudentId 
  , COUNT(*) NoRecords 
  FROM dbo.T_Allocations A
  GROUP BY A.CourseId
  , A.StudentId 
  HAVING COUNT(*)>1
  ) DUP
   ON A.CourseId = DUP.CourseId 
  AND A.StudentId = DUP.Studentid
ORDER BY A.CourseId
, A.StudentId 

A similar technique can be used to remove the duplicates from the base table:

-- removing the duplicates except the last record  
 DELETE dbo.T_Allocations 
 FROM (
 SELECT A.StudentId 
 , A.CourseId
 , A.StartDate
 , A.EndDate
 , Max(A.AllocationId) AllocationId
 FROM dbo.T_Allocations A
 GROUP BY A.StudentId 
 , A.CourseId
 , A.StartDate
 , A.EndDate
 HAVING count(*)>1
  ) A
WHERE dbo.T_Allocations.CourseId = A.CourseId
  AND dbo.T_Allocations.StudentId = A.StudentId
  AND dbo.T_Allocations.StartDate = A.StartDate
  AND dbo.T_Allocations.EndDate = A.EndDate
  AND dbo.T_Allocations.AllocationId <> A.AllocationId

When the number of values is fixed and unique within a grouping, one can use aggregate functions to display the values within a matrix:

-- Matrix display via aggregates 
SELECT S.StudentId 
, S.StudentName 
, CASE WHEN A.Course1 = 1 THEN 'x' ELSE '-' END Course1
, CASE WHEN A.Course2 = 1 THEN 'x' ELSE '-' END Course2
, CASE WHEN A.Course3 = 1 THEN 'x' ELSE '-' END Course3
, CASE WHEN A.Course4 = 1 THEN 'x' ELSE '-' END Course4
, CASE WHEN A.Course5 = 1 THEN 'x' ELSE '-' END Course5
FROM dbo.T_Students S
     LEFT JOIN (
  SELECT A.StudentId 
  , Max(CASE WHEN A.CourseId = 1 THEN 1 ELSE 0 END) Course1
  , Max(CASE WHEN A.CourseId = 2 THEN 1 ELSE 0 END) Course2
  , Max(CASE WHEN A.CourseId = 3 THEN 1 ELSE 0 END) Course3
  , Max(CASE WHEN A.CourseId = 4 THEN 1 ELSE 0 END) Course4
  , Max(CASE WHEN A.CourseId = 5 THEN 1 ELSE 0 END) Course5
  FROM dbo.T_Allocations A
  GROUP BY A.StudentId
    ) A
   ON S.StudentId = A.StudentId 
ORDER BY S.StudentName 
