20 June 2010

💎SQL Reloaded: Troubleshooting Query Issues III (Logical errors)

    In the last post I provided some general guidelines on troubleshooting query issues related to errors thrown by the database engine. That’s the fortunate case, because the engine’s validation at least signals that something went wrong. I’m saying fortunate because there are also logical errors that result in unexpected output, and such errors can remain undiscovered even for years. It’s true that they are hard to discover, but not impossible; adequate testing combined with defensive programming techniques can decrease their number. The simplest and handiest test is to look first at the number of records returned by the query, and then at how the cardinality evolves with each table added to the query. The clearest signal that something went wrong is when no records, too few or too many records are returned (compared to what was expected); in that case the problem most probably resides in the constraints used, namely in the JOIN, WHERE or HAVING clauses.

WHERE clause constraints

    If your query is a simple statement without joins of any type, then most likely the problem resides in the constraints used in the WHERE clause, at least one of them being evaluated to false. Here are several possible reasons:
- One of the hardcoded values or provided parameters is not found among the values available in the table;
- NULL cases are not handled;
- Predicates or parentheses are used incorrectly.

    The simplest constraint evaluated to false is “0=1” or something similar; actually any comparison of unequal constant values can be used (e.g. ‘A’=’B’, ‘ABC’=’’, etc.). In contrast, “1=1” is always evaluated to true. The below query returns no records because the constraint is evaluated to false:
 
-- single constraint evaluated to false 
SELECT * 
FROM Production.Product 
WHERE 1=0 

    A more complex scenario is when multiple predicates are combined and at least one of them always evaluates to true or to false, the task in such situations being to identify such constraints:

-- multiple constraints, one evaluated to false 
SELECT * 
FROM Production.Product 
WHERE 1=0 
    AND Class = 'L' 
    AND SafetyStockLevel >100 

    The constraints could also be grouped using parentheses:

-- multiple constraints with parenthesis, one evaluated to false 
SELECT * 
FROM Production.Product 
WHERE 1=0 
    AND ((Class = 'L' 
    AND SafetyStockLevel >100) 
      OR (Class = 'M' 
    AND SafetyStockLevel =100)) 

    Dealing with multiple such constraints is not a question of guessing but of applied Boolean algebra, which is essential in writing accurate queries. So if you feel that you’re not mastering such concepts, then maybe it’s a good idea to consult some material on the topic.
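    As a small illustration, consider De Morgan’s laws: negating a conjunction is equivalent to the disjunction of the negated predicates, so the below two queries (a sketch based on the same table used above) should return the same records:

-- equivalent constraints via De Morgan's laws 
SELECT * 
FROM Production.Product 
WHERE NOT (Class = 'L' AND SafetyStockLevel >100) 

SELECT * 
FROM Production.Product 
WHERE Class <> 'L' 
    OR SafetyStockLevel <=100 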

    There could be situations in which the total outcome of a set of constraints is not so easy to see; in such cases the constraints can be brought into the SELECT statement. Which constraints should be tested depends on the particularities of your query; here is an example:

--testing constraints' outcome for each record 
SELECT * 
, CASE  
     WHEN Class = 'L' AND SafetyStockLevel >100 THEN 1 
     ELSE 0 
  END Constraint1  
, CASE  
     WHEN Class = 'M' AND SafetyStockLevel =100 THEN 1 
     ELSE 0 
  END Constraint2 
, CASE  
     WHEN (Class = 'L' AND SafetyStockLevel >100) 
        OR (Class = 'M' AND SafetyStockLevel =100) THEN 1 
     ELSE 0 
END Constraint3 
FROM Production.Product  

      
  The constraints in which one of the members is NULL and the IS NULL operator or NULL-handling functions are not used are incorrectly evaluated – actually the respective records are filtered out – and in certain cases such constraints might even look correct, though this depends also on the expressions used. For example the below query will return no records, while the next one will return some records. For more on the correct handling of NULLs see Null-ifying the world or similar posts on the web.

--misusing NULL-based constraints 
  SELECT * 
FROM Production.Product 
WHERE (NOT 1=NULL) OR (1=NULL) 
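    And the counterpart, a minimal sketch using the ISNULL function to provide a comparable value in place of the NULL; because NOT 1=0 evaluates to true, the query returns all the records:

--NULL-based constraints handled with the ISNULL function 
SELECT * 
FROM Production.Product 
WHERE (NOT 1=ISNULL(NULL, 0)) OR (1=ISNULL(NULL, 0)) 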
 
JOIN constraints

    JOINs based on a single JOIN constraint should in theory pose no problem, unless the wrong attributes are used in the JOIN, and that’s easy to do, especially when no documentation is available on the topic or when it is incomplete. Functional or other technical specifications, physical or semantic models, metadata or other developers’ knowledge could help the developer figure out the relations between the various entities. On the other side, it’s in human nature to make mistakes: forgetting to add a constraint, using the wrong type of join or the wrong attributes in the join constraint. In addition, all the situations described above for WHERE clause constraints apply to join constraints too, independently of whether the ANSI or non-ANSI syntax is used. The latter places the JOIN constraints in the WHERE clause, making them harder to identify, and this is one of the reasons for which I recommend using the ANSI syntax.

    A technique I use on a daily basis in order to test my scripts is to check how the number of records changes as each new table is added to the initial query. In a first phase it is enough to count the number of records in the table with the lowest level of detail and the number of records returned by the final query, ignoring the WHERE constraints. For example, let’s consider the following query written some time ago (most of the attributes were removed for convenience):

-- Purchase Orders 
 SELECT POD.PurchaseOrderDetailID 
, POD.PurchaseOrderID 
, POD.ProductID  
, POH.ShipMethodID 
, POH.VendorID 
FROM Purchasing.PurchaseOrderDetail POD 
     JOIN Purchasing.PurchaseOrderHeader POH 
       ON POD.PurchaseOrderID = POH.PurchaseOrderID 
           JOIN Purchasing.ShipMethod PSM 
              ON POH.ShipMethodID = PSM.ShipMethodID 
           JOIN Purchasing.Vendor SMF 
              ON POH.VendorID = SMF.VendorID  
     JOIN Production.Product ITM 
        ON POD.ProductID = ITM.ProductID 
WHERE POD.StockedQty >100 
     AND ITM.Class = 'L' 
 
    Usually I write the table with the lowest level of detail (in theory also the one with the highest number of records) first, adding the referenced tables gradually, starting with the tables with the lowest number of records, thus the lookup tables come first. This usually doesn’t affect the way the database engine processes the query (unless special techniques are used), and it allows for a kind of systematic approach. It also allows me to test the query without making important changes. As I wrote above, the first test is done against the first table, commenting out the other joins and eventually the WHERE constraints:

-- testing changes in the number of records - test 1 
SELECT count(1) NumberRecords 
FROM Purchasing.PurchaseOrderDetail POD 
/* JOIN Purchasing.PurchaseOrderHeader POH 
     ON POD.PurchaseOrderID = POH.PurchaseOrderID 
          JOIN Purchasing.ShipMethod PSM 
             ON POH.ShipMethodID = PSM.ShipMethodID 
          JOIN Purchasing.Vendor SMF 
             ON POH.VendorID = SMF.VendorID 
      JOIN Production.Product ITM 
         ON POD.ProductID = ITM.ProductID 
WHERE POD.StockedQty >100 
AND ITM.Class = 'L' */ 
       
    In the second step it is enough to check the number of records returned by the whole query without the WHERE constraints:

-- testing changes in the number of records - test 2 
SELECT count(1) NumberRecords 
FROM Purchasing.PurchaseOrderDetail POD  
     JOIN Purchasing.PurchaseOrderHeader POH 
        ON POD.PurchaseOrderID = POH.PurchaseOrderID 
            JOIN Purchasing.ShipMethod PSM 
               ON POH.ShipMethodID = PSM.ShipMethodID 
            JOIN Purchasing.Vendor SMF 
               ON POH.VendorID = SMF.VendorID  
      JOIN Production.Product ITM 
          ON POD.ProductID = ITM.ProductID 
/* WHERE POD.StockedQty >100 
     AND ITM.Class = 'L' */       

    Even if the number of records is correct, the query is not necessarily correct: there could be incorrect LEFT JOINs that are not reflected in the number of records but in the fact that the corresponding attributes come back empty. So don’t rely entirely on this type of test; take 1-2 examples for which you are sure records must be retrieved through the LEFT JOIN, and check whether the expected values are shown. Eventually, only for testing purposes, the LEFT JOIN could be changed into a FULL JOIN to see what records are returned.
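    For example, a quick check along these lines (a sketch reusing the tables from the above query) retrieves the Purchase Orders for which the ShipMethod lookup finds no match; when the join and data are correct, no records should be returned:

-- checking for unmatched records behind a LEFT JOIN 
SELECT POH.PurchaseOrderID 
, POH.ShipMethodID 
FROM Purchasing.PurchaseOrderHeader POH 
     LEFT JOIN Purchasing.ShipMethod PSM 
        ON POH.ShipMethodID = PSM.ShipMethodID 
WHERE PSM.ShipMethodID IS NULL 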

    Now, if there are changes in the number of records from one test to the other, then take the first query and move the comment one join further (see below), and repeat the step until the JOIN that causes the variances is identified.

-- testing changes in the number of records - test 3  
SELECT count(1) NumberRecords 
FROM Purchasing.PurchaseOrderDetail POD 
JOIN Purchasing.PurchaseOrderHeader POH 
ON POD.PurchaseOrderID = POH.PurchaseOrderID 
/*      JOIN Purchasing.ShipMethod PSM 
           ON POH.ShipMethodID = PSM.ShipMethodID 
         JOIN Purchasing.Vendor SMF 
            ON POH.VendorID = SMF.VendorID 
     JOIN Production.Product ITM 
        ON POD.ProductID = ITM.ProductID 
WHERE POD.StockedQty >100       
AND ITM.Class = 'L' */          

 Once the join that causes the variances is found, it might be necessary to add a new join constraint or to use a LEFT JOIN instead of an INNER JOIN. For example, if in the above query there are records with NULL ShipMethodID values, then the query should have been written using a LEFT JOIN. On the other side, if duplicates are found, then do a grouping on the primary key of the table with the lowest level of detail, in this case Purchasing.PurchaseOrderDetail.
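    A minimal sketch for such a duplicate check, assuming that the primary key of Purchasing.PurchaseOrderDetail is given by the combination of PurchaseOrderID and PurchaseOrderDetailID; any key returned more than once points to a join that multiplies the records:

-- identifying duplicates introduced by the joins 
SELECT POD.PurchaseOrderID 
, POD.PurchaseOrderDetailID 
, count(1) NumberRecords 
FROM Purchasing.PurchaseOrderDetail POD 
     JOIN Purchasing.PurchaseOrderHeader POH 
        ON POD.PurchaseOrderID = POH.PurchaseOrderID 
GROUP BY POD.PurchaseOrderID 
, POD.PurchaseOrderDetailID 
HAVING count(1)>1 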

HAVING clause constraints

    HAVING clause constraints behave much like WHERE clause constraints, the difference residing in the level at which the constraints are applied (after aggregation rather than before it), so the same techniques can be applied in this case too, with the difference that the constraints need to be adapted to support aggregate functions.

--Vendors with total purchase value >100000 
SELECT POH.VendorID 
, SUM(POH.SubTotal) TotalSubTotal 
FROM Purchasing.PurchaseOrderHeader POH 
WHERE POH.Status IN (2,4) -- 2-Approved, 4-Complete  
GROUP BY POH.VendorID 
HAVING SUM(POH.SubTotal)>100000 

    Sometimes it helps to bring the HAVING constraints into the SELECT, though that offers limited functionality; therefore it may be more productive to link the aggregated query back to the base table(s) on which the statement is based:

--Vendors with total purchase value >100000 - base records 
SELECT A.* 
, B.TotalSubTotal 
, CASE 
     WHEN B.VendorID IS NOT NULL THEN 1 
     ELSE 0 
END ConsideredFlag 
FROM Purchasing.PurchaseOrderHeader A 
LEFT JOIN ( --aggregated data 
      SELECT POH.VendorID 
      , SUM(POH.SubTotal) TotalSubTotal 
      FROM Purchasing.PurchaseOrderHeader POH 
      WHERE POH.Status IN (2,4) -- 2-Approved, 4-Complete  
      GROUP BY POH.VendorID 
HAVING SUM(POH.SubTotal)>100000 
) B 
     ON A.VendorID = B.VendorID 
WHERE A.Status IN (2,4) -- 2-Approved, 4-Complete  

  
 
  This technique could also prove to be limited, though I found it useful in several cases, especially when a complicated formula is included in the aggregate function. The same query could be rewritten with a window aggregate function, introduced with SQL Server 2005.

--Vendors with total purchase value >100000 - window aggregate function 
SELECT POH.* 
, SUM(POH.SubTotal) OVER (PARTITION BY POH.VendorID) TotalSubTotal 
, CASE 
     WHEN SUM(POH.SubTotal) OVER (PARTITION BY POH.VendorID)>100000 THEN 1 
     ELSE 0 
END ConsideredFlag 
FROM Purchasing.PurchaseOrderHeader POH 
WHERE POH.Status IN (2,4) -- 2-Approved, 4-Complete 
 
    When working with amounts, quantities or other numeric values, it could be useful to check the total amount/quantity resulting from the query logic against the total amount/quantity computed directly on the base table.
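    A possible sketch for such a check based on the above statements; the two totals should match, a difference signaling lost or multiplied records (the HAVING constraint is intentionally left out, as it restricts the considered vendors):

--total based on the base table 
SELECT SUM(POH.SubTotal) TotalBase 
FROM Purchasing.PurchaseOrderHeader POH 
WHERE POH.Status IN (2,4) -- 2-Approved, 4-Complete 

--total based on the query logic 
SELECT SUM(B.TotalSubTotal) TotalQuery 
FROM ( --aggregated data 
      SELECT POH.VendorID 
      , SUM(POH.SubTotal) TotalSubTotal 
      FROM Purchasing.PurchaseOrderHeader POH 
      WHERE POH.Status IN (2,4) -- 2-Approved, 4-Complete 
      GROUP BY POH.VendorID 
) B 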

💎SQL Reloaded: Troubleshooting Query Issues II (Errors)

Some weeks ago I wrote a post on troubleshooting, looking at the topic from a problem solving perspective. In what concerns the troubleshooting of SQL scripts, much depends on the type of DML statement used and on the type of issue the developer deals with – errors thrown by the database engine, a wrong number of records returned by a query, performance, accessibility or reusability related issues.

Many errors come from wrong usage of syntax or functionality, or from misspellings; in many situations the database engine provides enough information to lead the developer closer to the issue’s resolution, at least by narrowing the range of search, and here the problem solving approach could be useful. In such situations it is useful to do a quick search in the local/online documentation or resources using the error number or the full/partial error message. I know that’s not rocket science, though I’ve seen people act as if hypnotized by the error message, not knowing what to do next. A quick search using a powerful search engine like Google.com, Bing.com or Yahoo.com could lead you to the solution or help you gather more data about the possible causes. More likely than not, somebody was in the same situation and has posted a question or even the answer somewhere in a forum or a blog. It’s always a good idea, and time-effective, to do a little research before posting a new question in a forum. Especially among beginners there are people who don’t know how to make use of the documentation, help tools or even a search engine, therefore it could be useful for them to take some time and look at some tips on how to make better use of such tools.

In general when an error is thrown, an error number is provided that has a specific meaning for the vendor, together with a general description, and in fortunate cases also the line where the error occurred. For example in SQL Server the execution of the below query against the AdventureWorks database returns the following error message:

SELECT * 
FROM dbo.Purchasing.PurchaseOrderHeader POH 
"
Msg 208, Level 16, State 1, Line 1
Invalid object name 'dbo.Purchasing.PurchaseOrderHeader'"
Even if in this case the error was created deliberately by using a schema name (dbo) in place of the database name, this doesn’t mean that the cause/solution is always so simple. For example the same error message could be thrown by issues related to KB837970, KB310882, KB289551 and several other articles that can be found by using the following search text in Google:
site:support.microsoft.com Microsoft.com "Msg 208, Level 16"

Unfortunately, in what concerns the errors appearing in queries, the localization of errors is not so clean, depending also on the order in which the query is parsed by the engine; thus the issue might be shown as occurring at a line other than its actual location, namely at the line where the parser first finds an issue.
  
When the error message is in a language other than English, it makes sense to find the corresponding English error message and use that in the search, because a quite important volume of documentation and other information is in English, so the chances of finding something are higher. Normally that can be done by using the error code, which in theory has the same meaning independently of the language used.
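In SQL Server one possible way of doing that is to look up the error number in the sys.messages catalog view (a small sketch; 1033 is the language id for US English):

--retrieving the English text for a given error number 
SELECT message_id 
, text 
FROM sys.messages 
WHERE message_id = 208 
    AND language_id = 1033 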

Sometimes, when you didn’t know how to do something and just used your inspiration and tried some functionality without being fully aware of its implications and constraints, it makes sense to consult the documentation or do a search on the web on how to do the task (e.g. how to load data from Oracle to SQL Server). 10 minutes spent in research could save you a few hours later, so don’t jump blindly into coding! In many situations a solution is already available, and if you find the code, please consider not using it blindly; try to understand the how and why, its limitations and how it could eventually be improved, and some thorough testing might be necessary too.

Don’t try to reinvent the wheel just for the sake of it; in what concerns programming, more likely than not others have had the same problem and found a neat solution to it. This doesn’t mean you have to search for every problem you have, but it pays off for tasks of higher complexity than the ones you are accustomed to, the need for it decreasing with the level of expertise. Doing research on a given topic can be time consuming, but could prove useful, especially when other developers used a different technique. Actually, once you have understood the idea, you could attempt to reconstruct the logic with your own coding and styling conventions, though of course it doesn’t always make sense to do that.

Note:
You’ll have to consider also copyright infringements, but in theory the developers who post code on the web for further reuse want only some minimal credit for it (e.g. a link to the website where the code was found, a mention of the developer’s name, etc.), while in some situations you may reuse the code only if you make substantial changes to it. Both statements apply not only to code but also to other types of work (content) freely available on the web (e.g. articles, tutorials, etc.).

In case the search was unsuccessful, it makes sense to post your question on a forum, specifically in the area closest to your problem. Posting a question in the wrong forum has quite high chances of leading nowhere! Some forum members are quite critical of users who post questions without doing some research, especially when there is already an answer available on the same forum; such users have to be aware that they are wasting not only their own time but also the readers’. On the other side, the more places the same question is posed in, the higher the chances for others to reach them and eventually get an answer, this depending also on how search engines index the web pages.
 
When the same error appears for the second time, in theory the developer should already know how to solve it or what the possible cause(s) are. Because the solutions are not always easy to remember, as it is in human nature to forget, it makes sense to document the issues by describing the problem and the solutions (preferably with code) to such a level that they can be understood by other professionals too.

When the number of errors occurring in your SQL scripts is quite high, especially when the same type of error recurs, then maybe it makes sense to take a step back and ask yourself why that happens. Sometimes it just means that it is time to take a break; more often, though, it may be necessary to review the theory/documentation, make use of defensive programming techniques or use an editor with powerful IntelliSense functionality that checks the syntax and highlights errors as you type. If you are a beginner, don’t worry, with time coding and troubleshooting become much easier.

All the above ideas are simple common knowledge among developers, though such knowledge comes from (prolonged) experience – a combination of learning by doing and solving issues – “the school of hard knocks”. I just hope the above lines and the coming posts will be helpful to you.

03 June 2010

🧭Business Intelligence: Enterprise Reporting (VIII: Addressable Questions in Reports’ Creation I)

Business Intelligence
Business Intelligence Series

The creation of a (data) report of any type involves most of the time several iterations until all the requirements are gathered, the query and eventually the final report are created, the report gets tested, the issues related to deliveries and requirements are mitigated, etc. Typically the steps involved in the creation of a report are straightforward for the developer, though for the user and the other people involved in the process everything might look like a black box, often also because of the lack of visibility and communication between the involved parties. Actually, at least in this case, the black box is not the problem but the input, and the output that’s highly dependent on the input, not to mention that any information omitted on the user’s side could have an unexpected impact on the output.

In many cases I found that the information on which a report is created is minimal; for example the user comes to the developer with a question (e.g. what’s the total revenue for the past x years) and expects the developer to come back with the result in the form of a report with data reflecting the answer to the respective question. When formulating the requirements, the user might even come with a list of attributes he would like to see in such a report; some of the attributes might not be possible to show given the report’s needed level of detail, or it might make sense to add other attributes (e.g. Unit of Measure, Currency, Price Unit, Quantity) in order to avoid possible confusion and miscalculations, and to facilitate the understanding, testing or usage of such a report. There could be cases in which the user has no real idea of what he wants, finding it difficult to describe the problem and the exact data he might need; this could happen also because the user isn’t aware of the data and their structure, the processes, or the various constraints and other business rules that apply.

 When from the whole initial dataset based on the requested attributes the user needs only a smaller subset, this is translated into a set of constraints (e.g. the report is needed only for a single Vendor or a single time unit); it then makes sense to consider some of the constraints as parameters, allowing the report to fit a larger set of requirements. In addition, the user might need to filter on attributes not already considered in the constraints. For an easier understanding of the report the columns could be positioned in a certain order and the records sorted based on a given list of ordered attributes. For the same reason special formatting could be applied to some of the attributes: the values could be right/center/left aligned, the dates written in a predefined format (e.g. DD/MM/YY), the numbers rounded to a given number of decimals, the negative values highlighted in another color, etc.

When the requirements don’t match the semantic or relational data model, the adequate level of detail must be found; for this it may be necessary to renounce some of the attributes, eventually split the report into a summary and a detail report, aggregate some of the values, or consider only the last/first record from a given subtable, all these aspects needing to be mitigated with the users. Some of the calculations the user needs could be included directly in the report, the exact formulas to be applied needing to be given as input by the users, the special cases needing to be mitigated too.

As the same data could come from different sources with different timeliness, accuracy, consistency, completeness and availability – in other words with different quality – the adequate data source needs again to be mitigated with the users, in theory the data with the most appropriate quality needing to be considered. To these constraints could be added the technical constraints derived from the volume of data, report type, storage type, available time, data transformations needed, reporting platform, etc.

There are several other important facts that should be considered in the requirements gathering phase, namely whether a similar report already exists, whether it could be used unaltered or needs to be modified by adding additional attributes, filters, formatting, etc. As all such work needs a ROI argumentation too, there are also cases in which the costs of creating such a report are higher than the costs associated with the user preparing the report manually; on the other side such costs are not always easy to quantify, though it’s a good idea to have the needed reports automated as much as possible. In such cases some input might be requested from the Functional and IT Managers, together with the enforcement of policies and processes to deal with such aspects.

The above aspects are gathered in the below set of questions; users and developers should address them together, and even if some apply only to developers, some awareness on the users’ side could be beneficial too:
Q1: What’s the problem statement? (What issues does the report try to address?)
Q2: What kind of data would need to be gathered (in order to solve or better understand the problem)?
Q3: What level of detail is needed?
Q4: What attributes are needed?
Q5: Is there a similar report (that could be extended or used as template)?
Q6: What are the report’s definition, purpose and financial argumentation?
Q7: Does it make sense to invest time, effort and money in creating such a report? What's the estimated effort to build the report? What's the halting point?
Q8: What attributes does it make sense to add/remove?
Q9: What’s the (appropriate) source for the attributes?
Q10: What special filter constraints are needed?
Q11: Which of the filter constraints should be considered as parameters?
Q12: What formulas need to be applied for the calculated values?
Q13: What (default) sorting does the user need?
Q14: What’s the needed order of the attributes in the layout?
Q15: Are there any aggregations that need to be added?
Q16: In what form should the report be delivered?
Q17: What formatting should be applied?
Q18: What’s the report’s frequency?
Q19: Who will be the report’s owner?
Q20: How many records will the average report have?
Q21: What other technical or logistic constraints qualify?
Q22: When is the report needed?
Q23: What documentation is needed?
Q24: Is any sensitive information contained?
Q25: Who needs access to the report? Do any externals need access to it?
Q26: Are there further uses for the same data? 
Q27: What's the expected life-expectancy for the report?
Q28: What's the reporting platform on which the report should be built?
Q29: Does it make sense to split the report into multiple perspectives?
Q30: What's the expected performance for the report? Are there any limitations?

Note:
The above list of questions should not be considered complete; some of the questions could be merged, while some could lead to other questions, especially in the area of technical and logistics constraints, which need to be addressed adequately.


Market Review: What's New around the World

Instead of Introduction

    May is gone and it was quite a stressful month: given that I had to start a new job, a new position I will have to grow into, I also had to find a new apartment, move, and settle in, so I barely had time for my blogs, the little time I had being spent reading material related to the new requirements, trying to improve my German, etc. Even so, I found a few interesting resources, adding them to the pile of other interesting links collected over time. So I realized that maybe it makes sense to group some of them in a set of postings under the title “What's New (In My World)”; hopefully they will be of interest to other people too.

Free Universities and Resources for SQL Developers & Co.

    I was positively surprised to discover that several known authorities in their fields have opened their own online Universities for the masses, a new free formal form of study on the web that takes advantage of the latest trends in online learning – webcasts, PowerPoint presentations, blog posts, online communities, new technologies and applications, etc. The most unexpected appearance is the SQL University, having as faculty known professionals and bloggers like Jorge Segarra, Adam Machanic, Pinal Dave or Kendal Van Dyke, to mention only the ones whose blogs I knew previously, but also other no less important professionals. The posts cover topics like SSIS, PowerShell, parallelism, database administration, etc., and I find them quite a useful source of knowledge not only for beginners but also for experienced professionals.

   Early this year I discovered the BeyeUniversity, which currently offers a series of 5 free certificate programs for BeyeNetwork members, covering Business Intelligence related topics, the online video courses having an easy to follow format, the content being approachable also by non-professionals. Among the instructors can be recognized Bill Inmon, one of the pioneers of data warehousing, leading the course on Unstructured Data and being also a columnist in BeyeNetwork newsletters, several other webcasts with him being available as well. Ralph Kimball, the other important personality in the area of data warehousing, has opened the Kimball University, and even if its courses are not free, a big collection of articles reflecting the ideas of his school of thought can be found on his website.

   The MIT OpenCourseWare initiative is quite old, though maybe not so many people know that courses on database related topics are also made available, for example Database Systems; Database, Internet, and Systems Integration Technologies; or Data Mining, the lecture notes and study materials offering a high level overview of the concepts related to the mentioned subjects. Some of the other courses offer video or audio content too, though more such resources can be found on YouTube EDU, Academic Earth, Lecturefox or similar sites, covering a growing range of database related topics. Similar resources can be found with a simple search using your favorite search engine, as many Universities have made their text, video or audio lecture notes available online, voluntarily or involuntarily.

Microsoft News

    With SQL Server 2008 R2, Visual Studio 2010 and MS Office 2010, Microsoft closes a new cycle of releases, the products reflecting Microsoft’s philosophy and the degree to which the company managed to keep its promises. It will be interesting to see how the products will be regarded by professionals and the press, how they fit the general technology trends, and how they could be used together in order to provide highly scalable and available solutions. I really hope that Microsoft has made some important steps in bridging the gaps existing in their tools’ integration on the 64 bit version, and that the newly added functionality makes a difference.

24 April 2010

🧭Business Intelligence: Troubleshooting (Part I: A Problem Solving Approach)

Business Intelligence Series


Introduction

On several occasions I have observed that there are SQL and non-SQL developers who don’t know how to troubleshoot a programming problem in general, respectively a SQL-related issue in particular, and I’m not referring here to the complex problems that typically require the expertise of a specialist, but to simple day to day situations: troubleshooting an error thrown by the database engine, an error in the logic, a performance issue, unavailability of resources, etc. I’m not necessarily talking about the people posting questions on forums, professional networks or blogs, even if in many situations they could have found an answer to the problem by doing a little research, but about seeing developers actually at work. It’s true that there are also many cases in which the software throws an error message that gives you no idea where to start, or the error is reported at a line other than the one where it actually occurs, leading the developer down a false trail.

Before going into detail let’s take a short look at troubleshooting and what it means! Paraphrasing Wikipedia’s general definition, troubleshooting in IT is a type of problem solving applied to software and infrastructure related issues. Software issues refer not only to the various types of errors thrown by software applications, but also to functional, rendering or configuration errors, performance issues, data quality issues, etc. Infrastructure related issues refer to the IT infrastructure – network, information systems, processes, methods or methodologies used. In this post I will refer only to software issues, even if the technique(s) for troubleshooting this kind of issue could be applied to infrastructure issues too.

Polya’s Approach to Problem Solving

In his book 'How To Solve It', G. Polya, a well known Hungarian mathematician, advances a natural 4 step approach to solving a problem: 1. understanding the problem, 2. devising a plan, 3. carrying out the plan, and 4. looking back [1]. G. Polya’s approach could be used for all types of problems, including IT problems, and even if I find this approach too high level for solving this type of problem, it’s actually a cornerstone on which more detailed approaches could be built. Let’s look shortly at each of Polya’s four steps!

1. Understanding the problem 
 
Understanding the problem comes down to identifying what is known – the data, the actual facts – and what is not known: what causes the issue and how it can be solved. Somebody was saying that a problem well understood is half solved, and there are quite good chances of arriving at the wrong solution if the problem is not well understood. If in Mathematics the problem is defined beforehand together with the whole range of constraints, in IT, when troubleshooting, the problem first needs to be defined; in the context of this post the problem revolves around a technical or business issue appearing in the form of an error message, an unexpected/wrong application behavior, a wrong process, etc. Thus the actual facts could amount to the error message, current vs. expected behavior, tools used, high/low level design, business logic, affected objects, specific constraints, etc.
 
Defining the issue/problem might not be as simple as it seems, especially when the issue is reported by other people in informal non-technical terminology, with fuzzy formulations like “there is an error in the XYZ screen” that don’t actually detail what the issue is about, the steps followed and the input that resulted in the respective issue, and other such aspects that need to be addressed in order to understand the problem. All these aspects are not known to the developer, though with a little investigation they are transformed into known information, this involving communication with the users, looking in the documentation, and gathering any other facts. Actually we could group all these actions under the “gathering the facts” syntagma, and this type of operation could be considered part of this step because it is intrinsic to understanding the problem.

2. Devising a plan 
 
In this step one attempts to find the connection between the data and the unknown, looking at the problem from different angles in order to obtain an idea of the solution, to make a plan [1]. We have a plan when we know which steps we have to follow in order to identify the issue (solve the problem); they don’t have to be too detailed, but addressable, not necessarily complete but forming a base that can evolve with time, for example when new information/results are found. There could be multiple directions to look into, for example based on a list of possible causes, the constraints the various features come with, different features for implementing the same thing, etc.
 
Naturally the first questions a developer should ask: have I seen this issue before, in actual or slightly modified form? Could the problem be broken down into smaller (known) problems? Could anything useful be derived from the data, and have all essential notions involved in the problem been considered [1]? Essential notions are always a thing to look into, mainly because, I would say, many issues derive from feature constraints or from misuse of features. Tools like brainstorming, check lists, root-cause analysis or conceptual mapping could be used – in fact any tool which could help us track the essential notions and the relations between them.

3. Carrying out the plan 
 
Once the plan is sketched, we can go on and approach each of the branches of the plan, performing the successive steps in one branch until we find an end-point (a point from which we can’t go further). There could be branches going nowhere, multiple solutions, or no apparent solution to the problem. Everything is possible… More likely, while advancing in carrying out the plan, we will discover other intermediary steps and other branches (alternative ways of arriving at the same result or of approaching different constraints).

4. Looking back 
 
According to Polya, this step comes down to examining the solution [1]: reviewing the argumentation used and the solution’s construction, asking whether the solution is optimal, whether it could be reused to solve other types of problems or whether it could be improved/refactored. Actually this is a step many developers completely ignore; they found a solution, it’s working, so their work is done! No – even when pressed by time these aspects of problem solving should be considered, and from my point of view this step also includes documenting the issue and, in special cases, communicating the solution found to the circle of professionals (e.g. in terms of best practices or lessons learned – why not a blog post, etc.). Topics like optimality and refactoring are quite complex and deserve a post of their own, therefore I will confine myself to mentioning only that they typically consider the solution from the point of view of performance, complexity, (re)usability and design, the developer having to trade between them and other similar (quality) dimensions.

Beyond Polya’s Approach

A natural question: do we really have to follow this approach?! Come on, there will be cases when you’ll have the solution without explicitly attempting to define the problem or devise a plan, or only by listing the scope and the constraints! Unconsciously we actually follow the first three steps, but forget or completely ignore the fourth, and I feel that applying Polya’s approach brings some “conscious thought” into this process that could help us make the most of it.
 
In many cases the solution will be there in the documentation, giving developers some explicit or implicit hints about the areas in which to search; for example in case of an error related to a query, a first input is the error message returned by the database engine. Fortunately RDBMS vendors like Microsoft and Oracle provide also a longer description for each error, allowing one to understand what the error message is about. This is the happiest case; there are also many software tools that, after running for half an hour, return a fuzzy error message (e.g. ‘an error occurred!’), and nothing more.

Thank God for the Internet, a dynamic knowledge repository in which lots of knowledge can be found with just a simple click, but which is also sensitive to the input. In many cases I could find one or more solutions or hints for an error I had, usually just by copy-pasting the error number or the error description, or, when the description is too long, only its most important part. I have observed that there is a quite important number of professionals who prefer to post their issue in a forum or professional group instead of doing some research by themselves, this lack of effort helping to increase the volume of redundant information on the web, which comes with negative but also positive implications.

When we perform such a search, we actually rely on the solution provided by other users, shortcutting the troubleshooting process, and, at the risk of repeating the same syntagma, this comes with negative but also positive implications. For example, a negative aspect is that people don’t learn how to troubleshoot by themselves, relying instead on ready-made solutions, while a positive aspect is that less time is spent within the troubleshooting process, at least in theory. Actually, considering the mentioned positive aspect, that’s also why I consider the “looking back” step important, and I’m referring especially to the action of documenting the issue.
 

References:
[1] G. Polya (1973). How To Solve It: A New Aspect of Mathematical Method, 2nd Ed. Princeton University Press. ISBN: 0-691-08097-6.

21 April 2010

#️⃣Software Engineering: Programming (Part II: To get or not Certified?!)

Software Engineering
Software Engineering Series

To get or not certified?! That’s a question I asked myself several times over the years, and frankly it doesn’t have an easy answer because there are many aspects that need to be considered: previous education, targeted certification, availability of time, financial resources or learning material, required software, hands-on experience, certification’s costs, duration/frequency, objectives, value (on the market), requirements, contexts, etc. 
On many occasions when most of the conditions were met I didn’t have the time to do it, or I waited for the requirements for the new set of certifications to appear, referring mainly to the SQL Server 2005 and 2008 versions, or I preferred to continue my “academic” studies, so here I am after almost 10 years of experience in the world of SQL without any certification, but, I would say, with a rich experience covering mainly full life-cycle development of applications, reporting, data quality and data integration, ETL, etc. 

Enough talking about myself, let’s get to the subject. I’ve seen this topic appear again recently in 1-2 professional groups, so I’ll try to approach it from a general point of view, because most of the characteristics also apply to database-related certifications like Microsoft MCITP (Microsoft Certified IT Professional) or MCTS (Microsoft Certified Technology Specialist) for SQL Server.

Naturally, in what concerns certification, the opinions among professionals are split. An often met argument against it is the belief that a certification is just a piece of paper with limited value when not backed up by adequate hands-on experience, while the pro-argumentation is that some companies, employers and customers alike, value a certification, considering it a personal achievement reflecting not only the owner’s commitment to approach and take a certification exam, but also a basic level of knowledge. Both views are entirely correct from their own perspective, weighting differently from person to person, community to community or from one domain of expertise to another, and they have positive and negative aspects, many of them subjective as they are related to people’s perception.

From a global perspective an IT “certification fulfills a great need by providing standardized exams for the most current and important technologies” [3], allowing one to judge people’s knowledge on the topics encompassed by it [1], being thus a way to quantify knowledge, especially related to general tasks. Certifications offer a path to guide the study of a domain [2], are developed around agreed-upon job tasks [3] and consider a basic knowledge base made of vocabulary, definitions, models, standards, methods, methodologies, guidelines or best practices.

Given the fact that a certification covers most of the topics from a given domain, in theory it provides a wide but superficial coverage of the respective domain, in contrast with hands-on experience – the professional experience accumulated by solving day-to-day tasks – which provides a narrower (task-based) but deeper coverage. Therefore, from my point of view the two are not simply complementary but could together offer a wide and deep coverage of the domain; a certification somehow needs to be based on a certain number of years of hands-on experience in order to get more value out of it. 

On the other side, variety in hands-on experience could offer wider coverage of the domain, though I suppose that could be accomplished fully by hands-on experience only over a longer unit of time. These suppositions are fully theoretical because there are many other parameters that need to be considered, for example a person’s capacity for learning by doing vs. theoretical learning (this involves also understanding of concepts), the learning curve and particularities of the technologies, methods or methodologies involved, the forms of training used, etc.

A certification is not meaningless, as several professionals argue (e.g. J. Shore, T. Graves & others), even when considered from the employers’ perspective, and the fact that it doesn’t count for some employers or professionals, that’s another story. A certification could eventually be considered useless, though that’s not fully true either. Maybe the certification itself is useless for a third party, though it’s not so from the point of view of the learning process, as long as the knowledge accumulated is further used and the certification is not an end in itself.  
A certification is not, or shouldn’t be, an end in itself; it should be part of a continuous learning process in which knowledge is perpetually discovered, integrated and reused. Most probably in order to keep the learning process continuous, several certifications, including MCITP, require recertification after a number of years.

There are professional certifications that require provable experience in the respective domain before one is actually accepted for the certification; it’s the example of the PMP (Project Management Professional) and CAPM (Certified Associate in Project Management) certifications from PMI (Project Management Institute), which require a considerable number of hours of non-overlapping direct or indirect PM experience, and the example is not singular: if I’m not mistaken, the CISSP (Certified Information Systems Security Professional) certification also requires a certain number of years of experience. This type of requirement allows in theory to make the most of the learning process, facilitating the integration of knowledge with experience.

How useful is a certification for the certified person?! It depends also on how well the certification succeeds in covering the knowledge, skills and abilities required by an actual job, and on how much of the knowledge acquired will later be used. There are people who focus only on passing the exam, nothing more, though I would say that comes with downsides in the long term. There are even organizations that encourage and sponsor their employees’ certification by providing training material, courses, partial or full expenses, such initiatives being often part of their strategic effort of creating value and a knowledge-based environment, the professional certification being also a form of recognition, valued in what concerns employees’ performance and eventually associated with a form of remuneration.

I think that a certification could be beneficial for a person with relatively little or no professional experience in a certain domain, the certification bridging to a small degree the gap to hands-on experience. It could be interesting to study whether hands-on experience could be compensated to some degree by attempting to (re)use the learned concepts in self-driven applications or several examples. 
When learning something new I found it useful to try writing a tutorial or a blog post using a well-defined example, though this won’t entirely replace hands-on experience, the difference between the two being the limited vs. the global scope of handling tasks, in dealing with real-life situations. Most probably it could also be useful to learn about the use of a technique/technology in several contexts, though this equates to a lot of research and effort spent in that direction. Is it worth doing that?!

A certification is an opportunity to enter a “select” circle of professionals, though it depends on how each vendor or group of certificate holders takes advantage of this “asset” and what other benefits are derived out of it. For example, by publishing domain related content, certificate holders could be kept up-to-date with new features, trends, best practices, etc.; the professional network thus created could benefit from the potential such networks offer, especially when considering problem solving and the creation, propagation and mapping of knowledge. Of course, such networks could also have side effects, for example the creation of exclusivist networks. I would say that the potential of professional networks is still theoretical, but with the evolution of the Web new possibilities will emerge.

A person taking such a certification in theory manages to cover most of the important topics related to a given domain; however this doesn’t guarantee that the person is actually capable of (successfully) applying the concepts and techniques in real life scenarios, the many “brain dumps” and other easy ways of obtaining a certification decreasing the certification’s credibility and value. There are domains over-flooded by people with certifications but without the skills to approach a real project; a company that gives too much credit to a certification could end up stuck with resources that can’t be used, this aspect impacting other professionals negatively too. 
I’m coming back to the idea that a certification is subject to people’s perception, and I have to say that the most important opinion in this direction is not necessarily that of the professionals active in the respective domain, but of the people from HR, PM and partially headhunters, because they are the ones who make the selection, deciding who’s hired and who’s not. Considering that there are few professionals in HR and PM who come from the IT domain, there are lots of false and true presumptions when evaluating such candidates, people arriving to come up with their own methods of filtering the candidates, and even if such methods are efficient from the result perspective, many good professionals could feel kind of “discriminated” against.

In theory it’s easier to identify a person who has a certification than to navigate through a huge collection of related projects and tasks, or to search a collection of CVs for the various possible combinations of significant terms related to a job description. Somebody (sorry, I don’t remember who) was saying that a manager spends on average 20-30 seconds on each CV; it then also depends on how eye-catching a certification is in a simple CV scan. 
From a semantic point of view I would say that a certification is richer in meaning than any type of written experience, though it also depends on the reviewer’s knowledge about the respective certification. What’s sure is that when choosing between two professionals with similar experience there are high chances for the one having a certification to be hired. In addition, considering that there are hundreds of applicants for the good jobs on the market, I would say that a certification could allow a candidate, among many other criteria, to distinguish himself from the crowd.

Given the explosion of technologies in IT and the domain’s dynamics, segmentation and other intrinsic characteristics, IT certifications are more specialized, more segmented and less standardized, making their evaluation difficult, especially when domains intersect each other or when the vendors issuing the certifications are competing against each other. Compared with other domains, an IT professional needs to be always up-to-date and cover multiple related domains in order to do his work efficiently; for example, in order to provide a full life-cycle solution a developer would have to be kind of an expert in Software Engineering, UI and database programming, security, testing, etc. The high segmentation in IT can also be seen in the denominations for the various roles, lots of confusion deriving from this, especially when matching job descriptions with the roles.

The bottom line must also be considered: in IT as in other domains, knowledge and experience are relative because they depend also on a person’s skills and ability to assimilate, use and (creatively) reuse the knowledge from a domain; a person could in theory accumulate in one year the same experience as others in 2 or more years, same as a person who got certified could in theory handle day-to-day tasks without any difficulty, same as in theory a student with no professional experience could handle programming tasks like a professional with several years of experience. 
At least in my country, there are many University programs that provide IT-related curricula within non-IT domains (e.g. Mathematics, Economics, Engineering), a number of programming courses being taught also in high school or even lower grades, the theory learned and the small projects theoretically facilitating the certification for a programming language (e.g. C#, Java or C++) or directly the handling of day-to-day tasks. It’s true that in school the focus is more on the syntax, basic features and algorithmic nature of programming, but this doesn’t diminish the value of this type of learning when done adequately. Such educational experience is not considered professional experience at all, even if it provides a considerable advantage when approaching a certification or a job.

It must be highlighted that taking a certification comes with no guarantee of getting a job or being successful in your career/profession. You have to ask yourself honestly what you want to achieve with a certification and how you’ll use the learning process in order to get the most out of it. You actually have to enjoy the road to the final destination rather than dream about the potential success brought by such a certification. It could actually take more time until you recover your investment or see that the invested time was worth it, and, as always, some risks need to be assumed. Consider the positive and negative aspects altogether and decide for yourself whether it makes sense to go for a certification.

There is actually a third choice – continuing the academic studies, for example pursuing a bachelor’s, a master’s or, why not, a doctoral degree. The approach of such a degree imposes similar questions as in the case of a certification, though academic degrees are in theory better accepted by society, even if they too come with no guarantees and require more effort and financial resources.


References:
[1] K. Forsberg, H. Mooz, H. Cotterman (2005). Visualizing Project Management: Models and Frameworks for Mastering Complex Systems. John Wiley & Sons. ISBN: 978-0-471-64848-2.
[2] D. Gibson (2008). MCITP SQL Server 2005 Database Developer All-In-One Exam Guide. McGraw-Hill. ISBN: 978-0071546690.
[3] L.A. Snyder, D.E. Rupp, G.C. Thornton (2006). Personnel Selection of Information Technology Workers: The People, The Jobs, and Issues for Human Resources Management. Research in Personnel and Human Resources Management, Vol. 25, Martocchio J.J. (Ed.). JAI Press. ISBN: 978-0762313273.

17 April 2010

#️⃣Software Engineering: Programming (Part I: Learning SQL)

Software Development Series

Introduction

One of the topics that appears in many database related groups or forums is where and how newcomers should start learning SQL (Structured Query Language) in general, and specific vendors’ SQL features in particular. Two years back, in the Learning something new post from my programming-grasps blog attempt, I wrote down my thoughts on how to approach learning new subjects in general; many of those observations could be applied to learning SQL too, and domain-specific recommendations could be added on top. 

I would say that when discussing a topic like learning, it makes sense to approach the subject from the user’s level of knowledge and skill-set; different recommendations apply for beginner, intermediate and expert levels. And no, even when you are an expert the learning process doesn’t stop there, as SQL and RDBMS (Relational Database Management Systems) evolve with time, new features and techniques being added, same as new contexts of applicability could emerge (e.g. very large databases, geographic information systems, Web-related SQL, etc.).

The Beginner Level

Approaching a topic from a beginner’s perspective is seldom an easy task, given that many newcomers don’t know where to start, what to learn first, which RDBMS to learn first, etc. I was, and I’m still, at that point as new subjects appear on my learning schedule. One important decision is which RDBMS solution to choose, and for this you might need to consider several aspects: market share, goals/objectives, functional area (e.g. database administration, database development, data warehousing, business intelligence, ETL, data quality), personal preferences, salaries and job postings, UI, features, tools, learning curve, performance, programmability, extensibility, integration and compatibility with other technologies, availability of documentation, books, tutorials and other learning material, etc.

At least for now, Oracle, Microsoft and IBM are the leaders in RDBMS and related data technologies, this being also the ranking in terms of market share and job postings (see for example B. Nicolich's SQL Server Job Trends article). If you don't have any preferences, my recommendation is to focus on at least one of these RDBMS: Oracle, SQL Server and MySQL. Unfortunately, given the subjectivity of the available studies, I can't say which of the three has better performance or is better to use; much depends on the context in which the studies were made, the overall architecture and its particularities, etc.

Having a Microsoft technologies background, I like SQL Server's easy-to-use UI and the integrated solution approach it provides; on the other hand Oracle seems to provide more SQL-related features (e.g. functions), while MySQL is still trying to catch up with the other two vendors and become more mature. All three vendors implement the ANSI SQL-92 standard, so learning SQL on one of the platforms allows you to write queries on the other platforms too, the differences residing in the functions used, the DML (Data Manipulation Language) and DDL (Data Definition Language) syntax, specific functionality, architecture and tools.
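    For illustration, a small sketch (based on hypothetical Customers and Orders tables): the ANSI SQL-92 core syntax below runs unchanged on all three platforms, while the built-in functions differ by vendor:

-- ANSI SQL-92 core syntax, portable across SQL Server, Oracle and MySQL
SELECT C.CustomerName
, COUNT(O.OrderID) NumberOrders
FROM Customers C
    LEFT JOIN Orders O
      ON C.CustomerID = O.CustomerID
GROUP BY C.CustomerName
HAVING COUNT(O.OrderID) > 5
-- what differs are the built-in functions, e.g. the current date/time:
-- GETDATE() (SQL Server), SYSDATE (Oracle), NOW() (MySQL)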

For SQL Server and Oracle you can download the trial versions, or opt for the Express versions that can be used for free but with certain constraints, while for MySQL you can download one of the free community editions, quite stable solutions. While SQL Server comes with a nice Management Studio that can be used to write queries and administer your databases, for Oracle and MySQL you might have to consider tools like Toad for Oracle or SQL Developer, respectively Toad for MySQL, which allow you to be more productive on the respective platforms. If you intend to use an RDBMS more for personal use (e.g. data analysis), what I call a personal database, then MS Access and MS Excel could prove to be enough (actually MS Excel is not an RDBMS, but it can be used as a data source or data analysis tool), especially when personal or small-scale solutions need to be developed in an organization with limited knowledge of SQL, though a more evolved RDBMS is recommended when providing solutions for multiple users.

All three vendors provide rich documentation, with a plus for SQL Server and Oracle, through the online MSDN or Technet, respectively the Oracle Database Documentation Library, and the MySQL Reference Manual. The knowledge provided by the documentation is complemented by the many published books, magazines, tutorials, blogs, webcasts, podcasts, studies, best practices and other types of technical literature, many of them easily retrievable with a simple search using your favorite search engine, the important thing being to know what you are looking for. You have to consider that the technical documentation focuses more on features, syntax and technical details, while for specific purposes like development, architecture integration or problem solving you might have to consider the other sources of knowledge, with a plus for books, tutorials and blogs, the latter two being actually more up-to-date than books given their shorter "time to market" and informal character.

If your interest in SQL and databases derives from purely educational purposes, then you could also consider the free online course materials (presentations, video lectures) posted within initiatives like YouTube EDU or MIT OpenCourseWare, many other lecture notes being available online from various well-known universities. Another important source of knowledge that has gained ground in the past years is that of social and professional networks centered around the important technologies, so joining such a network could bring some benefits, but don't expect miracles!

As for books, maybe it makes sense to list at least my preferences, given the huge volume of books on this topic. I'm a fan of the "for Dummies" series, therefore I find Allen Taylor's SQL for Dummies and its more extensive version SQL All-In-One Desk Reference for Dummies, or Database Development for Dummies, a good starting point. If you are targeting SQL Server then you might want to check also Microsoft SQL Server 2008 For Dummies, Microsoft SQL Server 2008 All-in-One Desk Reference For Dummies, Microsoft SQL Server 2005 Programming For Dummies or Microsoft SQL Server 2005 Express Edition For Dummies. In exchange, the list of choices for Oracle or MySQL is quite a bit smaller; as far as I know there are only two books on the market on Oracle - Oracle PL/SQL For Dummies and Oracle 11g For Dummies - while for MySQL the books from the "for Dummies" series target the use of MySQL in combination with other scripting/programming languages: Apache, MySQL, and PHP Web Development All-in-One Desk Reference, PHP & MySQL For Dummies or PHP & MySQL Everyday Apps For Dummies. Other popular series are the "for Beginners" or "Beginning" ones, having a narrower focus; for example Beginning T-SQL with Microsoft SQL Server 2005 and 2008 or Beginning PL/SQL: From Novice to Professional could be two of the books you start with.

Choosing the right book is highly dependent on what you want to achieve, your learning style and author preferences, the books' availability and costs, each professional in general having his/her own list of preferences. Years back I was quite stressed because many of the examples written in books were not working; nowadays that has hopefully changed for the better, and in addition many publishers make the code available through the books' supporting sites or the publishers' websites, together with a few sample chapters from the respective books. If you don't want to buy the whole book or you'd like to read it online, you can become a member of online libraries like Safari Books Online, Skillsoft's Books 24x7, eBrary or SpringerLink. Many books are starting to become available for purchase in PDF directly from publishers, or in proprietary format on digital book readers like Amazon's Kindle or Sony's Reader, the digital versions being in theory 10-30% cheaper than the paperback products.

The Intermediate Level

The intermediate level is a relative point in the evolution of a professional, in theory reached after several projects or prolonged use of SQL in various scenarios. During this stage many developers start to approach topics like performance or best practices, going thus beyond the basics, or become interested in taking a certification, attempting thus to approach the whole range of topics related to databases. Some of the books listed above qualify for the intermediate or expert levels as well; it depends also on the books' level of detail and their approach to the given topics. The book series closest to this level are the "Professional"/"Pro" ones, for example Pro SQL Server 2008 Relational Database Design and Implementation, Pro T-SQL 2008 Programmer's Guide, Professional Oracle Programming, Pro MySQL or Professional MySQL.

For those targeting the MCTS certification, it makes sense to consider the Self-Paced Training Kits: (Exam 70-432) Microsoft SQL Server 2008 - Implementation and Maintenance and (Exam 70-433) Microsoft SQL Server 2008 - Database Development, while for those approaching the MCITP certification, MCITP SQL Server 2005 Database Developer All-in-One Exam Guide is quite useful. As I haven't approached the Oracle or MySQL certifications, I'd recommend doing some research by yourself.

The Expert Level

At this level developers start to go deeper into the database internals, learning how things are handled in the background, differentiating between multiple solutions and choosing the optimal one. Books like Microsoft SQL Server 2008 Internals, SQL Server 2008 Query Performance Tuning Distilled or Professional SQL Server 2008 Internals and Troubleshooting shouldn't be missing from a developer's library! I would expect there are similar books for Oracle and MySQL too, but as I'm not aware of them I will stop here.

At this level a developer could also approach database topics from a higher level, making an incursion into the mathematical/conceptual theory of relational databases; books like Information Modeling and Relational Databases or Applied Mathematics for Database Professionals could become really handy in understanding such topics.

Note:
Most probably there are even better books out there; as I highlighted above, these are my recommendations based on the books that have come into my hands over the years. In many cases, instead of buying a book, I check whether I can find the information I need on the web, though the quality of the material varies, sometimes as much as the quality of the books. There are even cases when the information is contradictory; however, in most cases I would go with the experts that have the information from the source.


05 April 2010

💎SQL Reloaded: The Power of Joins VI (Not-Equal Joins)

    SQL Server 2008 documentation mentions not-equal joins without giving an explicit definition, the term appearing within the context of joining values in two columns that are not equal. On the other hand, the not-equal join denomination could be used to refer more generally to joins whose join constraints involve comparison operators other than equal ("="), for example <, >, <>, BETWEEN, IN and its negation NOT IN, LIKE, etc. The difference between the two views resides in the fact that some of the operators (e.g. IN, BETWEEN) could imply the equality of values from the joined columns. Another objection could be made on whether a not-equality join constraint may be based only on the attributes of one of the tables participating in the join, the other thus missing; such constraints could be considered join constraints too, even if in some scenarios they could have been written in the main WHERE clause as well. For the sake of simplicity and flexibility, I would consider a not-equal join, or not-equality join, to be a join in which at least one of the join constraints is based on an operator other than the equal operator. We can thus speak of not-equal join constraints, referring to the join constraints that involve operators other than equal, and equal join constraints, the two types of constraints being able to coexist in the same join. A not-equal join constraint often implies the possibility of more records being returned than if an equal join constraint were used.
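    To illustrate the definition, here is a minimal sketch (against AdventureWorks' Production.Product table) in which the self-join's only join constraint uses the "<" operator, matching each product with all the cheaper products:

-- minimal not-equal join: each product matched with all cheaper products
SELECT ITM1.ProductNumber
, ITM1.StandardCost
, ITM2.ProductNumber CheaperProductNumber
, ITM2.StandardCost CheaperStandardCost
FROM Production.Product ITM1
    JOIN Production.Product ITM2
      ON ITM2.StandardCost < ITM1.StandardCost -- not-equal join constraint

    As each product can match many cheaper products, the sketch also shows why a not-equal constraint typically returns more records than an equal one.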

      If we talk about join constraints then most likely we refer to vertical joins, though anti-joins and semi-joins could sporadically include not-equality operators too. In a normal join based on equalities, for the set of attributes participating in the join, each record from the left table is matched with all the corresponding records from the right table, even if this means retrieving more than one record. The same happens for not-equal constraints, even if the business scenarios in which such a need arises are not so intuitive. One relatively simple example I often met is the case when the incremental aggregation of values needs to be calculated, for example the incremental volume of POs (Purchase Orders) placed over time:
 
SELECT DII.DateSequence
, IsNull(POH.SubTotal, 0) SubTotal
FROM dbo.fGetDatesInInterval('2002-01-01', '2002-01-31') DII
    CROSS APPLY (--incremental PO Volume
    SELECT SUM(POH.SubTotal) SubTotal
    FROM Purchasing.PurchaseOrderHeader POH
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
        AND DATEDIFF(d, POH.OrderDate, DII.DateSequence) >= 0 --not-equal constraint
    ) POH

    In order to simplify the query I used the dbo.fGetDatesInInterval table-valued function, which returns a sequence of dates within a given time interval, and a CROSS APPLY returning the aggregated value for each date returned by the function, the calculation considering all the records having an Order Date smaller than or equal to the given (sequence) date. As can be seen from the above query, the not-equal constraint is based on the DateDiff function; we could have written POH.OrderDate<=DII.DateSequence instead, one of the reasons for its use being the fact that the constraint can easily be modified to show, for example, only the aggregated values for the past 90 days, the BETWEEN operator entering the scene:
 
SELECT DII.DateSequence
, IsNull(POH.SubTotal, 0) SubTotal
FROM dbo.fGetDatesInInterval('2002-01-01', '2002-01-31') DII
    CROSS APPLY (--incremental PO Volume
    SELECT SUM(POH.SubTotal) SubTotal
    FROM Purchasing.PurchaseOrderHeader POH
    WHERE POH.Status IN (2, 4) -- 2-Approved, 4-Complete
        AND DATEDIFF(d, POH.OrderDate, DII.DateSequence) BETWEEN 0 AND 90 --not-equal constraint
    ) POH
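
    The definition of dbo.fGetDatesInInterval is not included in this post; a possible sketch of such a table-valued function (an assumed implementation, using a recursive CTE) could look as follows:

-- assumed sketch of dbo.fGetDatesInInterval: returns the sequence of dates in the given interval
CREATE FUNCTION dbo.fGetDatesInInterval(
  @StartDate smalldatetime
, @EndDate smalldatetime)
RETURNS TABLE
AS
RETURN (
    WITH Dates(DateSequence) AS (
        SELECT @StartDate
        UNION ALL
        SELECT Cast(DATEADD(d, 1, DateSequence) as smalldatetime)
        FROM Dates
        WHERE DateSequence < @EndDate
    )
    SELECT DateSequence
    FROM Dates
)
-- note: the default MAXRECURSION limit of 100 caps the interval's length,
-- unless the calling query specifies OPTION (MAXRECURSION 0)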

   The not-equal constraints from the above queries are part of the interval-based constraints category, which covers the scenarios in which one of the attributes from the tables participating in a join is checked on whether it falls within a given interval, typically for numeric values as above or for date values, as in the next query that shows the Last Standard Cost details per Product.

-- Products &amp; Last Standard Cost details SELECT ITM.ProductID  
, ITM.ProductNumber  
, ITM.StandardCost 
, PCH.StartDate 
, PCH.EndDate 
FROM Production.Product ITM 
    JOIN Production.ProductCostHistory PCH 
     ON ITM.ProductID = PCH.ProductID AND GETDATE() BETWEEN PCH.StartDate AND IsNull(PCH.EndDate, GETDATE()) 

    Such queries often need to be written when the records in a table have a given validity interval, with no overlap existing between intervals.

    Using a list of values within a join constraint, and thus the IN or NOT IN operators, occurs when, for example, the base table stores data coming from multiple sources and several encodings are used for the same source, each of them carrying its own meaning. For example, using the Production.TransactionHistory and several other tables coming with the AdventureWorks database, and supposing that for Purchase Orders there are two transaction types, P and B, and for Sales Orders S and I, a general-scope query on Transactions could be written as follows:
   
-- Transaction History
SELECT PTH.TransactionID
, PTH.ProductID
, PTH.ReferenceOrderID
, CASE PTH.TransactionType
     WHEN 'W' THEN 'Work Order'
     WHEN 'S' THEN 'Sales Order'
     WHEN 'I' THEN 'Internal Sales Order'
     WHEN 'P' THEN 'Purchase Order'
     WHEN 'B' THEN 'Blanket Purchase Order'
     ELSE 'n.a'
  END TransactionType
, CASE
     WHEN PTH.TransactionType = 'W' THEN PWO.StartDate
     WHEN PTH.TransactionType IN ('S', 'I') THEN SOH.OrderDate
     WHEN PTH.TransactionType IN ('P', 'B') THEN POH.OrderDate
  END ReferenceDate
FROM Production.TransactionHistory PTH
    LEFT JOIN Purchasing.PurchaseOrderHeader POH
      ON PTH.ReferenceOrderID = POH.PurchaseOrderID
     AND PTH.TransactionType IN ('P', 'B') --not-equal constraint
    LEFT JOIN Sales.SalesOrderHeader SOH
      ON PTH.ReferenceOrderID = SOH.SalesOrderID
     AND PTH.TransactionType IN ('S', 'I') --not-equal constraint
    LEFT JOIN Production.WorkOrder PWO
      ON PTH.ReferenceOrderID = PWO.WorkOrderID
     AND PTH.TransactionType = 'W'

    The need for writing such a query arises seldom, and it could have also been written as 3 distinct queries whose results are combined with two unions (see the sketch below), both approaches having their pluses and minuses.
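
    For illustration, a sketch of the union-based alternative (note that, using inner joins, transactions without a matching order would be dropped, unlike in the LEFT JOIN-based version above):

-- Transaction History rewritten as 3 queries combined with unions
SELECT PTH.TransactionID
, PTH.ProductID
, PTH.ReferenceOrderID
, CASE PTH.TransactionType WHEN 'P' THEN 'Purchase Order' ELSE 'Blanket Purchase Order' END TransactionType
, POH.OrderDate ReferenceDate
FROM Production.TransactionHistory PTH
    JOIN Purchasing.PurchaseOrderHeader POH
      ON PTH.ReferenceOrderID = POH.PurchaseOrderID
WHERE PTH.TransactionType IN ('P', 'B')
UNION ALL
SELECT PTH.TransactionID
, PTH.ProductID
, PTH.ReferenceOrderID
, CASE PTH.TransactionType WHEN 'S' THEN 'Sales Order' ELSE 'Internal Sales Order' END TransactionType
, SOH.OrderDate ReferenceDate
FROM Production.TransactionHistory PTH
    JOIN Sales.SalesOrderHeader SOH
      ON PTH.ReferenceOrderID = SOH.SalesOrderID
WHERE PTH.TransactionType IN ('S', 'I')
UNION ALL
SELECT PTH.TransactionID
, PTH.ProductID
, PTH.ReferenceOrderID
, 'Work Order' TransactionType
, PWO.StartDate ReferenceDate
FROM Production.TransactionHistory PTH
    JOIN Production.WorkOrder PWO
      ON PTH.ReferenceOrderID = PWO.WorkOrderID
WHERE PTH.TransactionType = 'W'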

   One of the above queries used a constraint based on the DateDiff function, and in theory any other function could be used in a join constraint, including user-defined functions. Such function-based join constraints are handy but should be used with caution because they could impact the query's performance. Pattern-based join constraints built with the LIKE operator can be used as well, as in the sketch below.
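
    Here is a pattern-based sketch (an illustrative assumption, matching products that share the first two characters of the Product Number):

-- pattern-based join constraint using the LIKE operator
SELECT ITM1.ProductNumber
, ITM2.ProductNumber RelatedProductNumber
FROM Production.Product ITM1
    JOIN Production.Product ITM2
      ON ITM2.ProductNumber LIKE LEFT(ITM1.ProductNumber, 2) + '%' -- pattern-based join constraint
     AND ITM1.ProductID <> ITM2.ProductID -- not-equal constraint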

    There are cases in which constraints that would typically be included in the WHERE clause are added in the body of the join, apparently without any good reason. Such denatured join constraints are based only on the attributes of one of the tables involved in the join, like in the next query:
   
-- PO for Products with StandardCost>10
SELECT POD.PurchaseOrderDetailID
, POD.ProductID
, ITM.ProductNumber
, ITM.StandardCost
, POD.UnitPrice
, POD.OrderQty
FROM Purchasing.PurchaseOrderDetail POD
    JOIN Production.Product ITM
      ON POD.ProductID = ITM.ProductID
     AND ITM.StandardCost > 10 -- denatured join constraint