About Me

IT Professional with more than 16 years experience in IT especially in the area of full life-cycle of Web/Desktop Applications Development, Database Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP support, etc.

Tuesday, March 16, 2010

Excel for SQL Developers – Part IV: Differences Between Two Datasets

    One of the tasks which appears from time to time on my table is to determine the differences between two datasets coming from different systems or from the same system/source but taken at different points in time. Even if it’s fairly simple to write such a query, especially when the number of attributes is quite small, there is lot of repetitive work that needs to be done because typically for each set of matched attributes needs to be added a third attribute (difference flag) showing whether there is a difference or not. In time I started to use Excel formulas to create the SELECT clause for such a query, matching tables’ attribute first and categorizing the data types mainly in 4 categories: text, amounts, numeric (other than amounts) and dates. This categorization is requested first of all by the different ways in handling the NULL values for each data type, and secondly by the difference between amounts and numeric, even if they might have the same data type, the difference resides in the fact that amounts might be rounded differently in each data source, therefore a difference of a few cents will not be considered as difference.

    Given two attributes ColumnX and ColumnY from tables A, respectively B, let’s look first on how the difference flag constraint could be written for each category:
1. text attributes:
CASE
    WHEN IsNull(A.ColumnX , '')<> ISNULL(B.ColumnY, '') THEN 'Y'
    ELSE 'N'
END DiffColumnXYFlag

2. amount attributes:
CASE
    WHEN IsNull(A.ColumnX, 0) - IsNull(B.ColumnY, 0) NOT BETWEEN -0.05 AND 0.05 THEN 'Y'
    ELSE 'N'
END DiffColumnXYFlag

3. numeric attributes:
CASE
    WHEN IsNull(A.ColumnX, 0) <> IsNull(B.ColumnY, 0) THEN 'Y'
    ELSE 'N'
END DiffColumnXYFlag

4. date attributes:
CASE
   WHEN IsNull(DateDiff(d, A.ColumnX, B.ColumnY), -1)<>0 THEN 'Y'
   ELSE 'N'
END DiffColumnXYFlag


Notes:
1.   Bit attributes can be treated as numeric as long as they are considered as having a bi-state, for tri-state values in which also NULL is considered as a distinct value then the constraint must be changed, the most natural way being to translate the NULL to –1:
CASE
    WHEN IsNull(A.ColumnX, -1) <> IsNull(B.ColumnY, -1) THEN 'Y'
    ELSE 'N'

END
DiffColumnXYFlag

2.   In most of the examples I worked with the difference between two pair dates, the difference was calculated at day level, though it might happen that is needed to compare the values at smaller time intervals to the order of hours, minutes or seconds. The only thing that needs to be changed then is the first parameter from DateDiff function. There could be also situations in which a difference of several seconds is acceptable, a BETWEEN operator could be used then as per the case of numeric vs. amount values.
3.    In case one of the attributes is missing, the corresponding difference flag could take directly the value ‘N’ or ‘n/a’.
4.   It could happen that there are mismatches between the attributes’ data type, in this case at least one of them must be converted to a form that could be used in further processing.

    Thus a macro for this purpose would take as input a range with the list of attributes from the two tables, the data type category and the columns participating in the join constraint, two parameters designating the name of the left and right table participating in the FULL OUTER JOIN, the time interval considered, the error margin value interval (e.g. [-e, e]) and a flag indicating whether to show all combined data or only the records for which there is at least a difference found.

Function GetComparisonQuery(ByVal rng As Range, ByVal LeftTable As String, ByVal RightTable As String, ByVal TimeInterval As String, ByVal ErrorMargin As String, ByVal ShowOnlyDifferencesFlag As Boolean) As String
'builds the code for a comparison query between two tables
Dim attributes As String
Dim differences As String
Dim constraint As String
Dim whereConstraints As String
Dim joinConstraints As String
Dim columnX As String
Dim columnY As String
Dim index As Integer
For index = 1 To rng.Rows.Count
    columnX = Trim(rng.Cells(index, 1).Value)
    columnY = Trim(rng.Cells(index, 2).Value)
 
    If Len(columnX) > 0 Or Len(columnY) > 0 Then
       If Len(columnX) > 0 Then
            attributes = attributes & ", A." & columnX & " LT" & columnX & vbCrLf
       End If
       If Len(columnY) > 0 Then
            attributes = attributes & ", B." & columnY & " RT" & columnX & vbCrLf
       End If
      
       constraint = ""
       If Len(Trim(rng.Cells(index, 4).Value)) = 0 Then
            If Len(columnX) > 0 And Len(columnY) > 0 Then
                 'creating the difference flag
                 Select Case Trim(rng.Cells(index, 3).Value)
                 Case "text":
                      constraint = "CASE" & vbCrLf & _
                                   "     WHEN IsNull(A." & columnX & " , '') <> IsNUll(B." & columnY & ", '') THEN 'Y'" & vbCrLf & _
                                   "     ELSE 'N'" & vbCrLf & _
                                   "  END"
                 Case "amount":
                      constraint = "CASE" & vbCrLf & _
                                   "     WHEN IsNull(A." & columnX & " , 0) - IsNUll(B." & columnY & ", 0) NOT BETWEEN -" & ErrorMargin & " AND " & ErrorMargin & " THEN 'Y'" & vbCrLf & _
                                   "     ELSE 'N'" & vbCrLf & _
                                   "  END"
                 Case "numeric":
                      constraint = "CASE" & vbCrLf & _
                                   "     WHEN IsNull(A." & columnX & " , 0) <> IsNUll(B." & columnY & ", 0) THEN 'Y'" & vbCrLf & _
                                   "     ELSE 'N'" & vbCrLf & _
                                   "  END"
                 Case "date":
                      constraint = "CASE" & vbCrLf & _
                                   "     WHEN DateDiff(" & TimeInterval & ", A." & columnX & ", B." & columnY & ")<>0 THEN 'Y'" & vbCrLf & _
                                   "     ELSE 'N'" & vbCrLf & _
                                   "  END"
                 Case Else: 'error
                     MsgBox "Incorrect data type provided for " & index & " row!", vbCritical
                 End Select
                
               
                If ShowOnlyDifferencesFlag Then
                   whereConstraints = whereConstraints & " OR " & constraint & " = 'Y'" & vbCrLf
                End If
               
                differences = differences & ", " & constraint & " Diff" & columnX & "Flag" & vbCrLf
            Else
                differences = differences & ", 'n/a' Diff" & IIf(Len(columnX) > 0, columnX, columnY) & "Flag" & vbCrLf
            End If
                            
        Else
            joinConstraints = joinConstraints & "    AND A." & columnX & " = B." & columnY & vbCrLf
        End If
    
     End If
Next
If Len(attributes) > 0 Then
    attributes = Right(attributes, Len(attributes) - 2)
End If
If Len(joinConstraints) > 0 Then
    joinConstraints = Right(joinConstraints, Len(joinConstraints) - 8)
End If
If Len(whereConstraints) > 0 Then
    whereConstraints = Right(whereConstraints, Len(whereConstraints) - 4)
End If
'building the comparison query
GetComparisonQuery = "SELECT " & attributes & _
    differences & _
    "FROM " & LeftTable & " A" & vbCrLf & _
    "     FULL OUTER JOIN " & RightTable & " B" & vbCrLf & _
    "       ON " & joinConstraints & _
    IIf(ShowOnlyDifferencesFlag And Len(whereConstraints) > 0, "WHERE " & whereConstraints, "")
  
End Function

Excel - Comparison Datasets

    The query returned by the macro for the above example based on attributes from Production.Product table from AdventureWorks database and Production.Products table created in Saving Data With Stored Procedures post:
SELECT A.ProductID LTProductID
,
B.ProductID RTProductID

,
A.Name LTName

,
B.Name RTName

,
A.ProductNumber LTProductNumber

,
B.ProductNumber RTProductNumber

,
A.MakeFlag LTMakeFlag

,
B.MakeFlag RTMakeFlag

,
A.FinishedGoodsFlag LTFinishedGoodsFlag

,
B.FinishedGoodsFlag RTFinishedGoodsFlag

,
A.Color LTColor

,
B.Color RTColor

,
A.SafetyStockLevel LTSafetyStockLevel

,
B.SafetyStockLevel RTSafetyStockLevel

,
A.ReorderPoint LTReorderPoint

,
A.StandardCost LTStandardCost

,
B.StandardCost RTStandardCost

,
A.ListPrice LTListPrice

,
B.ListPrice RTListPrice

,
A.Size LTSize

,
A.DaysToManufacture LTDaysToManufacture

,
A.ProductLine LTProductLine

,
A.Class LTClass

,
A.Style LTStyle

,
A.SellStartDate LTSellStartDate

,
B.StartDate RTSellStartDate

,
A.SellEndDate LTSellEndDate

,
B.EndDate RTSellEndDate

,
A.DiscontinuedDate LTDiscontinuedDate

,
B.CreationDate RT

,
B.CreatedBy RT

,
A.ModifiedDate LTModifiedDate

,
B.ModifiedDate RTModifiedDate

,
B.ModifiedBy RT

,
CASE

    WHEN IsNull(A.Name , '') <> IsNUll(B.Name, '') THEN 'Y'

    ELSE 'N'

END DiffNameFlag

,
CASE

    WHEN IsNull(A.ProductNumber , '') <> IsNUll(B.ProductNumber, '') THEN 'Y'

    ELSE 'N'

END DiffProductNumberFlag

,
CASE

    WHEN IsNull(A.MakeFlag , 0) <> IsNUll(B.MakeFlag, 0) THEN 'Y'

    ELSE 'N'

END DiffMakeFlagFlag

,
CASE

    WHEN IsNull(A.FinishedGoodsFlag , 0) <> IsNUll(B.FinishedGoodsFlag, 0) THEN 'Y'

    ELSE 'N'

END DiffFinishedGoodsFlagFlag

,
CASE

     WHEN IsNull(A.Color , '') <> IsNUll(B.Color, '') THEN 'Y'

    ELSE 'N'

END DiffColorFlag

,
CASE

    WHEN IsNull(A.SafetyStockLevel , 0) <> IsNUll(B.SafetyStockLevel, 0) THEN 'Y'

    ELSE 'N'

END DiffSafetyStockLevelFlag

,
'n/a' DiffReorderPointFlag

,
CASE

     WHEN IsNull(A.StandardCost , 0) - IsNUll(B.StandardCost, 0) NOT BETWEEN -0.05 AND 0.05      

     THEN 'Y'
ELSE 'N'
END DiffStandardCostFlag

,
CASE

     WHEN IsNull(A.ListPrice , 0) - IsNUll(B.ListPrice, 0) NOT BETWEEN -0.05 AND 0.05 THEN 'Y'
     ELSE 'N'
END DiffListPriceFlag

,
'n/a' DiffSizeFlag

,
'n/a' DiffDaysToManufactureFlag

,
'n/a' DiffProductLineFlag

,
'n/a' DiffClassFlag

,
'n/a' DiffStyleFlag

,
CASE

    WHEN DateDiff(d, A.SellStartDate, B.StartDate)<>0 THEN 'Y'
    ELSE 'N'
END DiffSellStartDateFlag

,
CASE

    WHEN DateDiff(d, A.SellEndDate, B.EndDate)<>0 THEN 'Y'
    ELSE 'N'
END DiffSellEndDateFlag

,
'n/a' DiffDiscontinuedDateFlag

,
'n/a' DiffCreationDateFlag

,
'n/a' DiffCreatedByFlag

,
CASE

     WHEN DateDiff(d, A.ModifiedDate, B.ModifiedDate)<>0 THEN 'Y'
     ELSE 'N'
END DiffModifiedDateFlag

,
'n/a' DiffModifiedByFlag

FROM
Production.Product A
    FULL OUTER JOIN Production.Products B
       ON A.ProductID = B.ProductID

WHERE
CASE

     WHEN IsNull(A.Name , '') <> IsNUll(B.Name, '') THEN 'Y'
     ELSE 'N'
END = 'Y'
OR CASE
    WHEN IsNull(A.ProductNumber , '') <> IsNUll(B.ProductNumber, '') THEN 'Y'
     ELSE 'N'
END = 'Y'
OR CASE
     WHEN IsNull(A.MakeFlag , 0) <> IsNUll(B.MakeFlag, 0) THEN 'Y'
     ELSE 'N'
END = 'Y'
OR CASE
     WHEN IsNull(A.FinishedGoodsFlag , 0) <> IsNUll(B.FinishedGoodsFlag, 0) THEN 'Y'
     ELSE 'N'
END = 'Y'
OR CASE
     WHEN IsNull(A.Color , '') <> IsNUll(B.Color, '') THEN 'Y'
     ELSE 'N'
END = 'Y'
OR CASE
     WHEN IsNull(A.SafetyStockLevel , 0) <> IsNUll(B.SafetyStockLevel, 0) THEN 'Y'
     ELSE 'N'
END = 'Y'
OR CASE
     WHEN IsNull(A.StandardCost , 0) - IsNUll(B.StandardCost, 0) NOT BETWEEN -0.05 AND 0.05

      THEN 'Y'
ELSE 'N'
END = 'Y'
OR CASE
      WHEN IsNull(A.ListPrice , 0) - IsNUll(B.ListPrice, 0) NOT BETWEEN -0.05 AND 0.05 THEN 'Y'
      ELSE 'N'
END = 'Y'
OR CASE
      WHEN DateDiff(d, A.SellStartDate, B.StartDate)<>0 THEN 'Y'
      ELSE 'N'
END = 'Y'
OR CASE
      WHEN DateDiff(d, A.SellEndDate, B.EndDate)<>0 THEN 'Y'
      ELSE 'N'
END = 'Y'
OR CASE
       WHEN DateDiff(d, A.ModifiedDate, B.ModifiedDate)<>0 THEN 'Y'
      ELSE 'N'
END = 'Y'


Notes:
1.   The macro doesn’t consider also a ORDER BY clause, though it could be easily added manually
2.   Not all of the join constraints are so simple so that they can be reduced to one or more simple equalities, on the other side we have to consider that the most time consuming task is listing the attributes and the difference flags.
3.    Sometimes it’s easier to create two extracts – in the first being considered all the records from the left table and the matches from the right table (left join), respectively all the records from the right table and the matches from the left table (right join).
4.     Given the fact that the attributes participating in the join clause should in theory match, each pair of such attributes could be merged in one attribute using the formula: IsNull(A.ColumnX, B.ColumnY) As ColumnXY.
5.     In order to show all the data from the two tables and not only the differences, all is needed to do is to change the value of the last parameter from true to false:
=GetComparisonQuery(A2:D23, "Production.Product", "Production.Products", "d", "0.05", false) 
6.     For TimeInterval parameter should be provided only the values taken by DatePart parameter (first parameter) of SQL Server’s DateDiff function.
7.     Please note that no validation is made for the input parameters.

1 comment:

Excel Comparison said...

Very Nice Resource for Web Developers. Thanks For your nice information.