07 February 2010

🗄️Data Management: The Data-Driven Enterprise (Part I: Thoughts on a White Paper)

Data Management
Data Management Series

I read today ‘The Data-Driven Enterprise’ white paper from Informatica – quite a useful paper, especially as it comes from one of the leaders in integration software and services. In this paper the term data-driven enterprise refers to organizations that are “able to take advantage of their data assets to work faster, better and smarter” [1]; in order to achieve this state of art it is necessary to “invest in the people, processes and technology needed to know where the data resides, to understand it, to clean it and keep it clean, and to get it to where it is needed, when and how it is needed” [1].

It seems that the data-driven enterprise, same as the data-driven corporation [2], is just an alternative term for the data-driven organization, a concept already in use for a good number of years. Following the DIKW pyramid, a data-driven organization follows a four-stage evolution from data to information, and further to knowledge and wisdom; of special importance is how knowledge is derived from data and information, the organizations capable of creating, managing and putting knowledge to use being known as knowledge-based organizations. It’s interesting that the paper makes no direct reference to knowledge and information, focusing on data as an asset and possibly ignoring information, respectively knowledge, as assets. I think it would have helped if the concepts from this paper had been anchored within these two contexts as well.

The paper touches several important aspects of Data Management, approaching concepts like “value of data”, “data quality”, “data integration”, “business involvement”, “data trust”, “relevant data”, “timely data”, “virtualized access”, “compliant reporting” and “Business-IT collaboration”, highlighting the importance of having adequate processes, infrastructure and culture in order to bring more value to the business. I totally agree with the importance of these concepts, though I think there are many other aspects that need to be considered. Almost all vendors juggle with such concepts; what’s often missing is the knowledge/wisdom and the method to put philosophies and technologies to use, to redesign an organization’s infrastructure and culture so that they bring the optimum benefit.

Since the appearance of data warehousing concepts, the efficient integration of the various data islands existing within and outside of an organization has become a Holy Grail for IT vendors and organizations, though given the fast pace with which new technologies appear, this hunt looks more like chasing a fata morgana in the desert. Informatica builds a strong case for data integration in general and for Informatica 9 in particular, their new infrastructure platform aiming to enable organizations to become data-driven by providing a centralized architecture for enforcing data policy and addressing issues like data timeliness, format, semantics, privacy and quality [3]. On the other side, the grounds on which Informatica builds its launching strategy could be counter-argued, considering the grey zone they were placed in.

Quantifying Value of Data

How many organizations could say that they can (easily) quantify the real value of their data when there is no market value to benchmark it against? I would say that data has only a potential value, one that increases only with use – once you have learned to explore the data, find patterns and new uses for it, derive knowledge out of it and use that knowledge wisely in order to derive profit and a competitive advantage – and it might take years to get there.

People who witnessed big IT projects like ERP/CRM implementations or data warehousing initiatives have seen how hard their initial expectations were to meet. How much are they willing to invest in an initiative that might prove its value only years later, especially when there are still many organizations fighting the crisis? How could they create a business case for such a project? How much can they rely on the numbers advanced by vendors and on the nice slogans behind their tools, slogans just good for selling a product?

Taking a quote from the video presentation of Sohaib Abbasi, Chairman and CEO at Informatica, “70% of all current SOA initiatives will be restarted or simply abandoned (Gartner)” [3] – and I would bet that many such projects target the integration of the various systems existing in an organization. Once you have had several such bad experiences, how much are you willing to invest in a new one?

There are costs that can be quantified, like the number of hours employees spend on maintaining duplicate data or correcting the issues driven by bad data quality – more generally, the costs related to waste – and there are costs that can’t be quantified so easily, like the costs associated with bad decisions or lost opportunities caused by missing data or an inadequate reflection of reality. There is another aspect: even if organizations manage to quantify such costs, without some transparency on how they arrived at the respective numbers, it feels like somebody just pulled the numbers out of a magician’s hat. It would be great if the quantification of such costs were somehow standardized, though that’s difficult to do given that each organization approaches Data Management from its own perspective and requirements.

From Data to Meaning

Reports are used only to aggregate, analyze and navigate data; it’s the users’ role to give adequate meaning to the data and, together with the data analyst, to find the who, how, when, where, what, why, which and by what means – in a word, to understand the factors that impact the business positively or negatively, the correlations between them, and how they can be strengthened or mitigated in order to achieve better quality and outcomes.

People want nice charts and metrics that give them a bird’s-eye view of the current state, though aggregated data can easily hide the reality, depending on the quality of the data, the quality of the reports themselves and the degree to which they cover the reality. Part of the data-driven philosophy comes down to understanding the data and reacting to the data. I have met people who ignored the data, preferring to take wild guesses – sometimes they were right, other times they were wrong.

From Functionality to Usability

There are users who, once they have a tool, want to find out all about its capabilities, play with the tool, find other uses and even come up with nice-to-have features. There are also users who don’t want to bother getting the data by themselves; they just want the data timely and in the format they need. The fact that Informatica allows users to analyze the data by themselves is quite a selling point, though as I already stressed in a previous post, you can’t expect a user to become a data expert overnight – there are even developers who have difficulties in handling complex data analysis requirements.

The guys from Informatica tried to make this aspect look simple in their presentation, though it’s not as simple as it seems, especially when dealing with ERP systems like Oracle or SAP that have hundreds of tables, each of them with a bunch of attributes and complex relations; one of the important challenges for developers is to reverse-engineer the logic implemented in such systems. It’s a whole mindset that needs to be developed; there are also best practices that need to be considered and specific techniques that allow getting the data in the most efficient way.

Allowing users to decide which logic to apply in their reports could prove to be a double-edged sword, organizations risking ending up with multiple versions of the same story. The various reports need to be aligned and the users brought onto the same page with regard to expectations and constraints. On the other side, some users prefer to prepare the data by themselves because they know the issues existing in the data, or because they have more flexibility in making the data look positive.

Trust, Relevance and Timeliness

An important part of Informatica’s strategy is based on data trust, relevance and timeliness, three important but hard-to-quantify dimensions of Data Quality. Trust is often correlated with users’ perception of the overall data quality: the degree to which the aggregated data presented in reports can be backed up with detailed data, and the visibility users have on the business rules and transformations used. If users can get a feeling for the data with click-through, drill-down or drill-through reports, and if the business rules and transformations are documented, then most probably data trust won’t be an issue anymore. Data relevance and data timeliness are heavily requirement-dependent – for some users it is enough to work with one-week-old data, while others need live data. To a greater or lesser degree, all data used by the business are relevant, otherwise I don’t see why they would be maintained.

Software Tools as Enablers

Sometimes just being aware that there is a problem and doing something to fix it already brings an amount of value to the business, and this without investing in complex technologies, but by handling things methodically and enforcing some management practices – identifying, assessing, addressing, monitoring and controlling issues. I bet this alone could bring a benefit to an organization, and everything starts by recognizing that there is a problem and doing something to fix its root causes. On the other side, software technologies enable performing the various tasks more efficiently and effectively, with better quality, fewer resources, in less time and eventually with lower costs. Now, what’s the value of the saving from addressing the issue, and what’s the value of the saving from using a particular software technology?!

Software tools like Informatica are just enablers; they don’t guarantee results and don’t eliminate barriers unless people know how to use them and make the most of them. For this, experts are needed who know the business and the various software tools involved, as well as good, experienced managers to bring such projects onto the right track. When the objectives are not met or the final solution doesn’t satisfy all requirements, people end up developing alternative solutions, which I categorize as personal solutions – spreadsheets, MS Access applications – an organization ending up with islands of duplicated data/logic. Often users need such solutions in order to understand their data, and this is an area in which Informatica could easily win supporters.

Business-IT Collaboration

It is no news that IT/IM and the other functional departments don’t always function as partners: IT initiatives are not adequately supported by the business, while in many technology-related initiatives driven by the business at corporate level, the IT department is involved only as executor and has little to say in the decision of using one technology or another, many such initiatives ignoring aspects specific to IT – the usability of the solution, its integration with other solutions, the nuances of internal architecture and infrastructure. Of course, phrases like “business struggling in working with IT” appear when IT and the business function as separate entities with a minimum of communication, when the various strategies are not aligned as they are supposed to be.

If you’re not informing the IT department of your expectations, and vice versa, each department will end up addressing issues as they appear rather than proactively, so it will be no wonder when it takes weeks or months until a solution is provided. The responsiveness of IT is strongly correlated with the resources, the existing infrastructure and the policies in place. In addition, for IT to do its work, the business has to share the necessary business knowledge – how can you expect issues to be addressed when even the business is not able to articulate its requirements adequately? In many cases the business figures out what it wants only when a first solution/prototype is provided. It’s an iterative process, and many people ignore this aspect.

No matter the slogans and the concepts vendors juggle with, I’m sorry, but I can’t believe that there is one tool that matches all requirements, that provides a fully integrated solution, or that a tool by itself is sufficient for eliminating the language and collaboration barriers between the business and IT!

Human Resources & Co.

Many organizations don’t have in-house the human resources needed for the various projects related to Data Management, therefore they bring in consultants or outsource parts of the projects. A consultant needs time to understand the processes existing in an organization and the organization’s particularities. Even if business analysts manage to distill the requirements into solid specifications, it’s difficult to cover all the aspects without a deep knowledge of the architecture used, just as it is difficult for consultants to put the pieces of the puzzle together, especially when many of the pieces are missing. Consultants generally expect to have all the pieces of the puzzle, while the other side expects consultants to identify the missing pieces.

When outsourcing tasks (e.g. data analysis) or data-related infrastructure (e.g. data warehouses, data marts), an organization risks losing control over what’s happening, the communication issues being reflected in longer cycle times for issue resolution, making everything a challenge. There are many other issues related to outsourcing that perhaps deserve to be addressed in detail.

The Lack of Vision, Policy and Strategy

An organization needs a vision, a policy and a strategy toward data quality in particular and Data Management in general, in order to plan, enforce and coordinate the overall effort toward quality. Their lack can have an unpredictable impact on the information systems and the reporting infrastructure in particular, and on the business as a whole; without them, data quality initiatives can have a local and narrow scope, lack the expected effectiveness, and result in rework and failure stories. The saying “it’s better to prevent than to cure” best captures the philosophy on which Data Management should be centered.

Lack of Ownership

In the context of the lack of policy and strategy can also be placed the lack of ownership, though given its importance it deserves special attention. The saying “each employee is responsible for quality” applies to data quality too: each user and department needs to take ownership of the data they have to maintain, for their own or other departments’ purposes, just as they have to take ownership of the reports that fall within the scope of their work, assure their quality and the related documentation, and of the explicit and implicit islands of knowledge existing in the organization.

References:
[1] Informatica. (2009). The Data-Driven Enterprise. [Online] Available from: http://www.informatica.com/downloads/7060_data_driven_wp_web.pdf (Accessed: 6 February 2010).
[2] Herzler. (2006). Eight Aspects of the Data Driven Corporation – Exploring your Gap to Entitlement. [Online] Available from: http://www.hertzler.com/php/portfolio/white.paper.detail.php?article=31 (Accessed: 6 February 2010).
[3] Informatica. (2009). Informatica 9: Infrastructure Platform for the Data-Driven Enterprise, Speaker: Sohaib Abbasi, Chairman and CEO. [Online] Available from: http://www.informatica.com/9/thelibrary.html#page=page-5 (Accessed: 6 February 2010).

05 February 2010

🎡SSIS: Wizarding an SSIS Package (First Magic Class)

I'm sorry to disappoint you, but this is not another story about Harry Potter! It’s about how to create an SSIS package without any knowledge of SSIS or of the components of an SSIS package, with the help of the ‘SQL Server Import and Export Wizard’ available from SQL Server 2008 Management Studio. Here are the steps to be followed in order to create a package that exports the data from a local table to an Excel file.

Step 1: Start SQL Server Import and Export Wizard

From SQL Server Management Studio choose the database from which you want to export the data (e.g. AdventureWorks), right click on it and from the context menu choose Tasks/Export Data. This action brings up the ‘SQL Server Import and Export Wizard’ used, as its name indicates, for importing and exporting data on the fly from/to SQL Server or any other source.
Step 2: Choose a Data Source

In the ‘Choose a Data Source’ step select the ‘Data Source’ – SQL Server Native Client 10.0 for exporting data from SQL Server – choose the ‘Server name’ from the list of available SQL Servers, select the Authentication mode and the Database (e.g. AdventureWorks), then proceed to the next step by clicking ‘Next’.
Step 3: Choose a Destination

In the ‘Choose a Destination’ step select the Destination, in this case ‘Microsoft Excel’, browse for the Excel file to which you want to export the data (e.g. Products.xls) and check ‘First row has column names’ in case you want to include the column names, then proceed to the next step by clicking ‘Next’.
Step 4: Specify Table Copy or Query

In the ‘Specify Table Copy or Query’ step there are two options to choose from: ‘Copy data from one or more tables or views’ or ‘Write a query to specify the data to transfer’ – their names speak for themselves. The second option allows for more flexibility, as you could just copy and paste the query used for your export. In this example go with the first option, then proceed to the next step by clicking ‘Next’.
Step 5: Select Source Tables and Views

In the ‘Select Source Tables and Views’ step select the database objects (e.g. Production.vProducts) from which you’ll export the data, then proceed to the next step by clicking ‘Next’. If the destination allows it, it’s possible to choose more than one database object.
Step 6: Review Data Type Mapping

In the ‘Review Data Type Mapping’ step you can typically go with the provided defaults, thus you can proceed to the next step by clicking ‘Next’.
Step 7: Save and Run Package

In the ‘Save and Run Package’ step you can choose to ‘Run immediately’ the package and/or ‘Save SSIS Package’, either to ‘SQL Server’ or locally to the ‘File System’. Saving the package locally allows you to modify and rerun the package at a later date. For this example check the ‘Run immediately’, respectively the ‘Save SSIS Package’ and ‘File System’ options, then proceed to the next step by clicking ‘Next’.
Step 8: Save SSIS Package

In the ‘Save SSIS Package’ step provide the intended Name (e.g. Export Products) and optionally a Description for the package, choose the location where the package will be saved, then proceed to the next step by clicking ‘Next’.
Step 9: Complete the Wizard

The ‘Complete the Wizard’ step allows reviewing the choices made in the previous steps and, if necessary, navigating back to a previous step in order to make the needed changes. Once you have reviewed the details, proceed to the next step by clicking ‘Finish’.
Step 10: Executing the Package

In the ‘Executing the Package’ step the package is run, the progress being shown for each step. Just close the wizard once you have reviewed the steps.
If the package ran successfully you can go on and check the exported data, apply additional formatting, etc.

Step 11: Inspecting the Package

The package can be reopened and modified in Microsoft Visual Studio or SQL Server Business Intelligence Development Studio. For this just open one of the two environments and from the main menu select File/Open File, then browse to the location where the package (e.g. Export Products.dtsx) was saved and open the file. Here is the package created by the Wizard:
As can be seen, the package contains two tasks: the ‘Preparation SQL Task 1’, which creates the table (e.g. vProducts) in the Excel file, and the ‘Data Flow Task 1’, which dumps the data into the created table. By double clicking the ‘Data Flow Task 1’ you can see its content:
- the ‘Source - vProducts’ OLE DB Data Source, holding the connectivity information to the SQL Server and the list of columns in scope;
- the ‘Data Conversion’ transformation, which converts the data between Source and Destination, given the fact that Excel has different data types than SQL Server;
- the ‘Destination - vProducts’ Excel Destination, holding the connectivity information for the Excel file and the mappings in place.
You can explore the properties of each object in order to learn more about the attributes used and the values they take.

OK, so you’ve opened the package – how do you run it again? It is enough to double click on the locally saved package (e.g. Export Products.dtsx), an action that brings up the ‘Execute Package Utility’, and click on the ‘Execute’ button to run the package.
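
The same thing can be done from the command line with the dtexec utility shipped with SQL Server – a minimal sketch, assuming the package was saved under C:\Packages (adjust the path to your own location):

dtexec /File "C:\Packages\Export Products.dtsx"
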
If you haven’t made any changes to the Excel file to which you moved the data (e.g. Products.xls) and haven’t moved the file, then the package will fail when it attempts to run the ‘Preparation SQL Task 1’ task, because a table with the expected name already exists in the file.
In order to avoid this error you’ll have to delete from the Excel file the sheet (e.g. vProducts) where the data were dumped and save the file. Now you can execute the package again and it should run without further issues.

If you’ve removed/deleted the Excel file, then even if you get a warning that the Excel file is not available, the package will still run without problems in the end. Instead of deleting the sheet manually you could use a File System Task to move the file to another location, though that’s a topic for another post.

🕋Data Warehousing: Star Schema (Definitions)

 "A relational database structure in which data is maintained in a single fact table at the center of the schema with additional dimension data stored in dimension tables. Each dimension table is directly related to the fact table by a key column." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A star schema is a dimensional data model implemented on a relational database." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A star schema is a set of tables comprised of a single, central fact table surrounded by dimension tables. Each dimension is represented by a single dimension table. Star schemas implement dimensional data structures with denormalized dimensions. Snowflake schemas are an alternative to a star schema design." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A single fact table surrounded by a single hierarchical layer of dimensional tables, in a data warehouse database." (Gavin Powell, "Beginning Database Design", 2006)

"A single fact table which joins to many dimension tables, each of which is a single denormalized dimension table." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"The instantiation of a dimensional model in a relational database. A star schema consists of a fact table and the dimension tables that it references. The fact table contains facts and foreign keys; the dimension tables contain dimensional attributes by which the facts will be filtered, rolled up, or grouped." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A single fact table which joins to many dimension tables, each of which is a single denormalized dimension table." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"The implementation of a dimensional model in a relational database. The tables are organized around a single central fact table possessing a multi-part key, and each surrounding dimension table has its own primary key." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"The basic form of data organization for a data warehouse, consisting of a single large fact table and many smaller dimension tables." (Toby J Teorey, ", Database Modeling and Design 4th Ed", 2010)

"The arrangement of the collection of fact and dimension tables in the dimensional data model, resembling a star formation, with the fact table placed in the middle surrounded by the dimension tables. Each dimension table is in a one-to-many relationship with the fact table." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"The basic form of data organization for a data warehouse, consisting of a single large fact table and many smaller dimension tables." (Toby J Teorey, ", Database Modeling and Design" 4th Ed., 2010)

"A common form of a dimensional data model, where a fact table is directly linked by foreign keys to several dimension tables." (Craig S Mullins, "Database Administration", 2012)

"A relational database structure in which data is maintained in a single fact table at the center of the schema with additional dimension data stored in dimension tables. Each dimension table is directly related to and usually joined to the fact table by a key column." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A relational schema whose design represents a dimensional data model. The star schema consists of one or more fact tables and one or more dimension tables that are related through foreign keys." (Oracle, "Database SQL Tuning Guide Glossary", 2013)

"A type of relational database schema that is composed of a set of tables comprising a single, central fact table surrounded by dimension tables. See also dimension table, star join." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

💎SQL Reloaded: Just in CASE IV (Other Scenarios)

    Two days ago I was looking at random posts on the Web to see whether I had missed something important related to the CASE function, and I stumbled over an old post of Craig S. Mullins; in one of the examples (see SQL Statement #5) there is a query similar to the one below:
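
    A sketch of such a query, based on AdventureWorks’ Production.Product and Production.ProductInventory tables (the thresholds of 500 and 100 are illustrative, not Mullins’ exact figures):

SELECT ITM.ProductID
, ITM.Name
, CASE
    WHEN (SELECT SUM(PIN.Quantity)
          FROM Production.ProductInventory PIN
          WHERE PIN.ProductID = ITM.ProductID) > 500 THEN 'High Stock'
    WHEN (SELECT SUM(PIN.Quantity)
          FROM Production.ProductInventory PIN
          WHERE PIN.ProductID = ITM.ProductID) BETWEEN 100 AND 500 THEN 'Normal Stock'
    WHEN (SELECT SUM(PIN.Quantity)
          FROM Production.ProductInventory PIN
          WHERE PIN.ProductID = ITM.ProductID) > 0 THEN 'Low Stock'
    ELSE 'No Stock' -- also covers the NULL returned when no inventory record exists
  END AS StockLevel
FROM Production.Product ITM
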
    As can be seen, on more than one branch of the CASE there is a correlated sub-query based on the same table, the aggregated value being used for further comparisons. Most probably the respective query had only the purpose of demonstrating a technique, though I have mainly two observations related to it:
1. The query returns a label indicating whether the Stock is High, Normal, Low or there is no Stock, though there is no measure of what the stock actually is, thus in a first phase I can’t validate whether the query shows the correct data. If a report is based on the respective query, this also reduces the report's usability considerably.
2. A SELECT is replicated across several CASE branches, a fact that increases the query’s complexity, reflected in the query’s understandability and maintainability, while the need to keep the logic in synchronization increases the chances of making a mistake. At least in this situation, with only one table involved, the logic is quite simple – though what do you do when the correlated sub-query is more complex?

    How about encapsulating the query’s logic in a sub-query and joining it to the main table using a LEFT JOIN? This approach would address both issues mentioned above:
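
    A possible rewriting (the TotalQuantity alias is mine):

SELECT ITM.ProductID
, ITM.Name
, PIN.TotalQuantity
, CASE
    WHEN PIN.TotalQuantity > 500 THEN 'High Stock'
    WHEN PIN.TotalQuantity BETWEEN 100 AND 500 THEN 'Normal Stock'
    WHEN PIN.TotalQuantity > 0 THEN 'Low Stock'
    ELSE 'No Stock'
  END AS StockLevel
FROM Production.Product ITM
     LEFT JOIN ( -- the aggregation logic is stated only once
        SELECT PIN.ProductID
        , SUM(PIN.Quantity) AS TotalQuantity
        FROM Production.ProductInventory PIN
        GROUP BY PIN.ProductID
     ) PIN
       ON PIN.ProductID = ITM.ProductID

    The actual quantity can now be shown next to the label, and the aggregation logic is maintained in a single place.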
    When creating a query there is always an important fact that needs to be considered – the query’s performance. The second query is simpler and in theory it should be easier to process, so I expect it to have at least similar performance to the first version. In this case ProductID is a foreign key in Production.ProductInventory, the search performed on the respective table having minimal impact on performance; if no index were available on the searched attribute, the first query’s performance would more likely decrease considerably. The best approach for assessing the performance differences between the two queries is to look at the Client Statistics and the Execution Plan. As concerns the Client Statistics, both queries have similar performance, while the Execution Plan of the second query is simpler and, without going into details, it seems to be the better plan.

    Maybe it makes sense to use correlated sub-queries in a CASE only when the correlated sub-queries have distinct logic, though even then I would recommend using a LEFT JOIN instead, this technique allowing more flexibility: it is possible to show the actual values in the query and even reuse them, if needed, in additional calculations.

    Talking about correlated queries, I have found cases in which two distinct correlated sub-queries with the same logic were used to pull two distinct attributes available in the same table, something like in the below example.
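
    A sketch of this pattern, again on Production.ProductInventory (the chosen attributes are illustrative):

SELECT ITM.ProductID
, ITM.Name
, (SELECT SUM(PIN.Quantity)
   FROM Production.ProductInventory PIN
   WHERE PIN.ProductID = ITM.ProductID) AS TotalQuantity
, (SELECT MAX(PIN.ModifiedDate)
   FROM Production.ProductInventory PIN
   WHERE PIN.ProductID = ITM.ProductID) AS LastInventoryDate
FROM Production.Product ITM
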
    What if the correlated sub-queries are used together with the CASE function, like in the below example?
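
    For example, returning the last inventory date only for the products that have stock – a sketch:

SELECT ITM.ProductID
, ITM.Name
, CASE
    WHEN (SELECT SUM(PIN.Quantity)
          FROM Production.ProductInventory PIN
          WHERE PIN.ProductID = ITM.ProductID) > 0
      THEN (SELECT MAX(PIN.ModifiedDate)
            FROM Production.ProductInventory PIN
            WHERE PIN.ProductID = ITM.ProductID)
  END AS LastInventoryDate
FROM Production.Product ITM
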
    Even if the number of calls to the correlated sub-queries is perhaps reduced by using a CASE, I would recommend using a LEFT JOIN instead, this technique offering more flexibility, the logic becoming also much easier to maintain and debug.

04 February 2010

💎SQL Reloaded: Just In CASE III (Quest for Performance)

Arranging branches

    Given the fact that the CASE function stops after the first match – provided there are no two expressions that evaluate to true for the same value – a CASE can be optimized for best performance by arranging the branches in descending order of the number of matches. For example, taking the logic from the 3rd example from Part I and checking the number of occurrences for each value, surprisingly (or maybe not) there are 257 NULL values, 97 ‘L’ values, 82 ‘H’ values, respectively 68 ‘M’ values. If we rearrange the branches in this order, the number of evaluations decreases from 82*1+68*2+97*3+257*3=1280 to 257*1+97*2+82*3+68*3=901; the difference is small, though when applied to big data sets the improvement could pay for the effort. Here is the modified logic for Example 3:
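
    A sketch of the reordered logic – the NULL check requires a searched CASE branch, while ‘M’, the least frequent value, is left for the ELSE:

SELECT ITM.ProductID
, ITM.Name
, CASE
    WHEN ITM.Class IS NULL THEN 'n/a' -- 257 matches, evaluated first
    WHEN ITM.Class = 'L' THEN 'Low'   -- 97 matches
    WHEN ITM.Class = 'H' THEN 'High'  -- 82 matches
    ELSE 'Medium'                     -- the remaining 68 'M' values
  END AS ClassDescription
FROM Production.Product ITM
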
    This technique is unfortunately not bulletproof – even if it’s less probable in many cases, the distribution of the values’ occurrences could change overnight, the total number of evaluations varying in time with the number of occurrences of each value.

Rewriting Simple Unions

    Have you ever seen a set of similar queries whose output is merged with the help of the UNION (ALL) operator? I’ve seen so many queries like the one below, in which the only difference between the sub-queries resided in the WHERE constraint(s) applied and maybe a few additional calculations!
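
    A minimal sketch of the pattern, reusing the Class attribute from the previous posts:

SELECT ITM.ProductID
, ITM.Name
, 'High' AS ClassDescription
FROM Production.Product ITM
WHERE ITM.Class = 'H'
UNION ALL
SELECT ITM.ProductID
, ITM.Name
, 'Medium'
FROM Production.Product ITM
WHERE ITM.Class = 'M'
UNION ALL
SELECT ITM.ProductID
, ITM.Name
, 'Low'
FROM Production.Product ITM
WHERE ITM.Class = 'L'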

    The example is quite simple, though I think it demonstrates the point. The UNION could be translated to a simple CASE, as in the 3rd example. The problem with such queries is that if there is no index on the attribute(s) used in the WHERE constraint, a full table scan will be performed for each sub-query, which is quite expensive from a performance standpoint. Even if the respective attribute is indexed, there’s still a price to pay in performance. Such an approach could maybe be acceptable when in each scenario different tables are joined to the main table, though even then it should be checked which of the two approaches has better performance. A recent example I can recall, in which several CASE functions could have been used instead of the UNION, was based on the JOIN of two tables, the query looking something like:
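
    Something along these lines – a hypothetical reconstruction, using the product/subcategory tables only for illustration:

SELECT ITM.ProductID
, ITM.Name
, PSC.Name AS Subcategory
, 'Manufactured' AS ProductType
FROM Production.Product ITM
     JOIN Production.ProductSubcategory PSC
       ON PSC.ProductSubcategoryID = ITM.ProductSubcategoryID
WHERE ITM.MakeFlag = 1
UNION ALL
SELECT ITM.ProductID
, ITM.Name
, PSC.Name
, 'Purchased'
FROM Production.Product ITM
     JOIN Production.ProductSubcategory PSC
       ON PSC.ProductSubcategoryID = ITM.ProductSubcategoryID
WHERE ITM.MakeFlag = 0

    The whole UNION reduces to a single SELECT with a CASE on the MakeFlag attribute.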


Rewriting Complex Unions

    Even when the base table is used in several UNION-merged queries, it doesn’t make sense to apply this technique all the time, especially when dealing with complex queries using multiple joins. It depends – there are scenarios and scenarios; the trade-off in performance, readability and usability of a query must always be considered when applying any technique. The Person.Address table from AdventureWorks is a good way to exemplify two scenarios, one in which it makes sense and one in which it doesn’t make sense to combine the logic of two union queries. The respective table stores the addresses for Customers (Stores & Individuals), Vendors and Employees, thus supposing that we would like to see the Owner and Owner Type for each Address, we would need to create several unions.

    Let’s consider first the Customer-based Addresses – as there are two types of Customers, and because the details for each type are stored in different tables, we might end up creating the following query:
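
    A sketch, assuming the AdventureWorks 2005/2008 schema (Sales.CustomerAddress links Customers to Addresses; CustomerType is ‘S’ for Stores and ‘I’ for Individuals):

SELECT ADR.AddressID
, ADR.AddressLine1
, ADR.City
, SST.Name AS Owner
, 'Store Customer' AS OwnerType
FROM Person.Address ADR
     JOIN Sales.CustomerAddress SCA
       ON SCA.AddressID = ADR.AddressID
     JOIN Sales.Customer SCT
       ON SCT.CustomerID = SCA.CustomerID
      AND SCT.CustomerType = 'S'
     JOIN Sales.Store SST
       ON SST.CustomerID = SCT.CustomerID
UNION ALL
SELECT ADR.AddressID
, ADR.AddressLine1
, ADR.City
, PCT.FirstName + ' ' + PCT.LastName
, 'Individual Customer'
FROM Person.Address ADR
     JOIN Sales.CustomerAddress SCA
       ON SCA.AddressID = ADR.AddressID
     JOIN Sales.Customer SCT
       ON SCT.CustomerID = SCA.CustomerID
      AND SCT.CustomerType = 'I'
     JOIN Sales.Individual SIN
       ON SIN.CustomerID = SCT.CustomerID
     JOIN Person.Contact PCT
       ON PCT.ContactID = SIN.ContactID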

    As can be seen, the two queries are quite similar; most of the joined tables are the same, with 3 exceptions: Sales.Store vs. Sales.Individual & Person.Contact. The query could be rewritten using a CASE and LEFT JOINs instead of INNER JOINs, as follows:
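
    A sketch of the rewriting:

SELECT ADR.AddressID
, ADR.AddressLine1
, ADR.City
, CASE SCT.CustomerType
    WHEN 'S' THEN SST.Name
    WHEN 'I' THEN PCT.FirstName + ' ' + PCT.LastName
  END AS Owner
, CASE SCT.CustomerType
    WHEN 'S' THEN 'Store Customer'
    WHEN 'I' THEN 'Individual Customer'
  END AS OwnerType
FROM Person.Address ADR
     JOIN Sales.CustomerAddress SCA
       ON SCA.AddressID = ADR.AddressID
     JOIN Sales.Customer SCT
       ON SCT.CustomerID = SCA.CustomerID
     LEFT JOIN Sales.Store SST
       ON SST.CustomerID = SCT.CustomerID
     LEFT JOIN Sales.Individual SIN
       ON SIN.CustomerID = SCT.CustomerID
     LEFT JOIN Person.Contact PCT
       ON PCT.ContactID = SIN.ContactID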


    Similar queries could be written in order to get the Vendor and Employee details, and as can be seen, these queries too share many similarities.
    Does it make sense to use the above technique also in this case? Maybe it would be acceptable to merge the Vendor with the Employee Addresses queries, though does it make sense to also merge the Customer Addresses using CASE & LEFT JOINs? The queries are not so complex, so it’s possible to do that, though I think a decision should be taken only after the performance of the two approaches has been studied.

Pivoting Behavior

    Another technique in which a CASE could help eliminate several joins to the same table is presented in a post on List Based Aggregations, the CASE function being used together with aggregate functions in order to cumulate the On Hand quantity for several locations and show it on the same line. This approach was quite useful before the PIVOT operator was introduced in SQL Server 2005 (see the PIVOT operator example), and it still is when more than one attribute needs to be selected for each line.
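
    A sketch of the technique on Production.ProductInventory – the LocationID values are illustrative:

SELECT PIN.ProductID
, SUM(CASE PIN.LocationID WHEN 1 THEN PIN.Quantity ELSE 0 END) AS Location1OnHand
, SUM(CASE PIN.LocationID WHEN 6 THEN PIN.Quantity ELSE 0 END) AS Location6OnHand
, SUM(CASE PIN.LocationID WHEN 50 THEN PIN.Quantity ELSE 0 END) AS Location50OnHand
, SUM(PIN.Quantity) AS TotalOnHand
FROM Production.ProductInventory PIN
GROUP BY PIN.ProductID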

💎SQL Reloaded: Just In CASE II (Clauses and Joins)

The CASE function can appear in all the typical clauses – SELECT, WHERE, GROUP BY, ORDER BY – and even in JOIN predicates. If the impact on performance is quite small when CASE is used inside a SELECT clause, the decrease in performance could be quite high when the CASE is used in searches – WHERE constraints or JOIN predicates. Oracle provides function-based indexes, and the smart use of such indexes could in theory improve searches performed with CASE/DECODE functions; unfortunately, as far as I know, there is no similar feature available in the MS SQL Server family.

The use of the CASE function in the WHERE clause is very rare, because it can usually be replaced with several OR operators, the only benefit resulting maybe from the clearness of the syntax or from checking the values’ type before using them – the case of numeric and date values. Even if the below example doesn’t make sense from a business point of view, it shows a possible use of CASE within the WHERE clause.
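
A sketch (the thresholds are arbitrary):

SELECT ITM.ProductID
, ITM.Name
, ITM.Class
, ITM.StandardCost
FROM Production.Product ITM
WHERE CASE ITM.Class
        WHEN 'H' THEN ITM.StandardCost
        WHEN 'M' THEN ITM.StandardCost * 2
        ELSE ITM.StandardCost * 4
      END > 100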

 
There are systems that misuse an alphanumeric attribute in order to store a numeric or date value; therefore, before using the respective values in computations it’s necessary/recommended to test whether the values have the expected data type. I have also met cases in which the range of dates covered by a date type in one RDBMS is narrower than the similar date type of another RDBMS (e.g. SQL Server vs. Oracle); in such cases the test for valid dates is also necessary. For this purpose the CASE can be used together with the IsDate, respectively IsNumeric, functions, as in the below examples.
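
Two sketches, with a @value variable standing in for the misused alphanumeric attribute:

DECLARE @value varchar(50)
SET @value = '20100204'

-- numeric values: cast only when the value is actually numeric
SELECT CASE
         WHEN IsNumeric(@value) = 1 THEN Cast(@value AS int)
       END AS NumericValue

-- date values: cast only when the value is a valid date
SELECT CASE
         WHEN IsDate(@value) = 1 THEN Cast(@value AS datetime)
       END AS DateValue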


The same data type handling can be used also in joins when, with the misuse of alphanumeric data types, the numeric values represent references to other tables, or the date values are used in the logic of a join. When the misused attribute is in the main table, the impact on performance is minimal, though when the misused attribute appears in the referenced table, the performance could be impacted considerably, because it is less likely that an index is built on the respective alphanumeric attribute, a full table scan being done instead. In order to exemplify this situation I will use the dbo.SplitList table-valued function, which takes a list of values and creates a table out of it, and join it with the Production.Product table. I had to cast the Value attribute from dbo.SplitList to an integer in order to simulate the handling of numeric values.
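
A sketch, assuming a SplitList(@list, @delimiter) signature returning a Value column:

SELECT LST.Value
, ITM.ProductID
, ITM.Name
FROM dbo.SplitList('1,2,316,317,318', ',') LST
     LEFT JOIN Production.Product ITM
       ON ITM.ProductID = CASE -- cast only the values that are actually numeric
                            WHEN IsNumeric(LST.Value) = 1 THEN Cast(LST.Value AS int)
                          END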


Even if this use might look strange to some people, it isn’t far-fetched; I have met many cases in which users wanted to pull, for example, all the Products available in the system having the values of a given attribute in a given list, and they wanted to see also the values that had no match. In theory it would be easier to rewrite the query as follows, and use a VLOOKUP function in Excel to determine the values with no match.
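
The simplified query, returning only the matching Products:

SELECT ITM.ProductID
, ITM.Name
FROM dbo.SplitList('1,2,316,317,318', ',') LST
     JOIN Production.Product ITM
       ON ITM.ProductID = Cast(LST.Value AS int)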


Of course, in this case it would have been much easier to just use the IN operator and provide the list of values.

03 February 2010

🕋Data Warehousing: Snowflake Schema (Definitions)

"An extension of a star schema such that one or more dimensions are defined by multiple tables. In a snowflake schema, only primary dimension tables are joined to the fact table. Additional dimension tables are joined to primary dimension tables. " (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A variation on dimensional schema design in which dimension tables are further normalized, split into multiple tables based on hierarchies in the data." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A single fact table which joins to many dimension tables, with each dimension normalized as two or more tables." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A variation of the star schema in which the business dimensions are implemented as a set of normalized tables. The resulting diagram resembles a snowflake." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"A normalized version of the STAR schema in which dimension tables are partially or fully normalized. Not generally recommended because it compromises query performance and simplicity for understanding." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A star schema in which dimension tables reference other tables." (Oracle, "Database SQL Tuning Guide Glossary", 2013)

💎SQL Reloaded: Ways of Looking at Data III

In the previous posts from the same cycle (see Part I and Part II) I showed how the ranking window functions can be used in SQL Server in order to show the first/last record appearing in a certain partition. A broader set of reports are the ones focusing on the top n records matching certain criteria, for example the top 10 Vendors/Products based on the amount purchased. Before the introduction of ranking window functions in SQL Server 2005, such requirements were addressed using the TOP clause. The following two queries show the top 10 Vendors, respectively the top 10 Vendors/Products, based on the amount purchased, the first query focusing on the active Vendors while the second on all Vendors – it depends on the requirements whether only the active or all Vendors are needed, therefore the users should always be asked about this concern.
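
Two sketches against AdventureWorks’ purchasing tables (the ActiveFlag filter marks the active Vendors):

-- top 10 active Vendors based on the amount purchased
SELECT TOP 10 VND.VendorID
, VND.Name
, SUM(POH.SubTotal) AS TotalPurchase
FROM Purchasing.PurchaseOrderHeader POH
     JOIN Purchasing.Vendor VND
       ON VND.VendorID = POH.VendorID
WHERE VND.ActiveFlag = 1
GROUP BY VND.VendorID
, VND.Name
ORDER BY TotalPurchase DESC

-- top 10 Vendors/Products based on the amount purchased (all Vendors)
SELECT TOP 10 VND.VendorID
, VND.Name
, POD.ProductID
, SUM(POD.LineTotal) AS TotalPurchase
FROM Purchasing.PurchaseOrderHeader POH
     JOIN Purchasing.PurchaseOrderDetail POD
       ON POD.PurchaseOrderID = POH.PurchaseOrderID
     JOIN Purchasing.Vendor VND
       ON VND.VendorID = POH.VendorID
GROUP BY VND.VendorID
, VND.Name
, POD.ProductID
ORDER BY TotalPurchase DESC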

In contrast, the use of ranking/aggregate window functions allows more flexibility in selecting not only the first n Vendors, but any interval, thus making it possible to use this feature as a simple pagination technique. Here is the first query modified:
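
A sketch using the RANK ranking window function:

SELECT DAT.*
FROM (
    SELECT VND.VendorID
    , VND.Name
    , SUM(POH.SubTotal) AS TotalPurchase
    , RANK() OVER(ORDER BY SUM(POH.SubTotal) DESC) AS Ranking
    FROM Purchasing.PurchaseOrderHeader POH
         JOIN Purchasing.Vendor VND
           ON VND.VendorID = POH.VendorID
    WHERE VND.ActiveFlag = 1
    GROUP BY VND.VendorID
    , VND.Name
) DAT
WHERE DAT.Ranking BETWEEN 1 AND 10 -- any interval works, e.g. 11 AND 20
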
As for the second query, the TOP clause allows selecting only the overall top 10 Vendors/Products based on the amount purchased; it is not possible to select the top n Vendors for each purchased Product, as the ranking functions allow. Here is the query for the top 3 Vendors for each Product:
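
A sketch:

SELECT DAT.*
FROM (
    SELECT POD.ProductID
    , VND.VendorID
    , VND.Name
    , SUM(POD.LineTotal) AS TotalPurchase
    , RANK() OVER(PARTITION BY POD.ProductID ORDER BY SUM(POD.LineTotal) DESC) AS Ranking
    FROM Purchasing.PurchaseOrderHeader POH
         JOIN Purchasing.PurchaseOrderDetail POD
           ON POD.PurchaseOrderID = POH.PurchaseOrderID
         JOIN Purchasing.Vendor VND
           ON VND.VendorID = POH.VendorID
    GROUP BY POD.ProductID
    , VND.VendorID
    , VND.Name
) DAT
WHERE DAT.Ranking <= 3
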
The logic from the current example could be reused in further analysis, therefore it makes sense to encapsulate it in a database object. A view would do, though if we would like to use the number of Vendors as a parameter, then it’s more appropriate to use a table-valued function or a stored procedure. Even if in theory stored procedures provide better performance, if we would like to include the above logic in further queries, then creating a table-valued function is the best choice.
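
A sketch of such a function (the fGetTopProductVendors name is mine):

CREATE FUNCTION dbo.fGetTopProductVendors(
  @maxRank int)
RETURNS TABLE
AS
RETURN (
    SELECT DAT.*
    FROM (
        SELECT POD.ProductID
        , VND.VendorID
        , VND.Name
        , SUM(POD.LineTotal) AS TotalPurchase
        , RANK() OVER(PARTITION BY POD.ProductID ORDER BY SUM(POD.LineTotal) DESC) AS Ranking
        FROM Purchasing.PurchaseOrderHeader POH
             JOIN Purchasing.PurchaseOrderDetail POD
               ON POD.PurchaseOrderID = POH.PurchaseOrderID
             JOIN Purchasing.Vendor VND
               ON VND.VendorID = POH.VendorID
        GROUP BY POD.ProductID
        , VND.VendorID
        , VND.Name
    ) DAT
    WHERE DAT.Ranking <= @maxRank
)
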
Once the table-valued function is created, we can go on and use it in other queries; for example, here is a simple query showing a maximum of 5 Vendors for each Product:
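
A sketch:

SELECT ITM.ProductID
, ITM.Name AS ProductName
, TPV.Name AS VendorName
, TPV.TotalPurchase
, TPV.Ranking
FROM Production.Product ITM
     JOIN dbo.fGetTopProductVendors(5) TPV
       ON TPV.ProductID = ITM.ProductID
ORDER BY ITM.Name
, TPV.Ranking
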
Most probably the users accustomed to the pivoting functionality of Excel will ask whether it’s possible to show all the Vendors together on one line. Starting with SQL Server 2005, Microsoft introduced the PIVOT operator, though it can pivot only one attribute; fortunately there is an alternative using a GROUP BY together with the CASE function:
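
A sketch for the first 3 Vendors per Product – the CASE without an ELSE returns NULL for non-matching rankings, so MAX picks the one Vendor per column:

SELECT TPV.ProductID
, MAX(CASE TPV.Ranking WHEN 1 THEN TPV.Name END) AS Vendor1
, MAX(CASE TPV.Ranking WHEN 2 THEN TPV.Name END) AS Vendor2
, MAX(CASE TPV.Ranking WHEN 3 THEN TPV.Name END) AS Vendor3
FROM dbo.fGetTopProductVendors(3) TPV
GROUP BY TPV.ProductID
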
The downside of this technique is that you can work only with a predefined number of Vendors per Product (here 3); on the other side, there are few cases in which more than 5 Vendors exist for the same Product. An alternative would be to process the data using a programming language such as VBA, not a complicated thing to do at all. In almost 10 years of experience I have found only a few cases in which the functionality provided by the RDBMS needed to be extended with additional coding outside of the RDBMS (e.g. Excel, ASP), mainly when the data needed to be organized beyond a simple tabular structure.

🕋Data Warehousing: Data Mart [DM] (Definitions)

"A subset of the contents of a data warehouse, stored within its database. A data mart tends to contain data focused at the department level, or on a specific business area. It is frequently implemented to manage the volume and scope of data." (Microsoft Corporation, "Microsoft SQL Server 7.0 System Administration Training Kit", 1999)

"A data warehouse, or repository, whose scope is limited to a single subject area." (William A Giovinazzo, "Internet-Enabled Business Intelligence", 2002)

"A type of data warehouse with data specifically designed for a defined set of functions." (Margaret Y Chu, "Blissful Data ", 2004)

[dependent data mart:] "A data mart that obtains its source data from an Enterprise Data Warehouse." (Margaret Y Chu, "Blissful Data ", 2004)

[independent data mart:] "A data mart that obtains its source data from operational systems or other external media." (Margaret Y Chu, "Blissful Data ", 2004)

"A database that contains a copy of operational data, organized to support analysis of a business process. A data mart may be a subject area within an enterprise data warehouse, or an analytic database that is departmentally focused. When not planned as part of an enterprise data warehouse, a data mart may become a stovepipe. When deployed as an adjunct to a normalized data warehouse, a data mart may contain aggregated data. When built around a conformance bus, the data mart is neither a stovepipe nor an aggregation." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

[stovepipe data mart:] "A departmentally focused data warehouse implementation that does not interoperate with other subject areas. Stovepipes are avoided through the design of a data warehouse bus - a set of conformed dimensions used consistently across subject areas." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A platform that maintains data for analysis by a single organization or user group for a specific set of business purposes." (Jill Dyché & Evan Levy, "Customer Data Integration", 2006)

"A departmentalized structure of data feeding from the data warehouse where data is denormalized based on the department’s need for information." (William H Inmon & Anthony Nesavich, "Tapping into Unstructured Data", 2007)

"A data mart is a specialized database containing a subset of data from a data warehouse that is needed for a particular business purpose. A data mart is used for reporting and analysis of business data." (Allen Dreibelbis et al, "Enterprise Master Data Management", 2008)

"A smaller data warehouse that holds data of interest to a particular group." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A specialized type of data warehouse that works with a specific set of data to answer a specific need. A data mart is designed to provide quick, easy access to crucial data." (Stuart Mudie et al, "BusinessObjects™ XI Release 2 for Dummies", 2008) 

"A collection of related data from internal and external sources, transformed, integrated, and stored for the purpose of providing strategic information to a specific set of users in an enterprise." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A database in a data warehouse configuration that holds a subset of data specifically organized for a particular kind of reporting." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"A database structured for specific analysis and historical reporting needs." (David Lyle & John G Schmidt, "Lean Integration", 2010)

"A smaller, more specialized version of a data warehouse that includes data from a specific functional area or department." (Ken Withee, "Microsoft Business Intelligence For Dummies", 2010)

"A decision support database supporting Business Intelligence in a limited subject area, using a dimensional data model design." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Small data warehouse designed to support a department or SBU." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed, 2011)

"A subset of a data warehouse that is designed to focus on a specific set of business information." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"A subset of a data warehouse that’s usually oriented to a business group or team." (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

[dependent data mart:] "A data mart whose sole source of data is the data warehouse; a dependent data mart is a component of the corporate information factory" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

[independent data mart:] "A data mart whose source data comes directly from legacy systems, rather than being sourced by a data warehouse" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"Data organized to support specific needs of a user community." (Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"An analytical database built for and used by a business unit or department to slice and dice for analytical reporting and analysis." (Andrew Pham et al, "From Business Strategy to Information Technology Roadmap", 2016)

"A focused collection of operational data that is usually confined to a specific aspect or subject of a business, such as customers, products, or suppliers. It is a more focused decision support data store than a data warehouse." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"A subset of a data warehouse that allows data to be accessed and customized by specific business functions." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"An analytical database built for and used by a business unit or department to slice and dice for analytical reporting and analysis." (Tiffany Pham et al, "From Business Strategy to Information Technology Roadmap", 2018)

"A subset of a data warehouse that contains data that is tailored and optimized for the specific reporting needs of a department or team. A data mart can be a subset of a warehouse for an entire organization, such as data that is contained in online analytical processing (OLAP) tools." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"a subset of the data warehouse used for a specific purpose. Data marts are then department-specific or related to a single line of business (LoB)."  (Francesco Corea, "An Introduction to Data: Everything You Need to Know About AI, Big Data and Data Science", 2019)

"A data structure that is optimized for access. It is designed to facilitate end-user analysis of data. It typically supports a single, analytic application used by a distinct set of workers." (The Data Warehousing Institute)

"A database, usually smaller than a data warehouse, designed to help managers make strategic decisions about their business by focusing on a specific subject or department." (Microstrategy)

"A subset of the contents of a data warehouse that tends to contain data focused at the department level, or on a specific business area." (Microsoft)

"A simple data repository that houses data of a specific discipline." (Solutions Review)

"A data mart is a curated subset of data often generated for analytics and business intelligence users. Data marts are often created as a repository of pertinent information for a subgroup of workers or a particular use case." (snowflake) [source]

"A data mart is a subject-oriented database that is often a partitioned segment of an enterprise data warehouse. The subset of data held in a data mart typically aligns with a particular business unit like sales, finance, or marketing." (Talend) [source]

"A data mart is a subset of data from an enterprise data warehouse in which the relevance is limited to a specific business unit or group of users." (Informatica) [source]

"A data mart is a subset of data stored within the overall data warehouse, for the needs of a specific team, section or department within the business enterprise. […] Data marts make it much easier for individual departments to access key data insights more quickly and helps prevent departments within the business organization from interfering with each other’s data." (Sisense) [source]

"A data mart is the access layer of a data warehouse that is used to provide users with data. Data marts are often seen as small slices of the data warehouse. Data warehouses typically house enterprise-wide data, and information stored in a data mart usually belongs to a specific department or team." (Logi Analytics) [source]

"A data mart serves the same role as a data warehouse, but it is intentionally limited in scope. It may serve one particular department or line of business." (Oracle)

"The data mart is a subject-oriented slice of the data warehouse logical model serving a narrow group of users. Many data marts only need a subset of data from the full tables in the data warehouse." (Teradata) [source]

💎SQL Reloaded: Just In CASE I (Introduction)

    One of the functions I often use in my daily work is CASE, a powerful combination of the “IIf” function and the “Select Case” conditional statement available in VB/VBA. It evaluates a list of conditions reducible to false/true and returns the value on the branch for which a match was found, allowing thus the creation of conditional-based code. There are two syntax forms in use: the Simple CASE, which can be used to evaluate the values of an attribute, and the Searched CASE, used to evaluate expressions.
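
    Here are the two syntax forms, as given in SQL Server Books Online:

-- Simple CASE
CASE input_expression
    WHEN when_expression THEN result_expression [ ...n ]
    [ ELSE else_result_expression ]
END

-- Searched CASE
CASE
    WHEN Boolean_expression THEN result_expression [ ...n ]
    [ ELSE else_result_expression ]
END
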
    A CASE can have one or more WHEN branches and at most one ELSE branch, the WHEN branches being evaluated in the order they were given, while the ELSE branch is always evaluated last. Once an expression on a WHEN branch evaluates to true, the function returns the value evaluated from its result_expression and stops the searching. If no match is found, the function evaluates the result_expression from the ELSE branch, and if the ELSE branch is missing, NULL is returned. Actually it’s simpler than it sounds.

    CASE combined with other operators and functions enables the creation of powerful transformations; many of the transformations done during ETL processing can be solved with a CASE. The following SQL script, based on the Production.Product table from the AdventureWorks database, relies heavily on the use of the CASE function in order to transform the data to a form intelligible for the users or any other consumers.
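
    A sketch of such a script – the branch values and column aliases are my own illustration of the examples discussed below:

SELECT ITM.ProductID
, ITM.Name
-- Example 1: Simple CASE translating a bit value
, CASE ITM.MakeFlag
    WHEN 1 THEN 'Y'
    ELSE 'N'
  END AS IsManufactured
-- Example 2: Searched CASE translating a bit value
, CASE
    WHEN ITM.FinishedGoodsFlag = 1 THEN 'Yes'
    ELSE 'No'
  END AS IsFinishedGood
-- Example 3: Simple CASE over multiple values of the same attribute
, CASE ITM.Class
    WHEN 'H' THEN 'High'
    WHEN 'M' THEN 'Medium'
    WHEN 'L' THEN 'Low'
    ELSE 'n/a'
  END AS ClassDescription
-- Example 4: comparison against NULL values
, CASE
    WHEN ITM.Color IS NULL THEN 'n/a'
    ELSE ITM.Color
  END AS Color
-- Example 5: IN and BETWEEN operators in the matching expressions
, CASE
    WHEN ITM.Size IN ('S', 'M', 'L', 'XL') THEN ITM.Size
    WHEN ITM.Size BETWEEN '30' AND '40' THEN 'S'
    WHEN ITM.Size BETWEEN '42' AND '50' THEN 'M'
    WHEN ITM.Size BETWEEN '52' AND '62' THEN 'L'
    ELSE 'n/a'
  END AS SizeClass
-- Example 6: splitting StandardCost into buckets
, CASE
    WHEN ITM.StandardCost <= 0 THEN 'No Cost'
    WHEN ITM.StandardCost BETWEEN 0 AND 50 THEN 'Low Cost'
    WHEN ITM.StandardCost BETWEEN 50 AND 250 THEN 'Medium Cost'
    WHEN ITM.StandardCost BETWEEN 250 AND 1000 THEN 'High Cost'
    WHEN ITM.StandardCost BETWEEN 1000 AND 10000 THEN 'Very High Cost'
    ELSE 'Exceptional Cost'
  END AS CostClass
-- Example 7: difference in days between SellEndDate and the current date
, CASE
    WHEN ITM.SellEndDate IS NULL THEN 'Active'
    WHEN DateDiff(d, ITM.SellEndDate, GetDate()) <= 365 THEN 'Phased out within a year'
    ELSE 'Phased out'
  END AS SellStatus
FROM Production.Product ITM
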
    The simplest use of the CASE function is the translation of bit values to more meaningful values, typically ‘Y’/’N’ or ‘Yes’/’No’. It’s the case of the code chunks marked as Example 1 & 2 in the above query, the first example showing the use of the Simple CASE syntax, the second one of a Searched CASE.
    The third example shows the Simple CASE syntax for multiple values of the same attribute, a branch being provided for each of the values taken by the Class attribute (‘H’, ‘M’, ‘L’) and translating them to more meaningful values (‘High’, ‘Medium’, ‘Low’), the ‘n/a’ (not available) value being returned for the other cases.
    The alternative syntax could be used too, though for such simple evaluations the previous form is easier to use:
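
    The same logic written as a Searched CASE:

, CASE
    WHEN ITM.Class = 'H' THEN 'High'
    WHEN ITM.Class = 'M' THEN 'Medium'
    WHEN ITM.Class = 'L' THEN 'Low'
    ELSE 'n/a'
  END AS ClassDescription
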
    For those knowing the IsNull or COALESCE functions, the 4th example doesn’t make much sense – a simpler writing for the same would be IsNull(Color, ‘n/a’). What the example tries to demonstrate is the comparison against NULL values using the IS NULL clause.
    In the 5th and 6th examples the matching expressions make use of the IN and BETWEEN operators; any other operators could be used in this context. In the 5th example the BETWEEN operator works with a character data type, therefore lexical comparison is used – for such cases it has to be taken into account that if Size also took the value ‘334’, it would be evaluated on the second branch and return the value ‘S’!
    More generally, the CASE function can be used to split a set of alphanumeric values or an interval of numeric values into buckets, or to evaluate specific expressions on each branch, like functions that have different formulas on each interval. If the previous examples work with sets of alphanumeric values, the 6th example uses the StandardCost to derive several Cost Classes (bucket split), the intervals considered being (-infinity, 0], (0, 50], (50, 250], (250, 1000], (1000, 10000] and (10000, infinity).
    Somebody would remark: “ok, but the values 50, 250 and 1000 are considered on more than one branch!”. Correct, but it should not be forgotten that the CASE function stops after the first expression evaluated to true, so if the StandardCost is 50 the second branch will be the last branch processed, and the respective piece of code is correct. In order to avoid any confusion we could have written the following lengthier and equivalent code:
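
    A sketch of the equivalent writing with explicit bounds:

, CASE
    WHEN ITM.StandardCost <= 0 THEN 'No Cost'
    WHEN ITM.StandardCost > 0 AND ITM.StandardCost <= 50 THEN 'Low Cost'
    WHEN ITM.StandardCost > 50 AND ITM.StandardCost <= 250 THEN 'Medium Cost'
    WHEN ITM.StandardCost > 250 AND ITM.StandardCost <= 1000 THEN 'High Cost'
    WHEN ITM.StandardCost > 1000 AND ITM.StandardCost <= 10000 THEN 'Very High Cost'
    ELSE 'Exceptional Cost'
  END AS CostClass
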
    The 7th example is another example working with intervals based on the difference in days between the SellEndDate and the current date.

About Me

Koeln, NRW, Germany
IT Professional with more than 24 years of experience in IT, covering the full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.