SQL Troubles

10 February 2010

⌛PL/SQL Reloaded: Hierarchies (The Oracle Solution)

In yesterday’s post on hierarchy-based reports I was showing, using a BOM example, how hierarchy-related problems could be approached using common table expressions introduced starting with SQL Server 2005. Oracle’s approach to hierarchical queries is based on CONNECT BY clause that allows to specify the join constraint between the table(s) used in the recursive logic and previous output. If in SQL Server the logic is split in two parts – the anchor member sub-query representing the starting point of the query, respectively the recursive member, the query on which the recursive technique is applied, in Oracle the two are combined in only one query with the help of START WITH clause. Here’s the basic structure for building a hierarchy based on BillOfMaterials table coming with AdventureWorks.

The START WITH ASS.productassemblyid IS NULL and TRUNC(NVL(ass.enddate, SYSDATE))>=TRUNC(SYSDATE) constraints function as initialization for the recursive logic, being ignore in next iterations, while the CONNECT BY PRIOR ass.componentid = ass.productassemblyid and TRUNC(NVL(ass.enddate, SYSDATE))>=TRUNC(SYSDATE) constraints function as join constraints between the prior result set and current logic. Maybe somebody is asking why the ‘Active BOM’ constraint if specified in both START WITH, respectively CONNECT BY clauses and not in the WHERE clause? If we are interested only in the Active BOMs then the corresponding constraint must be added to the initiating clause, and is added to the CONNECT BY clause because we need to assure that we take only the active children. If we move the ‘Active BOM’ constrain from the CONNECT BY to WHERE clause then most probably we’ll get a totally different number of records, respectively 9525 vs. 8730 how much the above query returns. The easiest way to check whether the logic is correct is to do include the Activer ecords in an inline view and construct with it the BOM.

As the above query returns 8730 records it seems that the above logic is correct.

In order to match the ‘CTE-Based BOM structure’ query from the previous post, the Oracle query needs several further changes, namely to construct the Path, add the top-most ComponentID, calculate the TotalQty, and add the Product details. The Path can be constructed using the SYS_CONNECT_BY_PATH function using as parameters the attribute needed to be concatenated (e.g. componentid) and the delimiter (e.g. ‘\’). The top-most ComponentID can be retrieved using the CONNECT_BY_ROOT operator. Unfortunately Oracle doesn’t provides a straight way to calculate the TotalQty, though there is an alternative, using again the SYS_CONNECT_BY_PATH function with the PerAssemblyQty attribute and multiplication sign ‘*’ as parameters, this smart trick allowing to evaluate the final expression (e.g. ‘1*1*2*5’ will be evaluated to 10); because the extract is typically saved to excel, the final concatenation could be written as an Excel formula expression (e.g. ‘=1*1*2*5’). Here is the final query:

The specification of an ORDER BY is not always required because typically the output of a hierarchical query in Oracle is sorted as if it were used the depth-first search, though if I remember correctly there are exceptions too.

Sometimes it’s useful to create a concatenation of Products in order to see the actual structure of the BOM (e.g. ‘\BK-R93R-62\BB-9108\BE-2349\BA-8327’) or to use to use incremental spaces instead (see. IndentPath), for this needing to bring the Product join for ComponentID into the recursive logic.

By comparing the SQL Server vs. Oracle implementation for hierarchical queries, I would say that Oracle’s implementation is much simpler to use from the creational point of view with a trade off in functionality, several functions being required in order to extract information from top-most record or to concatenate attributes. On the other side SQL Server provides a more elegant solution that allows more functional flexibility in the detriment of code duplication – the anchor vs. the recursive member.

Note:
In case you need to load the Product and BillOfMaterials tables in Oracle, you could do this by using the ‘SQL Server Import and Export Wizard’, for the steps see the SQL Server to Oracle Data Export – Second Magic Class tutorial.

09 February 2010

🕋Data Warehousing: Role-playing Dimension (Definitions)

"A dimension for which multiple foreign key columns exist in a single fact table to represent different contexts for the same dimension. For example, time is a common role-playing dimension with each date in the fact table associated with a different event, such as order date, ship date, and delivery date." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"Instances of a dimension that legitimately has more than one value for a given business transaction, such as order date and shipped date. Each attribute with the dimension is uniquely identified to enable easy differentiation between the different roles, such as Order Date, Order Quarter and Shipped Date, and Shipped Quarter." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"A single database dimension joined to the fact table on a different foreign keys to produce multiple cube dimensions." (Microsoft, "SQL Server 2012 Glossary", 2012)

"The situation where a single physical dimension table appears several times in a single fact table. Each of the dimension roles is represented as a separate logical table with unique column names through views." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

🕋Data Warehousing: Dimensional Model (Definitions)

"A dimensional model is a form of data modeling that packages data according to specific business queries and processes. The goals are business user understandability and multidimensional query performance." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A data model influenced by performance considerations and optimized for data access and retrieval. This is a form of entity-relationship (ER) modeling that’s limited to working with “measurement” types of subjects and has a prescribed set of allowable structures." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A dimensional model contains a central fact table and a set of surrounding dimension tables, each corresponding to one of the components or dimensions of the fact table." (James Yao et al, "Development and Design Methodologies in DWM", 2008)

"A data model organized for the purpose of user understandability and high performance. In a relational database, a dimensional model is a star join schema characterized by a central fact table with a multi-part key. The components of this key are joined to a set of dimension tables, each defined by its own primary keys. In a multi-dimensional database, a dimensional model is the database cube." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"A data model that represents data in a star-like structure of only one-to-many relationships, where each entity has either all relationships having the ‘one’ side or the ‘many’ side." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A specialized type of physical data model particular to a retrieval-only database design, commonly used in Data Warehouses and data marts, where de-normalized fact tables are linked to dimension tables. Star schemas and snowflake schemas are examples of dimensional models." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

💎SQL Reloaded: Ways of Looking at Data IV (Hierarchies)

Introduction

There is a special type of entities that when related to each other, the relations put together form a hierarchy. It’s the case of assemblies and the components they are made of, a component could be at its turn an assembly formed from smaller components, resulting thus a structure known as Bill of Materials (BoM). Take for example a laptop made of parts like the keyboard, modem, hard disk, CD-ROM, microprocessor, volatile memory, graphic card, display, to name just a few of the main components, each of these parts are at their turn are made of smaller components that at their turn could be made from smaller components, and so on. Another popular example is the one of employees in an organization, following the line of command from CEO to the lowest person as importance, a person could be chief for more employees and have only one chief, resulting thus another hierarchical structure.

Most of the hierarchies I met were having maximum 7 levels, where a level represents a parent-child relation, typically there is no need for higher hierarchies. In a RDBMS such relations are represented with self-referencing tables or self-referencing structures in case more than one table is used to represent the relation. The problem with such structures is that in order to get the whole hierarchy a join must be created for each relation, though as the number of levels could vary, such queries can become quite complex. Therefore RDBMS came with their own solutions in order to work with hierarchies

There are several types of problems specific to hierarchies: 1. Building the hierarchy behind of an item in the structure
2. Identifying the top-most parent of an item in the structure
3. Identifying the top-most parent of an item in the structure and building the hierarchy behind it – a combination of the previous mentioned problems

Building the hierarchy

SQL Server 2005 introduced the common table expressions (CTE), a powerful feature that allows building whole hierarchies in a simplistic manner. From a structural point of view a CTE has two elements – the anchor member defining the starting point of the query, respectively the recursive member that builds upon the anchor member. The starting point of the query in this case is the entity or entities for which the structure is built, and here there are 2 situations – we are looking for all topmost entities or for the entities matching a certain criteria, typically looking for one or more Product Numbers. The difference between the two scopes could lead to different approaches – whether to include the Product information in the CTE or to use the CTE only to built the hierarchical structure, and add only in the end the production information. Before things would become fuzzier let’s start with an example, here are the two approaches for defining the anchor member, the first query identifying the top-most assemblies, while the second searches for the components in the BOM that matches a criteria (e.g. the Product’s name starts with BK-R93R).

So we have the starting point, how about the recursive member? Without using a CTE, normally we would have to join either of the two queries again with the BillofMaterials table in order to retrieve the components for the Products already retrieved, then the final result must be joined again with the BillofMaterials, and repeat this step over and over again until a predefined number of iterations are met, or until no components are retrieved anymore. Actually that’s what a CTE does too, and in order to make things simpler, a CTE has a name that it’s reused in the CTE in order to handle the recursive part, and outside of the CTE in order to use the final result. Supposing that CTE’s name is BOM here is the recursive member:

The anchor and recursive member are merged in the CTE with the help of a UNION therefore the number of attributes must be the same and their data type must be compatible. Here is the final query:

    The TotalQty attribute has been introduced in order to show the cumulated quantity for each component, for example if the Assembly has the PerAssemblyQty 2 and the Component has the PerAssemblyQty 10 then the actual total quantity needed is 2*10.
    The Path attribute has been introduced in order to sort the hierarchy lexically based on the sequence formed from the ComponentID, it could have been used instead the Product Number.

Identifying the top-most parent

    The CTE can be used also for identifying the top-most parent of a given Product or set of Items, the second anchor query member could be used for this, while the recursive member would involve the reverse join between BOM and BillOfMaterials. Here is the final query:

Given the fact that the final CTE output includes the output from each iteration when only the data from the last iteration are needed, this might not be the best solution. I would expect that the simulation of recursion with the help of a temporary table would scale better in this case.

Another interesting fact to note is that in this case more assemblies are found to match the criteria of top-most item, meaning that the BOMs are not partitioned – two BOMs could have common elements.

Happy coding!

Previous Post <<||>> Next Post

07 February 2010

🎡⌛SSIS: SQL Server to Oracle Data Export (Second Magic Class)

In a previous post I described the steps needed to be followed in order to export the data from SQL Server to Excel, this time I will describe the steps needed in order to export the data from SQL Server to Oracle. For this is requested to have Oracle database product installed on your computer, or in case you are having already an Oracle database server on the network then you’ll need only the Oracle Client installed locally, and a SQL Development Tool like SQL Developer or TOAD, with a plus for the later.

Like in the previous tutorial we’ll use the Production.Product table from AdventureWorks database coming with SQL Server.

Step 1: Create an Oracle User

If you already have an Oracle User created then you could skip this step, unless you want to create a User only for the current tutorial or for loading other AdventureWorks tables in Oracle. For this we will create first a permanent tablespace (e.g. AdventureWorks), an allocation of space in the database, then create the actual User (e.g. SQLServer) on the just created tablespace, action the will create a schema with the same name as the User, and finally grant the User ‘create session’, ‘create materialized view’ and ‘create table’ privileges.

Open your SQL Developer tool of choice and connect to the Oracle database, then type the following statements and run them one by one, do not forget to provide a strong-typed password in IDENTIFIED BY clause, then create a new connection using the just created User.

-- Create TableSpace
CREATE BIGFILE TABLESPACE AdventureWorks
DATAFILE 'AdventureWorks.dat'
SIZE 20M AUTOEXTEND ON; 

-- Create User CREATE USER SQLServer 
IDENTIFIED BY <your_password>
DEFAULT TABLESPACE AdventureWorks 
QUOTA 20M ON AdventureWorks; 

-- Grant privileges
GRANT CREATE SESSION
, CREATE MATERIALIZED VIEW
, CREATE TABLE TO SQLServer;

Step 2: Start SQL Server Import and Export Wizard

From SQL Server Management Studio choose the database from which you want to export the data (e.g. AdventureWorks), right click on it and from the floating menu choose Tasks/Export Data

Step 3: Choose a Data Source

In ‘Choose a Data Source’ step select the ‘Data Source’, SQL Server Native Client 10.0 for exporting data from SQL Server, choose ‘Server name’ from the list of SQL Server available, select the Authentication mode and the Database (e. g. AdventureWorks), then proceed to the next step by clicking ‘Next’.

Step 4: Choose a Destination

In ‘Choose a Destination’ step select the Destination, in this case ‘Oracle Provider for OLE DB’ and then click on ‘Properties’ in order to provide the connectivity details, in ‘Data Link Properties’ dialog just opened, enter the Data Source (the SID of your Oracle database), the ‘User name’ (e.g. SQLServer), the ‘Password’ and check ‘Allow saving password’ checkbox, without this last step being not possible to connect to the Oracle database. Then test the connection by clicking the ‘Test Connection’, and if the ‘Test connection succeeded’ proceed to the next step.

Note:
Excepting ‘Oracle Provider for OLE DB’ there are three other drivers that allows you to connect to an Oracle database: ‘Microsoft OLE DB Provider for Oracle’, ‘.Net Framework Data Provider for Oracle’, respectively ‘Oracle Data Provider for .Net’, each of them coming with their own downsizes and benefits.

Step 5: Specify Table Copy Or Query

In this step choose: ‘Copy data from one or more tables or views’ option.

Step 6: Select Source Tables and Views

In ‘Select Source Tables and Views’ step select the database objects (e.g. Production.vProducts) from which you’ll export the data. You could go with the provided Destination table, though if you are using the current settings Oracle table’s name will be “Product” (including quotes) and not Product as we’d expected. There are two options, go with the current settings and change table’s name and columns, they having also this problem, in Oracle, or create the table manually in Oracle. I prefer the second option because I could use the SQL script automatically generated by Oracle, for this with the just chosen table selected click on ‘Edit Mappings’ button that will bring the ‘Column Mappings’ dialog and click the ‘Edit SQL’ button, the query appearing as editable in ‘Create Table SQL Statement’ dialog.

Copy the query and copy paste it in the Oracle SQL Developer, then do a replace of the quotes and delete the rowguid column as we don’t needed, in plus when attempting to dump the data, given the incompatibility of data types, an error will be raised. There is a second change we’ll have to make, to rename the Size column to ProductSize, this because Size is a reserved keyword.

CREATE TABLE sqlserver.Product (
ProductID INTEGER NOT NULL,
Name NVARCHAR2(50) NOT NULL,
ProductNumber NVARCHAR2(25) NOT NULL,
MakeFlag NUMBER NOT NULL,
FinishedGoodsFlag NUMBER NOT NULL,
Color NVARCHAR2(15),
SafetyStockLevel INTEGER NOT NULL,
ReorderPoint INTEGER NOT NULL,
StandardCost NUMBER NOT NULL,
ListPrice NUMBER NOT NULL,
ProductSize NVARCHAR2(5),
SizeUnitMeasureCode NCHAR(3),
WeightUnitMeasureCode NCHAR(3),
Weight NUMBER,
DaysToManufacture INTEGER NOT NULL,
ProductLine NCHAR(2),
Class NCHAR(2),
Style NCHAR(2),
ProductSubcategoryID INTEGER,
ProductModelID INTEGER,
SellStartDate TIMESTAMP NOT NULL,
SellEndDate TIMESTAMP,
DiscontinuedDate TIMESTAMP,
ModifiedDate TIMESTAMP NOT NULL)

Unfortunately this approach has a downside, we have to go two steps back in Choose a Destination, then click on ‘Properties’ again and without doing any change test the connection by clicking the ‘Test Connection’, then proceed forward to the current step, and in ‘Select Source Tables and Views’ step select again the database object (e.g. Production.vProducts) from which you’ll export the data but this time in Destination dropdown control search for the table just created (e.g. sqlserver.Product).

Because we have removed the rowguid column from the table created in Oracle, we’ll have to change the mappings, for this click on ‘Edit Mappings’ and in ‘Mapping’ list box from ‘Column Mappings’, in the Destination dropdown next to rowguid column select ‘’, thus the respective column being ignored from the mapping. Accept the change and proceed to the next step.

Note:
You could choose to export more than one table, though for this you’ll have to create manually each table in Oracle and then map the created tables in ‘Select Source Tables and Views’ step.

Step 7: Review Data Type Mapping

Step 8: Save and Run Package

In ‘Save and Run Package’ check the ‘Run immediately’, respectively the ‘Save SSIS Package’ and ‘File System’ option, then proceed to the next step by clicking ‘Next’.

Step 9: Save SSIS Package

In ‘Save SSIS Package’ step provide the intended Name (e.g. Export Products To Oracle) or Description of the Package, choose the location where the package will be saved, then proceed to the next step by clicking ‘Next’.

Step 10: Complete the Wizard

Step 11: Executing the Package

Step 12: Checking the Data

If the previous step completed successfully, then you can copy the following statement in Oracle SQL Developer and execute it, the table should be already populated.

SELECT * 
FROM sqlserver.Product

Note:
If you used other schema to connect to, then you’ll have to replace the above schema (e.g. SQLServer) with your schema.

🗄️Data Management: The Data-Driven Enterprise (Part I: Thoughts on a White Paper)

Data Management Series

I read today ‘The Data-Driven Enterprise’ White Paper from Informatica, quite useful paper, especially when it comes from one of the leaders in integration software and services. In this paper the term data-driven enterprise refers to the organizations that are “able to take advantage of their data assets to work faster, better and smarter” [1], in order to achieve this state of art being necessary to” invest in the people, processes and technology needed to know where the data resides, to understand it, to clean it and keep it clean, and to get it to where it is needed, when and how it is needed” [1].

It seems that the data-driven enterprise, same as data-driven corporation [2], is just an alternative term for the data-driven organization concept already in use since several good years. Following the DIKW pyramid a data-driven organization follow a four stage evolution from data, to information and further to knowledge and wisdom, of importance being especially how knowledge is derived from data and information, the organizations capable of creating, managing and putting knowledge into use being known as knowledge-based organizations. It’s interesting that the paper makes no direct reference to knowledge and information, focusing on data as asset and possible ignoring information respectively knowledge as asset. I think it would help if the concepts from this paper would have been anchored also within these two contexts.

The paper touches several important aspects related to Data Management, approaching concepts like “value of data”, “data quality”, “data integration”, “business involvement”, “data trust”, “relevant data”, “timely data” “virtualized access”, “compliant reporting”, “Business-IT collaboration”, highlighting the importance of having adequate processes, infrastructure and culture in order to bring more value for the business. I totally agree with the importance of these concepts though I think that there are many other aspects that need to be considered. With such concepts almost all vendors juggle, though what’s often missing is the knowledge/wisdom and method to put philosophies and technologies into use, to redesign an organization’s infrastructure and culture so it could bring the optimum benefit.

Since the appearance of data warehouses concepts, the efficient integration of the various data islands existing within and outside of an organization become a Holy Grail for IT vendors and organizations, though given the fast pace with which new technologies appear this hunt looks more like a Morgan le Fey in the desert. Informatica builds a strong case for data integration in general and for Informatica 9 in particular, their new infrastructure platform targeting to enable organizations to become data-driven by providing a centralized architecture for enforcing data policy and addressing issues like data timeliness, format, semantics, privacy and quality[3]. On the other side the grounds on which Informatica builds its launching strategy could be contra-argumented considering the grey zone they were placed in.

Quantifying Value of Data

How many of the organizations could say that they could quantify (easily) the real value of their data when there is no market value they could be benchmarked against? I would say that data have only a potential value that could increase only with its use, once you learned to explore the data, find patterns and new uses for the data, derive knowledge out of it and use it wisely in order to derive profit and a competitive advantage, and it might take years to arrive there.

People who witnessed big IT projects like ERP/CRM implementations or data warehousing have seen how their initial expectations were hardly met, how much are they willing to invest in an initiative that could prove its value maybe only years later, especially when there are still many organizations fighting the crisis? How could they create a business case for such a project? How much could they rely on the numbers advanced by vendors and by the nice slogans behind their tools just good for selling a product?

Taking a quote from the video presentation of Sohaib Abbasi, Chairman and CEO at Informatica, “70% of all current SOA initiatives will be restarted or simply abandoned (Gartner)” [3], and I would bet that many such projects are targeting to integrate the various systems existing in an organization. Once you had several bad such experiences, how much are you willing to invest in a new one?

There are costs that can be quantified, like the number of hours employees spent on maintaining the duplicate data, correcting the issues driven by bad data quality, or more general the costs related to waste, and there are costs that can’t be quantified so easily, like the costs associated with bad decisions or lost opportunities driven by missing data or inadequate reflection of reality. There is another aspect, even if organizations reach to quantify such costs, without having some transparency on how they arrived to the respective numbers it felts like somebody just took out some numbers from a magician’s hat. It would be great if the quantification of such costs is somehow standardized, though that’s difficult to do given the fact that each organization approaches Data Management from its own perspective and requirements.

From Data to Meaning

Reports are used only to aggregate, analyze and navigate data, while it’s in Users attribution to give adequate meaning to the data, and together with the data analyst to find the who, how, when, where, what, why, which and by what means, in a word to understand the factors that impact the business positively/negatively, the correlation between them and how they can be strengthened/mitigated in order to achieve better quality/outcomes.

People want nice charts and metrics that can give them a birds-eye view of the current state, though the aggregated data could easily hide the reality because of the quality of the data, quality of the reports itself, the degree to which they cover the reality. Part of the data-driven philosophy resume in understanding the data, and reacting to data. I met people who were ignoring the data, preferring to take wild guesses, sometimes they were right, other times they were wrong.

From Functionality to Usability

There are Users who once they have a tool they want to find all about its capabilities, play with the tool, find other uses and they could even come with nice to have features. There are also Users who don’t want to bother in getting the data by themselves, they just want the data timely and in the format they need them. The fact that Informatica allows Users to analyze the data by themselves it’s quite of a deal, though as I already stressed in a previous post, you can’t expect from a User to become a data expert overnight, there are even developers that have difficulties in handling complex data analysis requirements.

The guys from Informatica tried to make simple this aspect in their presentation though it’s not as simple as it seems, especially when dealing with ERP systems like Oracle or SAP that have hundreds of tables, each of them with a bunch of attributes and complex relations, one of the important challenges for developers is to reengineer the logic implemented in such systems. It’s a whole mindset that needs to be developed, there are also best practices that needs to be considered, specific techniques that allow getting the data in the most efficient way.

Allowing users to decide which logic to apply in their reports could prove to be a two edged sword, organizations risking ending up with multiple versions of the same story. It’s needed to align the various reports, bring users on the same page from the point of view of expectations and constraints. On the other side some Users prefer to prepare the data by themselves because they know the issues existing in the data or because they have more flexibility in making the data to look positive.

Trust, Relevance and Timeliness

An important part of Informatica’s strategy is based on data trust, relevancy and timeliness, three important but hard to quantify dimensions of Data Quality. Trust is often correlated with Users’ perception over the overall Data Quality, the degree to which the aggregated data presented in reports can be backed up with detailed data to support them, the visibility they have on the business rules and transformations used. If the Users can get a feeling of the data with click-through, drilldown or drill-through reports, if the business rules and transformations are documented, then most probably that data trust won’t be an issue anymore. Data relevancy and data timeliness are heavily requirement-dependent, for some Users being enough to work with one week old data while others need live data. In a greater or less degree, all data used by the business are relevant otherwise I don’t see why maintaining them.

Software Tools as Enablers

Sometimes being aware that there is a problem and doing something to fix it already brings an amount of value to the business, and this without investing in complex technologies but handling things methodologically and enforcing some management practices – identifying, assessing, addressing, monitoring and controlling issues. I bet this alone could bring a benefit for an organization, and everything starts just by recognizing that there is a problem and doing something to fix the root causes. On the other side software technologies could enable performing the various tasks more efficient and effective, with better quality, less resources, in less time and eventually with lowers costs. Now what’s the value of the saving based on addressing the issue and what’s the value of saving by using a software technology in particular?!

Software tools like Informatica are just enablers, they don’t guarantee results and don’t eliminate barriers unless people know how to use them and make most of it. For this are needed experts that know the business, the various software tools involved, and good experienced managers to bring such projects on the right track. When the objectives are not met or the final solution doesn’t satisfies all requirements, then people reach to develop alternative solutions, which I categorize as personal solutions – spreadsheets, MS Access applications, an organization ending up with such islands of duplicated data/logic. Often Users need to make use of such solutions in order to understand their data, and this is an area in which Informatica could easily gain adepts.

Business-IT collaboration

There is no news that the IT/IM and other functional departments don’t function as partners, IT initiatives not being adequately supported by the business, while in many IT technology-related initiatives driven by the business at corporate level the IT department is involved only as executor and has little to say in the decision of using one technology or another, many of such initiatives ignoring aspects specific to IT – usability of such a solution, integration with other solutions, nuances of internal architecture and infrastructure. Of course that phrases like “business struggling in working with IT” appear when IT and the business function as separate entities with a minimum of communication, when the various strategies are not aligned as they are supposed to.

If you’re not informing the IT department on the expectations, and vice-versa, each department will reach to address issues as they appear and not proactively, so there will be no wonder when it takes weeks or months until a solution is provided. The responsiveness of IT is strongly correlated with the resources, the existing infrastructure and policies in place. In addition for the IT to do its work the business has to share the necessary business knowledge, how can you expect to address issues when even the business is not able to articulate adequately the requirements – in many cases the business figures out what they want only when a first solution/prototype is provided. It’s an iterative process, and many people ignore this aspect.

No matter of the slogans and the concepts the vendors juggle with, I’m sorry, but I can’t believe that there is one tool that matches all requirements, that provides a fully integrated solution, that the tool itself is sufficient for eliminating the language and collaboration barriers between the business and IT!

Human Resources & Co.

Many organizations don’t have in-house the human resources needed for the various projects related to Data Management, therefore bringing consultants or outsourcing parts of the projects. A consultant needs time in order to understand the processes existing in an organization, organization’s particularities. Even if business analysts reach to augment the requirements in solid specifications, it’s difficult to cover all the aspects without having a deep knowledge about the architecture used, same as for consultants it’s difficult to put the pieces of the puzzle together especially when more of the pieces are missing. The consultants expect in general to have all the pieces of the puzzles, while the other sides expect consultants to identify the missing pieces.

When outsourcing tasks (e.g. data analysis) or data-related infrastructure (e.g. data warehouses, data marts) an organization risks to lose control over what’s happening, the communication issues being reflected in longer cycle times for issues’ resolution, making everything to become a challenge. There are many other issues related to outsourcing that maybe deserve to be addressed in detail.

The Lack of Vision, Policy and Strategy

An organization needs to have a vision, policy and strategy toward data quality in particular and Data Management in general, in order to plan, enforce and coordinate the overall effort toward quality. Their lack can have unpredictable impact on information systems and reporting infrastructure in particular and on the business as a whole, without it data quality initiatives can have local and narrow scope, without the expected effectiveness, resulting in rework and failure stories. The syntagma “it’s better to prevent than to cure” reliefs the best the philosophy on which Data Management should be centered.

Lack of Ownership

In the context of the lack of policy and strategy can be put also the lack of ownership, though given its importance it deserves a special attention. The syntagma “each employee is responsible for quality” applies to data quality too, each user and department need to take the ownership over the data they have to maintain, for their own or others’ departments scope, same as they have to take the ownership over the reports that make scope of their work, assure their quality and the afferent documentation, over the explicit and implicit islands of knowledge existing.

References:
[1] Informatica. (2009). The Data-Driven Enterprise. [Online] Available from: http://www.informatica.com/downloads/7060_data_driven_wp_web.pdf (Accessed: 6 February 2010).
[2] Herzler. (2006). Eight Aspects of the Data Driven Corporation – Exploring your Gap to Entitlement. [Online] Available from: http://www.hertzler.com/php/portfolio/white.paper.detail.php?article=31 (Accessed: 6 February 2010).
[3] Informatica. (2009). Informatica 9: Infrastructure Platform for the Data-Driven Enterprise, Speaker: Sohaib Abbasi, Chairman and CEO. [Online] Available from: http://www.informatica.com/9/thelibrary.html#page=page-5 (Accessed: 6 February 2010).

05 February 2010

🎡SSIS: Wizarding an SSIS Package (First Magic Class)

I'm sorry to disappoint you, this it’s not another story about Harry Potter! It’s about how to create an SSIS package without any knowledge about SSIS or the components of an SSIS Package with the help of ‘SQL Server Import and Export Wizard’ available from the SQL Server 2008 Management Studio. Here are the steps needed to be followed in order to create a Package that exports the data from a local table to an Excel file.

Step 1: Start SQL Server Import and Export Wizard

From SQL Server Management Studio choose the database from which you want to export the data (e.g. AdventureWorks), right click on it and from the floating menu choose Tasks/Export Data. This action will bring the ‘SQL Server Import and Export Wizard’ used, as its name indicates, for importing and exporting data on the fly from/to SQL Server or any other source.

Step 2: Choose a Data Source

In ‘Choose a Data Source’ step select the ‘Data Source’, SQL Server Native Client 10.0 for exporting data from SQL Server, choose ‘Server name’ from the list of SQL Server available, select the Authentication mode and the Database (e. g. AdventureWorks), then proceed to the next step by clicking ‘Next’.

Step 3: Choose a Destination

In ‘Choose a Destination’ step select the Destination, in this case ‘Microsoft Excel’, browse for the Excel file to which you want to export the data (e.g. Products.xls) and check ‘First row has column names’ in case you want to include the column names, then proceed to the next step by clicking ‘Next’.

Step 4: Specify Table Copy or Query

In ‘Specify Table Copy or Query’ step there are two options you could choose: ‘Copy data from one or more tables or views’ or ‘Write a query to specify the data to transfer’ – their names are speaking for itself. The second option allows for more flexibility and you could just copy paste the query used for your expert. In this example just go with the first option and then proceed to the next step by clicking ‘Next’.

Step 5: Select Source Tables and Views

In ‘Select Source Tables and Views’ step select the database objects (e.g. Production.vProducts) from which you’ll export the data, then proceed to the next step by clicking ‘Next’. If the destination allows it, it’s possible to choose more than one database object.

Step 6: Review Data Type Mapping

In ‘Review Data Type Mapping’ typically you could go with the provided defaults thus you can proceed to the next step by clicking ‘Next’.

Step 7: Save and Run Package

In ‘Save and Run Package’ step you could choose to ‘Run immediately’ the package and/or ‘Save SSIS Package’ to the ‘SQL Server’ or locally to the ‘File System’. Saving the package locally allows you to modify and rerun the package at a later date. For this example check the ‘Run immediately’, respectively the ‘Save SSIS Package’ and ‘File System’ option, then proceed to the next step by clicking ‘Next’.

Step 8: Save SSIS Package

In ‘Save SSIS Package’ step provide the intended Name (e.g. Export Products) or Description of the Package, choose the location where the package will be saved, then proceed to the next step by clicking ‘Next’.

Step 9: Complete the Wizard

The ‘Complete the Wizard’ step allows reviewing the choices made in the previous steps and eventually navigating to the previous steps in order and to the needful changes. Therefore once you proved the details proceed to the next step by clicking ‘Next’.

Step 10: Executing the Package

In ‘Executing the Package’ step the Package is run, the progress being shown as each step. Just close the Package once you reviewed the steps.

If the package run successfully you can go on and check the exported data, apply additional formatting, etc.

Step 11: Inspecting the Package

The package can be reopened and modified in Microsoft Visual Studio or SQL Server Business Intelligence Development Studio, for this just open one of the two environments and from the Main Menu select File/Open File, then browse for the location where the Package (e.g. Export Products.dtsx) was saved and open the file. Here is the package created by the Wizard:

As can be seen the package contains two tasks, the ‘Preparation SQL Task 1’ which creates the table (e.g. vProducts) in the Excel file, and ‘Data Flow Task 1’ which dumps the data in the created table. By double clicking the ‘Data Flow Task 1’ you could see its content: - the ‘Source - vProducts’ OLEDB Data Source holding the connectivity information to the SQL Server and the list of Columns in scope. - the ‘Data Conversion’ Transformation that allows converting the data between Source and Destination given the fact that Excel has different data types than SQL Server - the ‘Destination - vProducts’ Excel Destination holding the connectivity information for the Excel file and the Mappings in place.

You could explore the properties of each object in order to learn more about its attributes used and the values they took.

Ok, so you’ve open the package, how do you run it again? For this is enough to double click on the locally saved Package (e.g. Export Products.dtsx), action that will bring the ‘Execute Package Utility’, and click on the ‘Execute’ button to run the Package.

If you haven’t done any changes to the Excel file to which you moved the data (e.g. Products.xls) and haven’t moved the file, then the package will fail when it attempts to run the ‘the ‘Preparation SQL Task 1’ task because a table with the expected name already exists in the file.

In order to avoid this error to appear you’ll have to delete from the Excel file the sheet (e.g. vProducts) where the data were dumped and save the file. Now you could execute again the package and it should execute without additional issues.

If you’ve removed/deleted the Excel file, even if you get a warning that the Excel file is not available, the package will run in the end without problems. Instead of deleting the sheet manually you could use the File System Task to move the file to other location, though that’s a topic for another post.

🕋Data Warehousing: Star Schema (Definitions)

"A relational database structure in which data is maintained in a single fact table at the center of the schema with additional dimension data stored in dimension tables. Each dimension table is directly related to the fact table by a key column." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A star schema is a dimensional data model implemented on a relational database." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A star schema is a set of tables comprised of a single, central fact table surrounded by dimension tables. Each dimension is represented by a single dimension table. Star schemas implement dimensional data structures with denormalized dimensions. Snowflake schemas are an alternative to a star schema design." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A single fact table surrounded by a single hierarchical layer of dimensional tables, in a data warehouse database." (Gavin Powell, "Beginning Database Design", 2006)

"A single fact table which joins to many dimension tables, each of which is a single denormalized dimension table." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"The instantiation of a dimensional model in a relational database. A star schema consists of a fact table and the dimension tables that it references. The fact table contains facts and foreign keys; the dimension tables contain dimensional attributes by which the facts will be filtered, rolled up, or grouped." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"The implementation of a dimensional model in a relational database. The tables are organized around a single central fact table possessing a multi-part key, and each surrounding dimension table has its own primary key." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"The basic form of data organization for a data warehouse, consisting of a single large fact table and many smaller dimension tables." (Toby J Teorey, ", Database Modeling and Design 4th Ed", 2010)

"The arrangement of the collection of fact and dimension tables in the dimensional data model, resembling a star formation, with the fact table placed in the middle surrounded by the dimension tables. Each dimension table is in a one-to-many relationship with the fact table." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"The basic form of data organization for a data warehouse, consisting of a single large fact table and many smaller dimension tables." (Toby J Teorey, ", Database Modeling and Design" 4th Ed., 2010)

"A common form of a dimensional data model, where a fact table is directly linked by foreign keys to several dimension tables." (Craig S Mullins, "Database Administration", 2012)

"A relational database structure in which data is maintained in a single fact table at the center of the schema with additional dimension data stored in dimension tables. Each dimension table is directly related to and usually joined to the fact table by a key column." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A relational schema whose design represents a dimensional data model. The star schema consists of one or more fact tables and one or more dimension tables that are related through foreign keys." (Oracle, "Database SQL Tuning Guide Glossary", 2013)

"A type of relational database schema that is composed of a set of tables comprising a single, central fact table surrounded by dimension tables. See also dimension table, star join." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

💎SQL Reloaded: Just in CASE IV (Other Scenarios)

Two days ago I was looking on the Web on random posts to see whether I missed something important related to CASE function and I stumble over an old post of Craig S. Mullins, in one of the examples (See SQL Statement #5) there is a query similar with the one below:

As can be seen on more than one branch of a CASE there is a correlated sub-query based on the same table, the aggregated value being used for further calculations. Most probably the respective query had only the purpose to demonstrate a technique, though I’m having mainly two observations related to it:
1. The query returns a label of whether the Stock is High, Normal, Low or there is no Stock, though I have no measure of what the Stock actually is, thus in a first phase I can’t validate whether the query shows the correct data. In case a report is based on the respective query this reduces considerably also report's usability.
2. A select is replicated across several CASE branches fact that I find it complicates query’s complexity reflected in query’s understanding and maintainability, while considering that the logic needs to be kept in synchronization increases the chances for making a mistake. At least in this situation, being involved only one table, the logic is quite simple though what do you do when the correlated sub-query is more complex?

How about encapsulating query’s logic in a sub-query and join it to the main tables using a left join? This approach would address both issues mentioned above:

    When creating a query there is always an important facts that needs to be considered - query’s performance. The second query is simpler and in theory it should be easier to process, expecting to have at least similar performance as the first version. In this case the ProductID is a foreign key in Production.ProductInventory, the search performed on the respective table having minimum impact on performance. If no index is available on the searched attribute, more likely first query’s performance will decrease considerably. The best approach for mitigating the performance differences between the two queries is to look at Client Statistics and Execution Plan. In what concerns the Client Statistics both queries are having similar performance, while the Execution Plan of the second query excepting the fact that is simpler and, without going into details, it seems the second plan is better.

    Maybe it makes sense to use correlated sub-queries in a CASE only when the correlated sub-queries have distinct logic, though even then I would recommend using a left join instead, this technique allowing more flexibility being possible to show the actual values in the query and even reuse them if needed in additional calculations.

    Talking about correlated queries, I found cases in which two distinct correlated sub-queries with the same logic were used to pull two distinct attributes available in the same table, something like in the below example.

What if the correlated sub-queries are used again with the CASE function like in the below example?

Even if maybe the number of calls to the correlated sub-queries is reduced by using a CASE, I would recommend using a left join instead, this technique offering more flexibility, the logic becoming also much easier to maintain and debug.

04 February 2010

💎SQL Reloaded: Just In CASE III (Quest for Performance)

Arranging branches

Given the fact that the CASE function stops after the first match, in case there are no two expressions that evaluate to true for any same value, the CASE could be optimized for the best performance by arranging the branches in the order of the highest number of matches. For example taking the logic from the 3rd example from Part I and checking the number of occurrences for each value, surprisingly (or maybe not) there are 257 NULL values, 97 ‘L’ values, 82 ‘H’ values, respectively 68 ‘M’ values. If we change the branches in this order, the number of evaluations will decrease from 82*1+68*2+97*3+257*3=1280 to 257*1+97*2+82*3+68*3=901, the difference is small though when applied to big data sets the improvement could pay the effort. Here is the modified logic for Example 3:

This technique is unfortunately not bulletproof – even if less probable in many cases, the order of the values’ occurrences could change overnight, the number of total evaluations varying in time on the number of occurrences for each value.

Rewriting Simple Unions

Have you ever seen a set of similar queries whose output is merged with the help of UNION (ALL) operator? I’ve seen so many queries like the below one, in which the only difference between the sub-queries was residing in the WHERE constraint(s) applied and maybe a few additional calculations!

The example is quite simple, though I think it demonstrates the point. The UNION could be translated to a simple CASE as in the 3rd example. The problem with such queries is that if there is no index on the attribute(s) used is the WHERE constraint, it will be performed a full table scan for each sub-query, quite expensive from a performance standpoint. Even if the respective attribute is indexed, there’s still a pay in performance. Such an approach could be maybe acceptable when in each scenario different tables are joined to the main table, though even then should be checked which one of the two approaches has better performance. A recent example I can recall and in which could have been used several CASE functions instead of the UNION was based on the JOIN of two tables, the query looking something like:

Rewriting Complex Unions

Even if when the base table is used in several union-based merged queries, it doesn’t sense to apply this technique all the times, especially when dealing with complex queries using multiple joins. Now it depends, there are scenarios and scenarios, is must be always considered the trade in performance, readability and usability of a query when applying any technique. The Person.Address from AdventureWorks is the best way to exemplify two scenarios in which it makes sense and doesn’t makes sense to combine the logic in two union queries. The respective table stored the addresses for Customers (Sales & Individuals), Vendors and Employees, thus supposing that we would like to see the Owners and Owner Type for each Address we would need to create several unions.

Let’s consider first the Customer-based Addresses – as there are two types of Customers and because the details for each type are stored in different tables, we might end up creating the following query:

As can be seen the two queries are quite similar, most of the tables joined are the same with 3 exceptions: Sales.Store vs. Sales.Individual & Person.Contact. The query could be rewritten using a CASE and left joins instead of inner joins as follows:

Similar queries could be written in order to get the Vendor and Employee details, and as can be seen also these queries share many similarities.

Does it makes sense to use the above technique also in this case? Maybe it would be acceptable to merge the Vendor with Employee Addresses queries, though does it make sense to do merge also the Customer Addresses using CASE & left joins. The queries are not so complex so it’s possible to do that, though I think a decision should be taken only when has been studied the performance of the two approaches.

Pivoting Behavior

Another technique when a CASE could help eliminate several joins to the same table is presented in a post on List Based Aggregations, the CASE function being used together with aggregated functions in order to cumulate the On Hand for several locations and show it on the same line. This approach was quite useful before the PIVOT operator was introduced in SQL Server 2005 (see Pivot operator example), and is still is in case is needed to select more than 1 attribute for each line.

💎SQL Reloaded: Just In CASE II (Clauses and Joins)

The CASE function could appear in all typical clauses – SELECT, WHERE, GROUP BY, ORDER BY and even in JOIN predicates. If the impact on performance is quite small when CASE is used inside of a SELECT clause, the decrease in performance could be quite high when the CASE is used in searches – WHERE constraints or JOIN predicates. Oracle provides function-based indexes, the smart use of such functions could improve in theory searches performed with CASE/DECODE functions, unfortunately as far I know there is no similar feature available on MS SQL Server family.

The use of CASE functions in WHERE clause is very rare because it could be replaced with several OR operators, the only benefit results maybe from syntax’ clearness or by checking values’ type before using them – the case of numeric and date values. Even if the below example doesn’t make sense from a business point of view, it shows a possible use of CASE within the WHERE clause.

There are systems that misuse an alphanumeric attribute in order to store a numeric or date value, therefore before using the respective values in computations it’s necessary/recommended to test whether the values have the expected data type. I met also cases in which the range of dates covered by a date type in a RDBMS is narrower than a similar date type of other RDBMS (e.g. SQL Server vs. Oracle), in such case the test for valid dates is also necessary. For this purpose the CASE could be used together with the IsDate, respectively IsNumeric, functions as in below examples.

The same data type handling could be used also in joins when with the misuse of alphanumeric data types, the numeric values represent references to other tables, or the date values could be used in the logic of a join. When the misused attribute is on the main table then the impact on performance is minimal, though when the misused attribute appears in the referenced table then the performance could be impacted considerably because is less likely than an index is built on the respective alphanumeric attribute, a full table scan being done. In order to exemplify this situation I will use the dbo.SplitList table-valued function, which takes a list of value and creates a table out of it, and join it with the Production.Product table. I used to Cast the Value attribute from dbo.SplitList to an integer in order to simulate the handling of numeric values.

Even if this use might be look strange for some people, it isn’t utopic, I met many cases in which Users wanted to pull for example all the Products available in the system having the values for a given attribute in a given list, and they wanted to see also the values that had no match. In theory could be easier to rewrite the query as follows, and use in Excel a VLOOKUP function to determine the attributes with no match.

Of course, in this case it would have been much easier to just use the IN operator and provide the list of values.

03 February 2010

🕋Data Warehousing: Snowflake Schema (Definitions)

"An extension of a star schema such that one or more dimensions are defined by multiple tables. In a snowflake schema, only primary dimension tables are joined to the fact table. Additional dimension tables are joined to primary dimension tables. " (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A variation on dimensional schema design in which dimension tables are further normalized, split into multiple tables based on hierarchies in the data." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A single fact table which joins to many dimension tables, with each dimension normalized as two or more tables." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A variation of the star schema in which the business dimensions are implemented as a set of normalized tables. The resulting diagram resembles a snowflake." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"A normalized version of the STAR schema in which dimension tables are partially or fully normalized. Not generally recommended because it compromises query performance and simplicity for understanding." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A star schema in which dimension tables reference other tables." (Oracle, "Database SQL Tuning Guide Glossary", 2013)