SQL Troubles

01 April 2010

💎SQL Reloaded: The Power of Joins II (Horizontal Joins)

Introduction

    A horizontal join allows retrieving data from two or more datasets by merging/comparing the rows from each involved dataset, the join of rows being done on vertical, thus a row from the final result dataset containing only the attributes from one of the input datasets, the attributes being matched based on their order in the dataset (first attribute with first attribute from each dataset, second with the second, and so on), the output attribute taking the name of the correspondent attribute from the first. It worth to mention that the comparison is done at row level, thus we could talk about distinct rows if at least one of the attributes is having different values.

    There are two important constraints horizontal joins comes with: the datasets involved in a vertical join must have the same number of attributes and the joined attributes must have the same data type. If one of the attributes is not available in one of the datasets, a NULL or other default value could be provided for the missing attribute, while in case data types can’t be converted implicitly then the conversion must be done explicitly to a data type sufficient to accommodate the data type of the attributes joined under the same attribute.

      The general syntax for an horizontal join could be written as follows:

-- Horizontal Join syntax 
SELECT A.Attribute1 
[, A.Attribute] 
FROM TABLE1 A 
[WHERE <constraints>] 
UNION | INTERSECT | EXCEPT [ALL] 
SELECT B.Attribute1 
[, B.Attribute] 
FROM TABLE2 B 
[WHERE <join_constraints>]  
[ORDER BY <order_attributes>]

Notes:

1. There are 3 horizontal join operators UNION for union (AUB), INTERSECT (A∩B) for intersection and EXCEPT for subtraction (A\B), and only UNION could be used in combination with ALL keyword in order to show ignore the duplicates. Please note that only the union and intersection operators are commutative: AUB = BUA, respectively A∩B = B∩A, while A\B<>B\A.

2. The SELECT statement for each dataset could include any complex logic supported, including aggregates, window ranking and aggregate functions, virtual tables (views, user-defined functions), vertical joins, etc.

3. The horizontal join may contain only one ORDER BY clause placed at the end of the query, and with it could be used directly the attributes’ name or alias given to the columns from first dataset. In case aliases are used, and especially when many attributes are involved or then when the attribute involve complex logic, it makes sense to use the same alias also for the other datasets involved in the join.

For demonstrating the horizontal operations I will use the following two datasets based on the same table, respectively Products table from AdventureWorks database.

-- Dataset A 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 

-- Dataset B 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250

The UNION Join

The UNION join, also referred as the outer union, allows combining two datasets in one dataset formed from the combined rows of each data set, the eventual duplicate rows being eliminated. In case are needed also the duplicates, then the ALL keyword could be used in combination with the UNION (also referred as UNION ALL operator), this technique being recommended when the two datasets have no common records because when a simple UNION is used the database engine still checks for duplicates.

-- UNION example (AUB) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 
UNION 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250

Note:
Because the above two datasets are using the same table the final output is the same with the output of the below query, the reunion being thus applied directly to the constraints used. It is actually recommended to avoid using unions:

-- alternative for UNION example (AUB) 
SELECTITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 250

In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the UNION, respectively UNION ALL operators:

The INTERSECT Join

The INTERSECT operator combines the two datasets in one dataset formed from the rows found to be present in both datasets.

-- INTERSECT example (A∩B) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 
INTERSECT 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250

Note:
Because the above two datasets are using the same table the final output is the same with the output of the following query, the intersection could be thus applied directly to the constraints used:

-- alternative for INTERSECT example (A∩B) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 200

In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the INTERSECT operator:

horizontal joins - intersect presentation

The EXCEPT Join

-- EXCEPT example (A\B) 
SELECTITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200 
EXCEPT 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
,ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250 

-- EXCEPT example (B\A) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 250 
EXCEPT 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 200

As can be seen from the above two examples there are two possibilities of subtracting two datasets – A\B, respectively B\A, selecting the rows which are only in A and not in B, respectively the rows which are in B and not in A.

Note:
Because the above two datasets are using the same table the final output is the same with the output of the following queries, the subtraction could be thus applied directly to the constraints used:

-- alternative for SUBTRACTION example (A\B) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost 
, ITM.MakeFlag 
FROM Production.Product ITM 
WHERE ITM.StandardCost >= 0 
    AND ITM.StandardCost < 100 

-- alternative for SUBTRACTION example (B\A) 
SELECT ITM.ProductID 
, ITM.Name ProductName 
, ITM.ProductNumber 
, ITM.StandardCost , ITM.MakeFlag 
FROMProduction.Product ITM 
WHERE ITM.StandardCost > 200 
    AND ITM.StandardCost <= 250

In order to exemplify what’s happening in the background I will use only two rows from each of the two datasets used as input, on the right side being shown the output of applying the EXCEPT for A\B, respectively B\A:

Combining Operators

For the sake of simplicity all the above exemplifications are using only two datasets, though multiple datasets could be joined in this way, here is an example for the UNION ALL based on three datasets (please note that I avoided to list the attributes):

-- multiple UNION 
SELECT * 
FROMProduction.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 50 
UNION ALL
SELECT * 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 100 AND 150 
UNION ALL 
SELECT * 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 200 AND 250

Note:
Also in this case the union could be applied directly on the constraints, though because there is no overlap between any combination of the three ranges, the constraint needs to be modified accordingly:

  
-- multiple UNION 
SELECT * 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 0 AND 50 
    OR ITM.StandardCost BETWEEN 100 AND 150 
    OR ITM.StandardCost BETWEEN 200 AND 250

When the union and intersection operations are used with multiple datasets it doesn’t matter the order of the datasets because the two operations are associative: (AUB)UC=AU(BUC)=AUBUC, respectively (A∩B)∩C=A∩(B∩C)=A∩B∩C, though not the same could be said about subtraction because (A\B)\C <> A\(B\C) or the cases in which operators are combined, for example (A\B)UC<>A\(BUC), so is needed to use parentheses in order to avoid eventual surprises.

-- (A\B)UC 
( 
    -- subtraction 
    SELECT * 
    FROM Production.Product ITM 
    WHERE ITM.StandardCost BETWEEN 0 AND 150 
    EXCEPT 
    SELECT * 
    FROM Production.Product ITM 
    WHERE ITM.StandardCost BETWEEN 100 AND 150 
) 
UNION 
SELECT * 
FROM Production.Product ITM 
WHERE ITM.StandardCost BETWEEN 150 AND 250

Note:
Knowing the basics of Set Theory is important in making your queries returning the expected results, the use of some of the identities allowing to simplify considerably the queries sometimes. Without a set-mindset I think it will be difficult to solve such problems.

30 March 2010

💎SQL Reloaded: The Power of Joins – Part I: Introduction

Joins are the heart of RDBMS, they allowing to retrieve data (bring data together in a single result dataset) from multiple (related) table-like objects (tables, views, user-defined functions) by defining how and which records are selected from such objects. In order to reduce the complexity of such a topic I will introduce the joins with regard to tables, the concepts applying also to views and user-defined functions, especially when they are regarded as “virtual” tables. In theory any two tables could be joined independently on whether a direct or indirect relation exists between them, it can be a foreign key, a natural key, a date or any other attribute that could be used to get a certain view into the data. Typically a join involves a minimum of two tables, but before going dipper into the topic it makes sense to look a little at the various terms used to refer to the tables joined and the anatomy of a join.

There are various ways of naming the tables involved in a join:
- left vs. right tables, where the left table is the first listed table participating in the join and the right table the second listed table participating in the join;
- parent vs. dependent tables, refers to tables between which exists a direct relationship, in other words exists a primary key and a foreign key, the dependent table being the table containing the foreign key, and the parent table the table containing the primary key;
- parent vs. child tables, is similar with the previous definition, the child table being an alternative naming for the dependent table, in addition we can talk also about grandchildren when a child table is referenced at its turn by other table called thus, when the three tables are considered together, the grandchild table;
- referent vs. referenced tables, another alternative for naming the tables between which exists a direct relationship, the referenced table being the table containing the foreign key, and the referent table the ones containing the primary key;
- inner vs. outer tables, these terms are used only in the case of outer joins, the inner table being the table from which are selected all the records, while the outer table is the table from which are selected only the matched records, therefore the two are also called sometimes the row-preserving table, respectively the null-supplying table;

None of the above pair of naming conventions are perfect because they can’t be used to address all the join situations in which tables are involved, and some of the terms may be used interchangely, especially when a term is preferred in order to send across adequate meaning, for example referring to a table as a inner table then we imply also that we deal with a left or right join. I realized that in time I used all the the above terms in a situation or another, however I prefer to use the left vs. right tables denomination, and occasionally inner vs. outer tables when referring to left or right joins.

In addition to the two tables joined a query typically includes also a join operator that specifies the join type (union, inner, outer or cross join), the join constraints that specifies which attributes are used to join the tables and the logical operators used, and eventually the non-join constraints referring to other constraints than the join constraints. I’m saying typically because either of the two could be missing, thus if the the join constraint and operator miss then we deal with a Cartesian join, while if the join constraint is appearing in the WHERE instead of FROM clause then there is no need to specify the join operator or in case of a UNION there is no need to specify the join constraint, though I will come back later on this topic.

The work with joins is deeply rooted in the Set Theory (the tables joined referred also as datasets) and the set operations – union, subtraction and intersection, their projection in world of databases coming, from my point of view and not sure if the database literature really discusses about this, into two flavors – vertical vs. horizontal joining. By vertical joining I’m referring to the intrinsic character of joins, in the result dataset the attributes from the two joined datasets appearing on the same row, while in the horizontal joining they never appear on the same row, thus a row being formed exclusively from the records of only one of the tables involved in the join. Thus we can discuss about vertical/horizontal joins as projection of union, intersection and subtraction operations, each of them with different connotations, the horizontal operations being actually similar to the ones from set theory. If we discuss about joins then we inevitably we have to mention also the anti-joins and semi-joins, a special type of queries based on NOT IN, respectively EXISTS operators.

|>> Next Post

25 March 2010

🕋Data Warehousing: Axis (Definitions)

"A set of tuples. Each tuple is a vector of members. A set of axes defines the coordinates of a multidimensional dataset. For more information about axes, see your OLE DB documentation." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"An axis specifier indicates the relationship between the nodes selected by the location step and the context node. Examples of axis specifiers include child, parent, and ancestor." (Michael Coles, "Pro T-SQL 2008 Programmer's Guide", 2008)

"A set of tuples. Each tuple is a vector of members. A set of axes defines the coordinates of a multidimensional data set." (Microsoft, "SQL Server 2012 Glossary", 2012)

"Specifier that indicates the relationship between the nodes selected by the location step and the context node. Examples of axis specifiers include child, parent, and ancestor." (Miguel Cebollero et al, "Pro T-SQL Programmer’s Guide" 4th Ed, 2015)

🧭Business Intelligence: Enterprise Reporting (Part VII: A Reporting Guy’s Issues)

Business Intelligence Series

Introduction

For more than 6 years, between many other tasks (software development, data migrations, support), I was also the "reporting guy", taking care of ad-hoc reporting requirements, building a data warehouse and a reporting solution for the customer I worked for. 99% of the reports were based on the two ERP systems (Oracle e-Business Suite & IFS-iV) the customer had in place during this time, fact that helped me learn a lot about the data architecture of such systems, about processes, data, data quality and many other issues related to data, how to do and how not do things. In this post I just want to highlight some of the issues I was confronted with, and I don’t intend to point the finger at anybody, so I apologize if anybody is offended!

The Lack of Knowledge about the Business

Even if it’s hard to believe, the main issue was revolving around the lack of relevant documentation on applications, especially on database models and processes, or, even if there were such documents, they were not updated and the value of true of the information contained were supposed to be always checked against the data. Of course, there are always knowledge workers from whom valuable information could be elicited though they were not always available and many of them are highly specialized in their field. Therefore, one needs to interview multiple users to build a close to complete picture, and even then, one must check the newly acquired information against the data! From time to time, one may even find out that the newly acquired information doesn’t entirely match the reality, that there are always exceptions and (business) rules forgotten or not known.

Sometimes, it’s easier to derive knowledge directly from the data, table structure and other developers’ experience (e.g. blogs, books, forums) rather than hunting down the knowledge workers. Things aren't that bad, despite the reengineering part, in the end one manages to get the job done, though it takes more time. Sometimes it took even 2-3 more time to accomplish a task, time for which one could found better use. However, in time, accumulating more experience, I become proactive by exploring (and mapping) the unknown "territories" in the breaks between tasks, a fact that allowed me to easier fulfill users’ reporting requests.

Oracle e-Business Suite

During the past 3 years I have been supporting mainly Oracle e-Business Suite (EBS) users with reports and knowledge about the system, and therefore most of the issues were related to it. In addition to its metadata system implemented in system table structure, Oracle tried to build an external metadata system for its ERP system, namely Metalink (I think it was replaced last year), though there were/are still many gaps in documentation. It’s true that many such gaps derive from the customizations performed when implementing the ERP system, though I would estimate that they qualify only for 20% of the gaps and refer mainly to the Flex Fields (customer-defined fields) used for the various purposes.

A second important issue was related to the Oracle database engine, were several bugs not patched that didn’t allowed me to use SQL ANSI 92 syntax for linking more than 6-7 tables to a parent table, fact that made me abuse of inline views to solve this issue; even if Oracle had for long a patch to address this, it wasn’t deployed by admins, maybe from well supported reasons. A third issue was related to the different naming conventions used for the same attributes from the source system, mainly a result of the fact that solutions brought from other bought vendors were integrated with a minimum of changes. A fourth issue is related to the poor UI and navigation, basic if we consider the advances made in web technologies during the past years. Conversely, given the complexity of an ERP system, it’s challenging to change the UI at this scale.

Self-Service BI

There's the belief that everybody can write a query and/or of an ad-hoc report. It’s true that writing a query is a simple task and in theory anybody could learn to it without much of effort, though there are other aspects related to Software Engineering and Project Management, respectively related to a data professional's experience than need to be considered. There are aspects like choosing the right data source, right attributes and level of detail, design the query or solution for the best performance (eventually building an index or using database objects that allow better performance) and reuse, use adequate business rules (e.g. ignoring inactive records or special business cases), synchronize the logic with other reports otherwise two people will show the management distinct numbers, mitigate the Data Quality and Deliverables Quality issues, identify the gaps between reports, etc.

In addition, having users create personal solutions instead of using a more standardized approach is quite risky because the result is a web of such siloed solutions over which there's no control from a strategic and/or data security point of view. Conversely, the users need the possibility to analyze the data by themselves (aka self-service BI), to join data coming from multiple sources. Therefore, special focus should be given also to such requirements, though once their reporting needs have been stabilized, they should be moved, if possible, to a more standardized solution.

When multiple developers are approaching reporting requirements they should work as a team and share knowledge not only on the legacy system but also on users’ requirements, techniques used and best practices. Especially when they are dispersed all over the globe, I know it’s difficult to bring cohesion in a team, make people produce deliverables as if they were written by the same person, though not so impossible to achieve with a little effort from all the parties involved.

Why I’m mentioning this? The problem is that the more variability is introduced in deliverables, greater the risk is to have the quality of deliverables questioned, in time leading to users not adopting a system or preferring to use one resource in the detriment of another. Moreover, must be considered also the effort needed to find the gaps between reports, to modify deliverables to expectations, etc. From this perspective is always a good idea to document at least at minimum all deliverables, detailing the scope and particularities of the respective request. I know that many believe that code is self-explanatory and needs no additional documentation, though when the basic needed documentation is not available, it's occasionally challenging to intuit the context and identify the problem, respectively why a technique or a certain level of detail was preferred, or why some constraints were used.

Outsourcing

Outsourcing is a hot topic these days, when in the context of the current economic crisis organizations are forced to reduce the headcount and cut costs, and thus this has inevitably touched also the reporting area. Outsourcing makes sense when the interfaces between service providers and the customers are well designed and implemented. Beyond the many benefits and issues outsourcing approaches come with, people have to consider that for a developer to create a report is needed not only knowledge about the legacy systems and tools used to extract, transform and prepare the data, but also knowledge about the business itself, about users expectations and organization’s culture, the latter two points being difficult to experience in a disconnected and distributed environment.

Of course, even if delivering the same result and quality is possible as if the developers were onsite, in the end outsourcing implies additional iterations and overwork, the users need to be trained to specify the reporting requirements adequately or a special resource needs to be available to translate the requirements between the parties involved, lot of back-and-forth communication and all the other issues deriving from it.

Outsourcing makes sense from a reporting perspective, though it might take time to become efficient. Anyway, the decisions for this approach are usually taken at upper management level. From a reporting guy's perspective, if I consider the amount of additional effort and time spent to deliver comparable quality, I will say "No" to an outsourcing model when the time used to build something is just shifted in managing the communication with the outsourcer, writing emails after emails for issues that could have been solved in a 10-minute meeting. Probably the time and money can be invested in other resources that better enable the process.

Previous Post <<||>> Next Post

Written: Mar-2010, Last Reviewed: Apr-2024

🧊Data Warehousing: Mea Culpa (Part I: A Personal Journey)

Any discussion on data warehousing topics, even unconventional, can’t avoid to mention the two most widely adopted concepts in data warehousing, B. Inmon vs. R. Kimball’s methodologies. There is lot of ink consumed already on this topic and is difficult to come with something new, however I can insert in between my experience and personal views on the topic. From the beginning I have to state that I can’t take any of the two sides because from a philosophical viewpoint I am the adept of “the middle way” and, in addition, when choosing a methodology we have to consider business’ requirements and objectives, the infrastructure, the experience of resources, and many other factors. I don’t believe one method fits all purposes, therefore some flexibility is needed into this concern even from most virulent advocates. After all in the end it counts the degree to which the final solution fits the purpose, and no matter how complex and perfect is a methodology, no matter of the precautions taken, given the complexity of software development projects there is always the risk for failure.

B. Inmon defines the data warehouse as a “subject-oriented, integrated, non-volatile and time-varying collection of data in support of the management’s decisions” [3] - subject-oriented because is focused on an organization’s strategic subject areas, integrated because the data are coming from multiple legacy systems in order to provide a single overview, time-variant because data warehouse’s content is time dependent, and non-volatile because in theory data warehouse’s content is not updated but refreshed.

Within my small library and the internet articles I read on this topic, especially the ones from Kimball University cycle, I can’t say I found a similar direct definition for data warehouse given by R. Kimball, the closest I could get to something in this direction is the data warehouse as a union of data marts, in his definition a data mart is “a process-oriented subset of the overall organization’s data based on a foundation of atomic data, and that depends only on the physics of the data-measurement events, not on the anticipated user’s questions” [2]. This reflects also an important difference between the two approaches, in Inmon’s philosophy the data marts are updated through the data warehouse, the data in the warehouse being stored in a 3rd normal form, while in data marts are multidimensional and thus denormalized.

Even if it’s a nice conceptual tool intended to simplify data manipulation, I can’t say I’m a big fan of dimensional modeling, mainly because it can be easily misused to create awful (inflexible) monster models that can be barely used, sometimes being impossible to go around them without redesigning them. Also the relational models could be easily misused though they are less complex as physical design, easier to model and they offer greater flexibility even if in theory data’s normalization could add further complexity, however there is always a trade between flexibility, complexity, performance, scalability, usability and reusability, to mention just a few of the dimensions associated with data in general and data quality in particular.

In order to overcome dimensional modeling issues R. Kimball recommends a four step approach – first identifying the business processes corresponding to a business measurement or event, secondly declaring the grain (level of detail) and only after that defining the dimensions and facts [1]. I have to admit that starting from the business process adds a plus to this framework because in theory it allows better visibility over the processes, supporting processed-based data analysis, though given the fact that a process could span over multiple data elements or that multiple processes could partition the same data elements, this increases the complexity of such models. I find that a model based directly on the data elements allows more flexibility in the detriment of the work needed to bring the data together, though they should cover also the processes in scope.

Building a data warehouse it’s quite a complex task, especially if we take into consideration the huge percentage of software projects failure that holds also in data warehousing area. On the other side not sure how much such statistics about software projects failure can be taken ad literam because different project methodologies and data collection methods are used, not always detailed information are given about the particularities of each project, it would be however interesting to know what the failure rate per methodology. Occasionally there are some numbers advanced that sustain the benefit of using one or another methodology, and ignoring the subjective approach of such justifications they often lack adequate details to support them.

My first contact with building a data warehouse was almost 8 years ago, when as part of the Asset Management System I was supposed to work on, the project included also the creation of a data warehouse. Frankly few things are more scaring than seeing two IT professionals fighting on what approach to use in order to design a data warehouse, and is needless to say that the fight lasted for several days, calls with the customer, nerves, management involved, whole arsenal of negotiations that looked like a never ending story.

Such fights are sometimes part of the landscape and they should be avoided, the simplest alternative being to put together the advantages and disadvantages of most important approaches and balance between them, unfortunately there are still professionals who don’t know how or not willing to do that. The main problem in such cases is the time which instead of being used constructively was wasted on futile fights. When lot of time is waisted and a tight schedule applies, one is forced to do the whole work in less time, leading maybe to sloppy solutions.

A few years back I had the occasion to develop one data warehouse around the two ERP systems and the other smaller systems one of the customers I worked for was having in place, SQL Server 2000 and its DTS (Data Transformation Services) functionality being of great help for this purpose. Even if I was having some basic knowledge on the two data warehousing approaches, I had to build the initial data warehouse from scratch evolving the initial solution in time along several years.

The design was quite simple, the DTS packages extracting the data from the legacy systems and dumping them in staging tables in normalized or denormalized form, after several simple transformations loading the data in the production tables, the role of the multidimensional data marts being played successfully by views that were scaling pretty well to the existing demands. Maybe many data warehouse developers would disregard such a solution, though it was quite an useful exercise and helped me to easier understand later the literature on this topic and the issues related to it. In addition, while working on the data conversion of two ERP implementations I had to perform more complex ETL (Extract Transform Load) tasks that the ones consider in the data warehouse itself.

In what concerns software development I am an adept of rapid evolutional prototyping because it allows getting customers’ feedback in early stages and thus being possible to identify earlier the issues as per customers’ perceptions, in plus allowing customers to get a feeling of what’s possible, how the application looks like. The prototyping method proved to be useful most of the times, I would actually say all the times, and often was interesting to see how customers’ conceptualization about what they need changed with time, changes that looked simple leading to partial redesign of the application. In other development approaches with long releases (e.g. waterfall) the customer gets a glimpse of the application late in the process, often being impossible to redesign the application so the customer has to live with what he got. Call me “old fashion” but I am the adept of rapid evolutional prototyping also in what concerns the creation of data warehouses, and even if people might argue that a data warehousing project is totally different than a typical development project, it should not be forgotten that almost all software development projects share many particularities from planning to deployment and further to maintenance.

Even if also B. Inmon embraces the evolutional/iterative approach in building a data warehouse, from a philosophical standpoint the rapid evolutional prototyping applied to data warehouses I feel it’s closer to R. Kimball’s methodology, resuming in choosing a functional key area and its essential business processes, building a data mart and starting from there building other data marts for the other functional key areas, eventually integrating and aligning them in a common solution – the data warehouse. On the other side when designing a software component or a module of one application you have also to consider the final goal, as the respective component or module will be part of a broader system, even if in some cases it could exist in isolation. Same can be said also about data marts’ creation, even if sometimes a data mart is rooted in the needs of a department, you have to look also at the final goal and address the requirements from that perspective or at least be aware of them.

Previous Post <<||>> Next Post

References:

[1] M. Ross R. Kimball, (2004) Fables and Facts: Do you know the difference between dimensional modeling truth and fiction? [Online] Available from: http://intelligent-enterprise.informationweek.com/info_centers/data_warehousing/showArticle.jhtml;jsessionid=530A0V30XJXTDQE1GHPSKH4ATMY32JVN?articleID=49400912 (Accessed: 18 March 2010)

[2] R. Kimball, J. Caserta (2004). The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Wiley Publishing Inc. ISBN: 0-7645-7923 -1

[3] Inmon W.H. (2005) Building the Data Warehouse, 4th Ed. Wiley Publishing. ISBN: 978-0-7645-9944-6

24 March 2010

🎓Knowledge Management: Definitions (Part I: The Stored Procedure Case)

I just started to read the SQL Server Programmers Blog’s post on Stored Procedures and I’m having a problem with the first sentence: "A stored procedure is a group of Transact-Sql statements compiled into a single execution plan". Not sure if the respective statement was supposed to define what a stored procedure is or highlight its most important characteristics; if it's the first choice then, if I’m not mistaking, also a user-defined function (UDF) (referring here especially to multi-statement UDFs) is a group of compiled T-SQL Statements for which is created an execution plan, and a user-defined function is not a stored procedure! In general I’m having a simple rule I’m trying to use when defining concepts - if in doubt with a definition for a given concept stick to the documentation or to the "classics" – definitions provided by well known professionals or sources. (SQL Server 2016 introduced natively compiled UDFs)

From the few books I roughly reviewed on SQL Server-related topics I liked P. Petrovic’s approach for defining the stored procedure, he introducing first the batch defined as “a sequence of Transact-SQL statements and procedural extensions” that “can be stored as a database object, as either a stored procedure or UDF [3]. Now even if I like the approach I’m having a problem on how he introduced the routine because he haven’t gave a proper definition and just mentioned that a routine can be either a stored procedure or UDF [3]. On the other side it’s not always necessary to introduce terms part of the common shared conceptual knowledge, though I find it useful and constructive if such a definition would have been given.

A definition should be clear, accurate, comprehensive, simple and should avoid confusion, eventually by specifying the intrinsic/extrinsic characteristics that differentiate the defined object from other objects, and the context in which is used. Unfortunately there are also definitions given by professionals that don't meet this criteria, for example A. Watt defines a stored procedure as "a module of code that allows you to reuse a desired piece of functionality and call that functionality by name" [5], while R. Dewson defines it as “a collection of compiled T-SQL commands that are directly accessible by SQL Server” [2], R. Rankins et. al as “one or more SQL commands stored in a database as an executable object” [4] or D. Comingore et Al “stored collections of queries or logic used to fulfill a specific requirement” [1]. All definitions are valid and in spite similarities I find them incomplete because they could be used as well for defining an UDF.

Of course, by reading the chapter or the whole book in which the definition is given, by comparing the concept with other concepts, people arrive to a more accurate form of the definition(s), though that’s not always efficient and constructive because the reader has to “fish” for all the direct-related concepts and highlight the similarities/differences between them (e.g. stored procedures vs. UDF vs. views vs. triggers). Usually for easier assimilation and recall I like to use Knowledge Mapping structures/techniques like Mind Maps or Concept Maps that allows seeing the relations (including similarities/differences) between concepts and even identify new associations between them. In addition, when learning concepts it matters also the form/sequence in which the concepts are presented (maybe that’s why many people prefer a PowerPoint presentation than reading a whole book).

Actually a definition could be built starting from the identified characteristics/properties of concepts and the similarities/differences with other concepts For example Bill Inmon defines the data warehouse as “a subject-oriented, integrated, time-varying, non-volatile collection of data in support of the management's decision-making process” and even if we could philosophy also on this subject and the intrinsic characteristics of a data warehouse, it reflects Bill Inmon’s conception in a clear, simplistic and direct manner. I could attempt to define a stored procedure using the following considerations:
- it encapsulates T-SQL statements and procedural extensions (input/output parameters, variables, loops, etc.);
- it can be executed as a single statement and thus not reused in other DML or DDL statements;
- it’s a database object;
- caches, optimizes and reuses query execution plans;
- allows code modularization allowing thus code reuse, easier/centralized code maintenance, move business logic (including validation) on the backend;
- minimizes the risks of SQL injection attacks;
- can return multiple recordsets and parameterized calls reducing thus network traffic;
- enforce security by providing restricted access to tables;
- allows some degree of portability and standardized access (given the fact that many RDBMS feature stored procedures).
- allows specifying execution context;
- allows using output parameters, cursors, temporary tables, nested calls (procedure in procedure), create and execute dynamic queries, access to system resources, trap and customize errors, perform transaction-based operations.

Some of the mentioned characteristics apply also to other database objects or they are not essential to be mentioned in a general definition, thus giving a relatively accurate definition for stored procedures is not an easy task. Eventually I could focus on the first three points mentioned above, thus an approximate definition would reduce to the following formulation: “a stored procedure is a database object that encapsulates T-SQL statements and procedural extensions, the object being executed exclusively within a single statement using its name and the eventual parameters”. This definition might not be the best, though it’s workable and could be evolved in case new knowledge is discovered or essential new functionality changes are introduced.

Note:
Unfortunately, in several books read on data warehousing topics I found similar incomplete/vague definitions that seem inadequate given the complexity of the subjects exposed, letting thus lot of place for divagations. Of course, in technical books the weight is more on the use of exposed concepts rather on concepts’ definition, though I’m expecting more from such books!

Disclaimer:

I’m not trying to question the impressive work of other professionals, and I'm aware that I can mistake as well! I’m just trying to point out a fact I remarked and I consider as important: trying to give a proper accurate definition of the terms introduced in a book or any other form of communication can prove to be challenging.

References:
[1] Comingore D., Hinson D. (2006). Professional SQL Server™ 2005 CLR Programming. Wiley Publishing. ISBN: 978-0-470-05403-1.
[2] Dewson R. (2008). Beginning SQL Server 2008 for Developers: From Novice to Professional. Apress. ISBN: 978-1-4302-0584-5
[3] Petkovic D. (2008). Microsoft® SQL Server™ 2008: A Beginner’s Guide. McGraw-Hill. ISBN: 0-07-154639-1
[4] Rankins R., Bertucci P., Gallelli C., Silverstein A.T, (2007) Microsoft® SQL Server 2005 Unleashed. Sams Publishing. ISBN: 0-672-32824-0
[5] Watt A. (2006). Microsoft SQL Server 2005 for Dummies. Wiley Publishing. ISBN: 978-0-7645-7755-0.
[6] Inmon W.H. (2005) Building the Data Warehouse, 4th Ed. Wiley Publishing. ISBN: 978-0-7645-9944-6

🕋Data Warehousing: Data Lake (Definitions)

"If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." (James Dixon, "Pentaho, Hadoop, and Data Lakes", 2010) [sorce] [first known usage]

"At its core, it is a data storage and processing repository in which all of the data in an organization can be placed so that every internal and external systems', partners', and collaborators' data flows into it and insights spring out. [...] Data Lake is a huge repository that holds every kind of data in its raw format until it is needed by anyone in the organization to analyze." (Beulah S Purra & Pradeep Pasupuleti, "Data Lake Development with Big Data", 2015)

"Data lakes are repositories of raw source data in their native format that are stored for extended periods." (Saumya Chaki, "Enterprise Information Management in Practice", 2015)

"A repository of data used to manage disparate formats and types of data for a variety of uses." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"A storage system designed to hold vast amounts of raw data in its native (ingested) format, usually in a flat or semi-structured format. Extract, transform, and load (ETL) operations are usually applied to data lakes to extract local data marts for downstream computation." (Benjamin Bengfort & Jenny Kim, "Data Analytics with Hadoop", 2016)

"Data Lake is an analytics system that supports the storing and processing of all types of data." (Maritta Heisel et al, "Software Architecture for Big Data and the Cloud", 2017)

"A data lake is a central repository that allows you to store all your data—structured and unstructured - in volume […]" (Holden Ackerman & Jon King, "Operationalizing the Data Lake", 2019)

"A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning." (Piethein Strengholt, "Data Management at Scale", 2020)

"A repository for storing unstructured and structured data that is downloaded in its raw form and stored by a highly scalable, distributed files system known as open source." (Marcin Flotyński et al, "Non-Technological and Technological (SupTech) Innovations in Strengthening the Financial Supervision", 2021)

"Data lakes are massive repositories for original, raw and unstructured data which is collected from various sources across a smart city. The data from data lakes can be cleansed and transformed for further analytics and modeling." (Vijayaraghavan Varadharajan & Akanksha Rajendra Singh, "Building Intelligent Cities: Concepts, Principles, and Technologies", 2021)

"A data lake is a central location, that holds a large amount of data in its native, raw format, as well as a way to organize large volumes of highly diverse data." (databricks) [source]

"A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale." (Amazon) [source]

"A data lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format and are in addition to the originating data stores." (Gartner)

"A data lake is a collection of long-term data containers that capture, refine, and explore any form of raw data at scale. It is enabled by low-cost technologies that multiple downstream facilities can draw upon, including data marts, data warehouses, and recommendation engines." (Teradata) [source]

"A data lake is a large and diverse reservoir of corporate data stored across a cluster of commodity servers running software, most often the Hadoop platform, for efficient, distributed data processing." (Qlik) [source]

"A data lake is a place to store your structured and unstructured data, as well as a method for organizing large volumes of highly diverse data from diverse sources." (Oracle)

"A Data Lake is a service which provides a protective ring around the data stored in a cloud object store, including authentication, authorization, and governance support." (Cloudera) [source]

"A data lake is a type of data repository that stores large and varied sets of raw data in its native format." (Red Hat) [source]

"A data lake is an unstructured data repository that contains information available for analysis. A data lake ingests data in its raw, original state, straight from data sources, without any cleansing, standardization, remodeling, or transformation. It enables ad hoc queries, data exploration, and discovery-oriented analytics because data management and structure can be applied on the fly at runtime, unlike traditional structured data storage which requires a schema on write." (TDWI)

"A storage repository that holds a large amount of raw data in its native format until it is needed." (Solutions Review)

23 March 2010

🕋Data Warehousing: Data Transformation (Definitions)

"A set of operations applied to source data before it can be stored in the destination, using Data Transformation Services (DTS). For example, DTS allows new values to be calculated from one or more source columns, or a single column to be broken into multiple values to be stored in separate destination columns. Data transformation is often associated with the process of copying data into a data warehouse." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The process of reformatting data based on predefined rules. Most often identified as part of ETL (extraction, transformation, and loading) but not exclusive to ETL, transformation can occur on the CDI hub, which uses one of several methods to transform the data from the source systems before matching it." (Evan Levy & Jill Dyché, "Customer Data Integration", 2006)

"Any change to the data, such as during parsing and standardization." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"A process by which the format of data is changed so it can be used by different applications." (Judith Hurwitz et al, "Service Oriented Architecture For Dummies" 2nd Ed., 2009)

"Converting data from one format to another|making the data reflect the needs of the target application. Used in almost any data initiative, for instance, a data service or an ETL (extract, transform, load) process." (Tony Fisher, "The Data Asset", 2009)

"Changing the format, structure, integrity, and/or definitions of data from the source database to comply with the requirements of a target database." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The SSIS data flow component that modifies, summarizes, and cleans data." (Microsoft, "SQL Server 2012 Glossary", 2012)

"Data transformation is the process of making the selected data compatible with the structure of the target database. Examples include: format changes, structure changes, semantic or context changes, deduping, and reordering." (Piethein Strengholt, "Data Management at Scale", 2020)

"1. In data warehousing, the process of changing data extracted from source data systems into arrangements and formats consistent with the schema of the data warehouse. 2. In Integration Services, a data flow component that aggregates, merges, distributes, and modifies column data and rowsets." (Microsoft Technet)

17 March 2010

🧮ERP: Implementations (Part II: General Points of Failure)

ERP Implementations Series

Answering to Why many ERP implementations fall short of expectations and how to prevent failure? discussion in LinkedIn’s Oracle ERP group, discussion based on a Webcast on the same topic, I tried to list a few of the possible reasons for failure as follows:

not understanding what an ERP is about - functional and non-functional requirements;
not having trustful partners to accompany though the challenges and close the gaps;
not evaluating/assessing ERP's functionality beforehand: processes, data management, features;
not considering proofs-of-concept;
not getting the accept/involvement of all stakeholders + politics;
not addressing the requirements beforehand, especially in the area of processes;
not evolving/improving your processes;
not addressing the data cleaning/conversion adequately;
not integrating the ERP with the other solutions existing in place;
not having in place a (Master) Data Management vision/policy that addresses especially data quality and data ownership;
not involving the (key) users early in the project;
not training and motivating adequately the users;
lack of a reporting framework, set of reports (reporting tools) that enables users to make most of the ERP;
lack of communication between business and IT professionals;
relying too much on external resources, especially in the PM area;
the migration of workforce inside the organization or outside (e.g. consultants coming and leaving);
inadequate PM, lack of leadership;
the lack of a friendly User Interface (referring to the ERP system itself);
inadequate post-Go Live support from the ERP vendor and business itself;
the lack of an ERP to evolve with the business;
too many defects in the live system (results often from inadequate testing but also vendor related issues);
too many changes on the last 100 m of the project;
organization's culture;
attempting to do too much in lesser time than needed/allocating more time than needed;
not manage the scope adequately.

On the same list one could add also the following reasons:

not understanding business’ needs and the business as a whole;
the inadequate choice/use of methodologies related to PM, Software Development, Data Quality in particular and Data/Knowledge Management in general;
ignoring Software Projects’ fallacies in general and other ERP projects’ failure causes in particular (not learning from others’ mistakes);
ignoring best practices in the ERP/Software Development/Project Management/Data Management fields;
not having a Risk Mitigation (Response) Plan [falls actually under inadequate PM but given its importance deserves to be mentioned separately];
not addressing requirements equidistantly, as opposed to unilaterally (not becoming one department’s project or even one-man-show);
ignoring special reporting requirements during implementation phase;
unrealistic expectations vs. not meeting business’ expectations: ROI, incapacity to answer to business (decision making) questions (actually it refers mainly to the existing reports but also to the system itself when the needed data are not available at the needed grain);
unable to quantify adequately the ROI for the ERP-system.
not making expectations explicit and not communicating them on-time and in a clear manner;
lack of meeting governmental requirements (SOX, IRS, etc).
not monitoring Post-Go Live systems’ use/adoption (e.g. by defining Health/Growth metrics) and addressing adoption issues in-time;
lack of mapping and transferring/distributing the knowledge existing related to the ERP system (experts, processes, documentation, reports, best practices, etc.);
not integrating customer’ customers/vendors’ requirements/needs (supply chain vs. sales chain);
using inadequate technologies/tools to solve ERP-related issues;
the lack of ERP systems to be integrated with new advances in technologies ([3] refers it as technological convergence);
the hidden costs of ERP implementations and Post-Go Live support;
expecting IT/business to solve all the problems;
over-customization of software [1];
over-integration of software;
not having a global technological view (on how the ERP fits in the technological infrastructure);
underestimating project’s complexity, respectively not trying to simplify the overall structure; implementation
unbalanced daily work vs. project workload [2];
choosing wrong time for implementation [2];
engaging in too many corporate projects [3];
multiple vendors on the project [3];
not having clear phases, deliverables, boundaries, accountability, quality control components[3], communication channels defined;
not having an external project audit committee [3];
having the management over- committed [3];
inadequate skill sets [3]
not addressing the loose ends.

Given the multiple mitigation solutions/approaches for each of the above causes and the interconnectedness between them, each of the above topics deserves further elaboration. There is also lot of philosophy involved, some of them are more important than the others, though all of them could be in time a cause for failure. Failure at its turn it’s a vague concept, highly dependent on context, an ERP implementation could be successful based on initial expectations but could fail to be adopted by the business, same as it can meet business expectations on short term but be not so flexible as intended on the long term. In the end it depends also on Users and Management’s perception, on the issues appearing after Go Live and their gravity, though many such issues are inherent, they are just a projection of the evolving business and system’s maturity.

Preventing the failure of an ERP implementation relies on the capacity of addressing all or most important of the above issues, being aware of them and making at least a Risk Mitigation Plan with a few possible solutions to address them. In a project of such a complexity and so many constraints, planning seems to be useless but is a good exercise, it prepares you for the work and events ahead!

Previous Post <<||>> Next Post

References:
[1] Barton P (2001). Enterprise Resource Planning: Factors Affecting Success and Failure [link]
[2] ERP Wire (2009). Analyzing ERP failures in Hershey [link]
[3] Madara E (2007) A Recipe and Ingredients for ERP Failure [link]

16 March 2010

🕋Data Warehousing: Data Pipeline/Pipelining (Definitions)

"A series of operations in an aggregation process." (MongoDb, "Glossary", 2008)

"A series of processes all in a row, linked by pipes, where each passes its output stream to the next." (Jon Orwant et al, "Programming Perl" 4th Ed., 2012)

"Description of the process workflow in sequential order." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"In data processing, a pipeline is a sequence of processing steps combined into a single object. In Spark MLlib, a pipeline is a sequence of stages. A Pipeline is an estimator containing transformers, estimators, and evaluators. When it is trained, it produces a PipelineModel containing transformers, models, and evaluators." (Alex Thomas, "Natural Language Processing with Spark NLP", 2020)

"Abstract concept used to describe where work is broken into several steps which enable multiple tasks to be in progress at the same time. Pipelining is applied in processors to increase processing of machine language instructions and is also a category of functional decomposition that reduces the synchronization cost while maintaining many of the benefits of concurrent execution." (Max Domeika, "Software Development for Embedded Multi-core Systems", 2011)

"A technique that breaks an instruction into smaller steps that can be overlapped" (Nell Dale & John Lewis, "Computer Science Illuminated" 6th Ed., 2015)

[pipeline pattern:] "A set of data processing elements connected in series, generally so that the output of one element is the input of the next one. The elements of a pipeline are often executed concurrently. Describing many algorithms, including many signal processing problems, as pipelines is generally quite natural and lends itself to parallel execution. However, in order to scale beyond the number of pipeline stages, it is necessary to exploit parallelism within a single pipeline stage." (Michael McCool et al, "Structured Parallel Programming", 2012)

"A data pipeline is a general term for a process that moves data from a source to a destination. ETL (extract, transform, and load) uses a data pipeline to move the data it extracts from a source to the destination, where it loads the data." (Jake Stein)

"A data pipeline is a piece of infrastructure responsible for routing data from where it is to where it needs to go and provide any necessary transformations through that process." (Precisely) [source]

"A data pipeline is a service or set of actions that process data in sequence. This means that the results or output from one segment of the system become the input for the next. The usual function of a data pipeline is to move data from one state or location to another."(SnapLogic) [source]

"A data pipeline is a software process that takes data from sources and pushes it to a destination. Most modern data pipelines are automated with an ETL (Extract, Transform, Load) platform." (Xplenty) [source]

"A data pipeline is a set of actions that extract data (or directly analytics and visualization) from various sources. It is an automated process: take these columns from this database, merge them with these columns from this API, subset rows according to a value, substitute NAs with the median and load them in this other database." (Alan Marazzi)

"A source and all the transformations and targets that receive data from that source. Each mapping contains one or more pipelines." (Informatica)

"An ETL Pipeline refers to a set of processes extracting data from an input source, transforming the data, and loading into an output destination such as a database, data mart, or a data warehouse for reporting, analysis, and data synchronization." (Databricks) [source]

"Data pipeline consists of a set of actions performed in real-time or in batches, that captures data from various sources, sorting it and then moving that data through applications, filters, and APIs for storage and analysis." (EAI)