29 March 2010

The Power of Joins – Part I: Introduction

     Joins are the heart of RDBMS, they allowing to retrieve data (bring data together in a single result dataset) from multiple (related) table-like objects (tables, views, user-defined functions) by defining how and which  records are selected from such objects. In order to reduce the complexity of such a topic I will introduce the joins with regard to tables, the concepts applying also to views and user-defined functions, especially when they are regarded as “virtual” tables. In theory any two tables could be joined independently on whether a direct or indirect relation exists between them, it can be a foreign key, a natural key, a date or any other attribute that could be used to get a certain view into the data. Typically a join involves a minimum of two tables, but before going dipper into the topic it makes sense to look a little at the various terms used to refer to the tables joined and the anatomy of a join.

     There are various ways of naming the tables involved in a join:
- left vs. right tables, where the left table is the first listed table participating in the join and the right table the second listed table participating in the join;
- parent vs. dependent tables, refers to tables between which exists a direct relationship, in other words exists a primary key and a foreign key, the dependent table being the table containing the foreign key, and the parent table the table containing the primary key;
- parent vs. child tables, is similar with the previous definition, the child table being an alternative naming for the dependent table, in addition we can talk also about grandchildren when a child table is referenced at its turn by other table called thus, when the three tables are considered together, the grandchild table;
- referent vs. referenced tables, another alternative for naming the tables between which exists a direct relationship, the referenced table being the table containing the foreign key, and the referent table the ones containing the primary key;
- inner vs. outer tables, these terms are used only in the case of outer joins, the inner table being the table from which are selected all the records, while the outer table is the table from which are selected only the matched records, therefore the two are also called sometimes the row-preserving table, respectively the null-supplying table;

      None of the above pair of naming conventions are perfect because they can’t be used to address all the join situations in which tables are involved, and some of the terms may be used interchangely, especially when a term is preferred in order to send across adequate meaning, for example referring to a table as a inner table then we imply also that we deal with a left or right join. I realized that in time I used all the the above terms in a situation or another, however I prefer to use the left vs. right tables denomination, and occasionally inner vs. outer tables when referring to left or right joins.

     In addition to the two tables joined a query typically includes also a join operator that specifies the join type (union, inner, outer or cross join), the join constraints that specifies which attributes are used to join the tables and the logical operators used, and eventually the non-join constraints referring to other constraints than the join constraints. I’m saying typically because either of the two could be missing, thus if the the join constraint and operator miss then we deal with a carthezian join, while if the join constraint is appearing in the WHERE instead of FROM clause then there is no need to specify the join operator or in case of a UNION there is no need to specify the join constraint, though I will come back later on this topic.

      The work with joins is deeply rooted in the Set Theory (the tables joined referred also as datasets) and the set operations – union, subtraction and intersection, their projection in world of databases coming, from my point of view and not sure if the database literature really discusses about this, into two flavors – vertical vs. horizontal joining. By vertical joining I’m referring to the intrinsic character of joins, in the result dataset the attributes from the two joined datasets appearing on the same row, while in the horizontal joining they never appear on the same row, thus a row being formed exclusively from the records of only one of the tables involved in the join. Thus we can discuss about vertical/horizontal joins as projection of union, intersection and subtraction operations, each of them with different connotations, the horizontal operations being actually similar to the ones from set theory. If we discuss about joins then we inevitably we have to mention also the anti-joins and semi-joins, a special type of queries based on NOT IN, respectively EXISTS operators.

No comments: