
17 January 2010

🧭Business Intelligence: Enterprise Reporting (Part II: Levels of Detail)

Business Intelligence

Working with data mainly from the frontend (User Interface), Users have in general limited knowledge of how the data are physically stored in the various systems existing in an organization. When a new report is needed, they point out the attributes they know from the screens they are working with, and it falls within developers’ duties to figure out whether the “soup of attributes” really makes sense and to find a workable solution. Once the developer has identified the attributes and the tables they are stored in, he/she can go on and create the query/queries the report will be based upon. For this task it is important to know how the various tables relate to each other - in other words, which attributes can be used to link the tables, respectively which relational path links them - and the relations’ cardinality, which reflects how the number of records of a table or data set changes when it is joined with another table or data set.

Between two tables, and by extension two data sets, a relation can have any of four types of cardinality – many-to-many (represented as m:n), one-to-one (1:1), one-to-many (1:n) and the reverse many-to-one (n:1). A definition could be given for each cardinality type, though it’s easier to remember that, in the x-to-y compounds between two tables or data sets A and B, x has the value ‘one’ when each record from A is referenced by table B at most once (there could be records not referenced at all), and ‘many’ when at least one record could be referenced more than once; the same logic applies to y, but with the tables’ perspective reversed.
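As a minimal sketch, the cardinality on one side of a relation can be verified directly against the data; the tables A and B and the linking attribute AId used below are hypothetical:

-- the maximum number of references from B to a record of A
-- a result greater than 1 means 'many' on B's side, 1 (or 0) means 'one'
SELECT MAX(ReferenceCount) AS MaxReferences
FROM (
    SELECT AId
    , COUNT(*) AS ReferenceCount
    FROM B
    GROUP BY AId
) BTB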

The level of detail (LOD) of a report (aka data granularity) is directly correlated with the changes in cardinality - when a data set is added to an existing data set and a one-to-many or many-to-many cardinality is implied, the level of detail changes too. In other words, if a record in a given data set is referenced more than once in the newly added table, then the LOD changes. A simple example of a change of LOD occurs in parent-child (master-details) relations. For example, if the Invoice Lines are added to a data set based on Invoice Headers, then the LOD changes from Invoice Headers to Invoice Lines. If the Payments are added to the resulting data set, then the LOD changes from Invoice Lines to Payments only if there are multiple Payments for an Invoice (one-to-many or many-to-many cardinality); otherwise the LOD remains the same.
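To make the change of LOD visible, here is a minimal sketch based on hypothetical InvoiceHeaders and InvoiceLines tables; the join repeats each header record for each of its matching lines, lowering the LOD from Invoice Headers to Invoice Lines:

-- one record per Invoice Line, the header attributes repeated for each line
SELECT IVH.InvoiceId
, IVH.CustomerId
, IVH.InvoiceDate
, IVL.ProductId
, IVL.Quantity
, IVL.Amount
FROM InvoiceHeaders IVH
    JOIN InvoiceLines IVL
      ON IVH.InvoiceId = IVL.InvoiceId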

Summarizing, the level of detail of a report is the lowest level at which a cardinality change occurs at entity level. One can speak of a lower or higher LOD in relation to a given level of detail; for example, Invoice Payments have a lower LOD than Invoice Lines, while the Invoice Header has a higher LOD.

Why is it necessary to introduce the LOD, especially when it relies on a relatively complex notion like cardinality?! It is worth defining mainly because the term is used when defining and negotiating reporting requirements, its use being intuitive rather than well defined and understood. When creating reports it’s important to find the adequate LOD for a report and to know which methods can be used to pull data that would normally change the report’s LOD without actually changing it.

The limitations imposed by the need to report at a certain LOD, higher than the one implied by the raw data, can be overcome with the help of aggregate functions used within GROUP BY constructs. Such constructs make it possible to bring into a report data found at a lower LOD than the reporting LOD, by grouping the data based on a set of attributes. The aggregate functions provide functionality to calculate, for a set of values, the maximum, minimum, sum, count of records, standard deviation, and other statistical values; most of them work on numeric data types, while maximum/minimum and count of records work also with dates or alphanumeric data types.

For example, using aggregate functions the Invoice amounts can be aggregated and shown at Invoice Header level; in contrast, it’s not possible to do the same with quantities, because typically each Invoice Line refers to a specific Product, and as apples can’t be counted with peaches, the level of detail has to be kept at Invoice Line level in order to see the quantities. The same technique can be applied to similar parent-child (master-details) relations covering one-to-many or one-to-one cardinality, and also to many-to-many relations, which involve higher reporting complexity.
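As a sketch, assuming the same hypothetical InvoiceHeaders and InvoiceLines tables, the amounts can be summed up at Invoice Header level, while the quantities only make sense when grouped also by Product:

-- amounts aggregated at Invoice Header level
SELECT IVH.InvoiceId
, IVH.InvoiceDate
, SUM(IVL.Amount) AS TotalAmount
, COUNT(*) AS NumberLines
FROM InvoiceHeaders IVH
    JOIN InvoiceLines IVL
      ON IVH.InvoiceId = IVL.InvoiceId
GROUP BY IVH.InvoiceId
, IVH.InvoiceDate

-- quantities aggregated per Invoice and Product (the LOD stays at Invoice Line level)
SELECT IVL.InvoiceId
, IVL.ProductId
, SUM(IVL.Quantity) AS TotalQuantity
FROM InvoiceLines IVL
GROUP BY IVL.InvoiceId
, IVL.ProductId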

Direct many-to-many relations are seldom met, involving mainly data sets in which the attributes used in the relation appear more than once in each data set (for at least one value). Taking an example from the ERP world, there could be multiple Receipts and Invoices for the same Purchase Order; thus, if there is no direct match between Receipts and Invoices that identifies uniquely which Invoice belongs to which Receipt, a report involving all three entities could be barely usable, the records being duplicated (e.g. 2 Invoices vs. 3 Receipts result in 6 records in such a scenario). For such reports to be usable the LOD needs to be changed to a higher level on at least one side; thus, reconsidering the mentioned example, a report could show the Purchase Order details and aggregate the Invoice & Receipt Quantities/Amounts at Purchase Order level, or show detailed Invoices with aggregated Receipt information, respectively detailed Receipts with aggregated Invoice information.
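For illustration, a minimal sketch in which the Invoice and Receipt quantities are aggregated separately at Purchase Order level before being joined, avoiding the record multiplication implied by the many-to-many relation; the PurchaseOrders, Invoices and Receipts tables and their columns are hypothetical:

-- one record per Purchase Order, with aggregated Invoice and Receipt quantities
SELECT POR.PurchaseOrderId
, POR.OrderedQuantity
, INV.InvoicedQuantity
, RCT.ReceivedQuantity
FROM PurchaseOrders POR
    LEFT JOIN (
        SELECT PurchaseOrderId
        , SUM(Quantity) AS InvoicedQuantity
        FROM Invoices
        GROUP BY PurchaseOrderId
    ) INV
      ON POR.PurchaseOrderId = INV.PurchaseOrderId
    LEFT JOIN (
        SELECT PurchaseOrderId
        , SUM(Quantity) AS ReceivedQuantity
        FROM Receipts
        GROUP BY PurchaseOrderId
    ) RCT
      ON POR.PurchaseOrderId = RCT.PurchaseOrderId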

Unfortunately, even if the aggregate functions are quite handy, they have their own limitations, it being difficult to use them to answer questions like “what was the last Product sold to each Customer” or “which was the latest Invoice placed for each Customer”, at least not without considerable effort from the developer’s side. Fortunately, database vendors have implemented their own more specialized types of functions - analytic functions in Oracle and window functions in SQL Server - which allow modeling such questions within a report. The downside for developers is that each database vendor comes with its own philosophy, so techniques and features working in one database might not work in another.
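For example, a question like “which was the latest Invoice placed for each Customer” can be modeled with a ranking window function; a minimal sketch in SQL Server syntax, based on the hypothetical InvoiceHeaders table:

-- the latest Invoice per Customer
SELECT InvoiceId
, CustomerId
, InvoiceDate
FROM (
    SELECT InvoiceId
    , CustomerId
    , InvoiceDate
    , ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY InvoiceDate DESC) AS Ranking
    FROM InvoiceHeaders
) INV
WHERE Ranking = 1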

13 January 2010

🗄️Data Management: Data Quality Dimensions (Part III: Completeness)

Data Management
Data Management Series

Completeness refers to the extent to which data are missing from a dataset, fact reflected in the number of missing values, also referred to as empty (when an empty string or a default value is used) or 'Null' (aka unknown) values, and/or in the number of missing records.

The missing values are typically considered in respect to mandatory attributes, attributes that need a not-Null value for each record, though upon case the analysis might be applied to non-mandatory attributes (optional attributes) too, for example when the intent is to understand whether the attributes are adequately maintained or not. It’s interesting that [1] considers also the inapplicable attributes, referring to the attributes not applicable (relevant) for certain scenarios (e.g. physical dimensions for service-based materials), which together with the applicable (relevant) attributes can be considered as another type of categorization for attributes. Whether an attribute is mandatory is decided upon the business context and not necessarily upon the physical structure containing the attribute; in other words, an attribute could be optional as per the database schema and mandatory per the business rules.
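A minimal sketch for quantifying the missing values, assuming a hypothetical Customers table in which Name and CountryCode are mandatory per the business rules:

-- number of records with missing values per mandatory attribute
SELECT COUNT(*) AS TotalRecords
, SUM(CASE WHEN Name IS NULL OR Name = '' THEN 1 ELSE 0 END) AS MissingName
, SUM(CASE WHEN CountryCode IS NULL THEN 1 ELSE 0 END) AS MissingCountryCode
FROM Customers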

'Missing records' can be a misleading term because it is used in several contexts; however, within the data completeness context it refers only to the cases not covered by data integrity. For example, in parent-child table relations the header data were entered though the detail data are missing, either not entered or deleted; such a case is not covered by referential integrity because there is no missing reference, but just a parent without child data (1:n cardinality).
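Such cases can be identified with an outer join between the parent and the child table; a minimal sketch based on the hypothetical InvoiceHeaders and InvoiceLines tables:

-- Invoice Headers without any Invoice Lines (missing child records)
SELECT IVH.InvoiceId
, IVH.InvoiceDate
FROM InvoiceHeaders IVH
    LEFT JOIN InvoiceLines IVL
      ON IVH.InvoiceId = IVL.InvoiceId
WHERE IVL.InvoiceId IS NULL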

A mixed example occurs when the same entity is split across several tables at the same level of detail. One of the tables must function as a parent, the case falling under the previously mentioned example (1:1 cardinality). In such a scenario it depends how one reports the nonconformances per record: (1) the error is counted only once, independently of how many dimensions raised an error; (2) the error is counted for each dimension. For (2), when the referential integrity fails, an error is raised also for each mandatory attribute.

Both examples deal with explicit data referents – the 'parent' data – though there are cases in which the referents are implicit, for example when the data are not available for a certain time interval (e.g. period, day) even if needed; though also in this case the referents can be made explicit, the case falling under the previously mentioned examples. In such scenarios all the attributes corresponding to the missing records will be null.
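One way of making such referents explicit is to join the data against a calendar (date spine); a minimal sketch, assuming a hypothetical Calendar table with one record per day and a hypothetical Transactions table:

-- days within the interval for which no transactions were recorded
SELECT CAL.CalendarDate
FROM Calendar CAL
    LEFT JOIN Transactions TRS
      ON CAL.CalendarDate = TRS.TransactionDate
WHERE TRS.TransactionDate IS NULL
  AND CAL.CalendarDate BETWEEN '2010-01-01' AND '2010-01-31'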

Normally the completeness of parent-child relations is enforced with the help of referential integrity and database transactions - a set of actions performed as a single unit of work - which allow saving the parent data only if the child data were saved successfully, though such constraints are not always necessary.
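A minimal sketch of such a unit of work in SQL Server syntax, the InvoiceHeaders and InvoiceLines tables and the inserted values being hypothetical:

-- parent and child data saved as a single unit of work; a failure rolls back both inserts
BEGIN TRY
    BEGIN TRANSACTION
    INSERT INTO InvoiceHeaders (InvoiceId, CustomerId, InvoiceDate)
    VALUES (1001, 25, '2010-01-13')
    INSERT INTO InvoiceLines (InvoiceLineId, InvoiceId, ProductId, Quantity, Amount)
    VALUES (5001, 1001, 330, 2, 100)
    COMMIT TRANSACTION
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION
END CATCH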

Data should be cleaned, when feasible, in the source system(s), and this applies to incomplete data as well. It might be feasible to clean the values in Excel files or similar tools, by exporting and then reimporting the cleaned values back into the respective systems.

In data migrations or similar scenarios, completeness in particular and data quality in general must be judged against the target system(s), and thus the dataset must be enriched in an intermediate layer as needed. Upon case, one can consider using default values, though this sounds like technical debt, unlikely to be addressed later. Moreover, one should prioritize the effort and consider first the attributes which are needed for the good functioning of the target system(s).

Ideally, data validation techniques should be applied for the mandatory fields in the source systems, when feasible.


Written: Jan-2010, Last Reviewed: Mar-2024 

References:
[1] David Loshin (2009) "Master Data Management"

03 July 2009

🛢DBMS: Cardinality (Definitions)

 "The classification of a relationship; for example, one-to-many, many-to-many, and so on." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"The number of tuples (rows) in a relationship. For example, a relationship can be one-to-one, one-to-many, or many-to-many." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"The number of unique values for a given column in a relational table. Low cardinality refers to a limited number of values, relative to the overall number of rows in the table." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit 2nd Ed ", 2002)

"Cardinality denotes the maximum number of occurrences of one entity that can be related to another entity. Usually, these are expressed as “one” or “many.” Change Data Capture Change data capture is a technique for propagating only changes to source data through the data acquisition process." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"The number of distinct values in a column of a table." (Bob Bryla, "Oracle Database Foundations", 2004)

"The cardinality of a relationship represents the number of occurrences between entities. An entity with a cardinality of one is called a parent entity, and an entity with a cardinality of one or more is called a child entity." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"The number of distinct values taken on by an attribute." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"The number of tuples in a relation." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"A representation of the minimum and maximum allowed number of values for an attribute. In semantic object models, written as L.U where L and U are the lower and upper bounds. For example, 1.10 means an attribute must occur between 1 and 10 times." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A relationship in a data model denoting how many instances of one entity class can be related to an instance of another entity class - zero, one, or many." (Danette McGilvray, "Executing Data Quality Projects", 2008)

"The measure of the number of elements within a set of values. For example, the set A = { 2, 4, 6 } contains 3 elements, and has a cardinality of 3." (MongoDb, "Glossary", 2008)

"In relationships, the characteristic  of a relationship that specifies the upper and lower bounds of how many instances of one entity or object type can be related to each instance of the same or some other entity or object type. Cardinality is separately specified at each end of the relationship. At each end the choices are 0, 1, or M. Combining the cardinality at both ends of a binary relationship, yields 3 x 9 - 1 = 8 possibilities (0:0 is not a valid option)." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The number of entities or members in a set." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"The number of entities that can exist on each side of a relationship." (Microsoft, "SQL Server 2012 Glossary", 2012)

"The number of occurrences that may exist between a pair of entities. Another way of looking at cardinality is as the number of entity occurrences applicable to a specific relationship. Sometimes the term degree is used instead of cardinality. An alternate usage of the term cardinality within the realm of database administration is a database statistic used by the relational optimizer defining the number of occurrences of a value within a column (or set of columns)." (Craig S Mullins, "Database Administration", 2012)

"The number of rows that is expected to be or is returned by an operation in an execution plan. Data has low cardinality when the number of distinct values in a column is low in relation to the total number of rows." (Oracle, "Database SQL Tuning Guide Glossary", 2013)

"The number of occurrences of two units of data that participate in a relationship" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The cardinality of a relationship is the number of instances that can be associated with each entity type in a relationship." (Robert J Glushko, "The Discipline of Organizing: Professional Edition, 4th Ed", 2016)

"The number of rows in a database table or the number of elements in an array. See also associative array." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

