SQL Troubles

13 January 2010

🗄️Data Management: Data Quality Dimensions (Part III: Completeness)

Data Management Series

Completeness refers to the extent to which there are missing data in a dataset, fact reflected in the number of the missing values, also referred as empty (when an empty string or default values is used) or 'Nulls' (aka unknown values), and/or in the number of missing records.

The missing values are typically considered in report to mandatory attributes, attributes that need a not-Null value for each record, though after case might be applied to non-mandatory attributes (optional attributes) too, for example when is intended to understand whether the attributes are adequately maintained or not. It’s interesting that [1] considers also the inapplicable attributes referring to the attributes not applicable (relevant) for certain scenarios (e.g. physical dimensions for service-based materials), which together with the applicable attributes (relevant) can be considered as another type of categorization for attributes. Whether an attribute is mandatory is decided upon business context and not necessarily upon the physical structure containing the attribute, in other words an attribute could be optional as per database schema and mandatory per business rules.

'Missing records' can be a misleading term because is used in several contexts, however within data completeness context it refers only to the cases not covered by data integrity. For example in parent-child table relations the header data was entered though the detail data is missing, either not entered or deleted; such a case is not covered by referential integrity because there is no missing reference, but just the parent without child data (1:n cardinality).

A mixed example occurs when the same entity is split across several tables at the same level of detail. One of the tables must function as a parent, falling in the previous mentioned example (1:1 cardinality). In such a scenario it depends how one reports the nonconformances per record: (1) the error is counted only once, independently on how many dimensions an error was raised; (2) the error is counted for each dimension. For (2) when the referential integrity failed, an error is raised also for each mandatory attribute.

Both examples are dealing with explicit data referents – the 'parent' data, though there are cases in which the referents are implicit, for example when the data are not available for a certain time interval (e.g. period, day) even if needed, though also for this case the referents could be made explicit, falling in the previous mentioned examples. In such scenarios all the attributes corresponding to the missing records will be null.

Normally the completeness of parent-child relations is enforced with the help of referential integrity and database transactions, a set of actions performed as a single unit of work, they allow saving the parent data only if the child data were saved successfully, though such type of constraints is not always necessary.

Data should be cleaned when feasible in the source system(s) and this applies to incomplete data as well. It might be feasible to clean the values in Excel files or similar tools, by exporting and then reimporting the clean values back into the respective systems.

In data migrations or similar scenarios, the completeness in particular and data quality in general must be judged against the target system(s) and thus the dataset must be enriched in an intermediate layer as needed. Upon case, one can consider using default values, though this sounds like a technical debt, likely improbably to be addressed later. Moreover, one should prioritize the effort and consider first the attributes which are needed for the good functioning of the target system(s).

Ideally, for the mandatory fields should be applied data validation techniques in the source systems, when feasible.

Previous Post <<||>> Next Post

Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] David Loshin (2009) "Master Data Management"

🗄️Data Management: Data Quality Dimensions (Part II: Conformity)

Data Management Series

Conformity or format compliance, as named by [1], refers to the extent data are in the expected format, each attribute being associated with a set of metadata like type (e.g. text, numeric, alphanumeric, positive), length, precision, scale, or any other formatting patterns (e.g. phone number, decimal and digit grouping symbols).

Because distinct decimal, digit grouping, negative sign and currency symbols can be used to represent numeric values, same as different date formats could be used alternatively (e.g. dd-mm-yyyy vs. mm-dd-yyyy), the numeric and date data types are highly sensitive to local computer and general applications settings because the same attribute could be stored, processed and represented in different formats. Therefore, it’s preferable to minimize the variations in formatting by applying the same format to all attributes having the same data type and, whenever is possible, the format should not be confusing.

For example all the dates in a data set or in a set of data sets being object of the same global context (e.g. data migration, reporting) should have the same format, being preferred a format of type dd-mon-yyyy which, ignoring the different values the month could have for different language settings, it lets no space for interpretations (e.g. 01-10-2009 vs. 10-01-2009). There are also situations in which the constraints imposed by the various applications used restraints the flexibility of working adequately with the local computer formats.

If for decimal and dates there are a limited number of possibilities that can be dealt with, for alphanumeric values things change drastically because excepting the format masks that could be used during data entry, the adherence to a format depends entirely on the Users and whether they applied the formatting standards defined. In the absence of standards, Users might come with their own encoding, and even then, they might change it over time.

The use of different encodings could be also required by the standards within a specific country, organization, or other type of such entity. All these together makes from alphanumeric attributes the most often candidate for data cleaning, and the business rules used can be quite complex, needing to handle each specific case. For example, the VAT code could have different length from country to country, and more than one encoding could be used reflecting the changes in formatting policy.

In what concerns the format, the alphanumeric attributes offer greater flexibility than the decimal and date attributes, and their formatting could be in theory ignored unless they are further parsed by other applications. However, considering that such needs change over time, it’s advisable to standardize the various formats used within an organization and use 'standard' delimiters for formatting the various chunks of data with a particular meaning within an alphanumeric attribute, fact that could reduce considerably the volume of overwork needed in order to cleanse the data for further processing. An encoding could be done without the use of delimiters, e.g. when the length of each chunk of data is the same, though chunk length-based formatting could prove to be limited when the length of a chunk changes.

Note:
Delimiters should be chosen from the characters that will never be used in the actual chunks of data or in the various applications dealing with the respective data. For example pipe (“|”) or semicolon (“;”) could be good candidates for such a delimiter though they are often used as delimiters when exporting the data to text files, therefore it’s better to use a dash (“-”) or even a combinations of characters (e.g. “.-.”) when a dash is not enough, while in some cases even a space or a dot could be used as delimiter.

Previous Post <<||>> Next Post

Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] David Loshin (2009) "Master Data Management", Morgan Kaufmann OMG Press. ISBN 978-0-12-374225-4.

🗄️Data Management: Data Quality Dimensions (Part I: Uniqueness)

Data Management Series

Uniqueness refers to "requirements that entities modeled within the master environment are captured, represented, and referenced uniquely within the relevant application architectures" [1]. An alternative name is the one of duplicates, which stresses the existence of duplicate records within a dataset, the not-uniqueness, being a better indicator for the nonconformance especially when considering datasets.

Why is required to enforce the uniqueness of entities? An entity is defined using a number of attributes representing entity’s characteristics, in case the attributes of two entities have the same values, then more likely the two representations refer to the same entity. This could happen in most of the cases, though there are situations in which the attribute(s) that make(s) it possible to differentiate between two distinct entities is/are not adequately maintained or not considered at all. The impossibility of identifying uniquely an entity increases the chances of using one of the respective entities wrongly, for example booking the Invoice against the wrong Vendor and all the implications derived from it.

For each type of entity there can be one or more attributes that allow identifying it uniquely, for example in case of a Vendor could be Vendor’s name and address. The more such attributes then more difficult becomes the identification of a Vendor; therefore even if such a set of attributes exists, like in Vendor’s case, it’s preferable to use instead a unique identifier, a numeric or alphanumeric key that identifies uniquely an entity.

A Vendor could be uniquely identified by the Vendor ID, though it allows unique identification of a Vendor only in a data repository, the chances being quite high to have another Vendor ID for the same entity in another data repository. Therefore, in order to guarantee the uniqueness of entities is prefer to use instead an attribute that has the same value indifferently from the data repository the entity is stored in (e.g. Vendor Number, Item Number, Customer Number, Asset Number, Sales Order Number, etc.). Such attributes could be enforced to be unique across a set of data repositories, though one of them must function as 'master'.

Multiple identifiers for the same entity may exist, though this can easily create confusion, especially when this happens within the same system and people or machines are not aware that the respective identifiers refer to the same entity, and the more identifiers we have for the same entity the higher the chances of creating confusion. Imagine that in the same system you book some of the Invoices against one of the identifiers, and the remaining Invoices against another identifier of the same entity.

Especially in reports this might be quite a problem as amounts that should appear for the same entity are split against two references, and even if they refer to the same entity report’s users might be not aware of it, the calculations based on such numbers not reflecting the reality. Imagine that you have booked the invoices against two references to the same Vendor, and you want to consider the first 10 Vendors with the highest volume spent; it might happen that the aggregate amounts for each of the respective references didn’t make it in the top 10, though when considered together they can even make it to the first position.

There are even situations in which the use of multiple identifiers for the same entity is inevitable, and important are here the cases in which more than one different system refer to the same entity in their unique way, often being not designed to use a global unique identifier for an entity. Under such circumstances what an organization could do is to either extend the system(s) to store the global identifier or have a mapping in place with the referents used to indicate the same entity.

The more systems within an enterprise that use/represent the same entity, the more complex is to manage the correct referencing; the logical thing to do is to have in place a master system that stores the global identifier and integrate it with the other systems in place or at least store the global identifier in each of the systems. Such identifiers used for master or even transactional data allow creating consolidated reports based on several systems and mitigating possible issues existing in the data.

Even if such keys are designed to enforce uniqueness this doesn’t mean it necessarily happens; the reason is simple - behind the assignment of a key to an entity there is a process defined, and no matter how robust a process was created if it doesn’t cover all the scenarios that might occur or if it can’t deal (adequately) with various constraints, then in one point in time a flow might lead to a duplicate, a not unique record. Take for example a Product Number's creation - supposing that a Product Number XYZ is created in an ERP system and one or more of the respective Product’s attributes are incorrect. In many cases such errors can be corrected, though each system comes with its own constraints not allowing changing certain attributes (e.g. the Unit of Measure once Inventory was booked). Because an entity with the wrong attributes is inadequate to be used, a new Product Number is created for the same entity, the two identifiers coexisting for some time. (Some systems allow to build a relationship between the Products, showing that one Product replaces another going forward.)

Some systems could allow the deletion of the first entered Product though that’s not always possible. Even if the two Products Numbers are different, they point to the same Product entity, and even if the first entered Product is marked as obsolete, until this happens transactions might be already made with it, and again, it might be possible to revert the transactions back, though that’s not always possible, and that’s not all, once a Product is marked as obsolete, this doesn’t necessarily imply that a User won’t reactivate the Product in the near or far future. As can be seen, the scenarios are quite complex, not all the systems and processes being designed to handle them. If such deviations from uniqueness can’t be corrected in the legacy systems, at least during conversion and migration should be corrected.

Previous Post <<||>> Next Post

Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] David Loshin (2009) "Master Data Management"

🗄️Data Management: Data Quality (Part II: An Introduction)

Introduction

From time to time, I found myself in a meeting with people blaming the poor quality of data for the strange numbers appearing in a report, and often they were right. Sometimes you can see that without going too deep into details, other times you have to spend some considerable amount of time and effort and look over the raw data to identify the records that cause the unexpected variations. We know we have a problem with data quality, though why nobody is doing anything to improve it? Who’s responsible in the end for data quality? How data quality issues appear, how the data quality can be improved and by what means? When and where the data quality must be considered? Why do we have to consider data quality and in definitive what is data quality? Which are data quality’s characteristics or dimensions?

Without pretending to be an expert on data quality but using my and others’ experience and understanding in dealing with data quality, I will try to approach the above questions in several posts. And as usual, before answering more complex questions, inevitably we have to define first what data quality is.

What is Data Quality?

In several consulted sources, mainly books and blogs, data quality’s definition revolve around the 'fitness for use' syntagma (e.g. [1], [3], [4]). The definition makes data quality highly dependent on the context in which it’s considered because each request comes with different data quality expectations/requirements and data quality levels. Even more, the data could have acceptable quality in the legacy system(s), the system(s) used to store the data, though the data might not be appropriate for use in the current form. From this observation can be delimited two aspects of data quality – intrinsic data quality considering the creation and storage of data, and secondly contextual data quality referring to the degree the data meets users’ needs.

Except these two aspects [2] considers two other aspects: the representation data quality referring to the "degree of care taken in the presentation and organization of information for users" and the accessibility data quality - "degree of freedom that users have to use data, define or refine the manner in which information is input, processed or presented to them" [2]. These four aspects provide a good framework for categorizing data quality and as can be seen, it targets the four coordinates of data – storage/creation, presentation, use and accessibility – however it ignores the fact data could be consumed also by machines, they needing to parse, understand and process the data. Therefore, could be added a fifth aspect – processing data quality, referring to the data readiness for consumption by machines, this aspect including the existence of metadata (e.g. data models, data mappings, etc.).

Why Data Quality?

Considering the growing dynamics and complexity of economic landscape, nowadays organization’s capacity of staying in business is highly dependent on decision making process’ accuracy, which at its turn is dependent on the reports’ quality used as support in decisions, and further on quality of data the reports are based on. If you have wrong data, you can’t expect the reports to show the correct numbers, same as you can’t expect a decision to be correct without having the actual situation correctly reflected even if occasionally the person taking the decision has a sound understanding of the business above the data level, or a good gut feeling. It is essential for an organization to have adequate data and reports that reflect the current state of art at any time and at the requested level of detail.

In this context is discussed about poor data quality, good data quality, high data quality, or even adequate/inadequate data quality. Paraphrasing data quality’s definition could be said that data are of high quality if they are 'fit for use' [1], and have poor data quality otherwise, though in life things are not black or white, but exist many other nuances in between, dependent on user’s perception and needs. For example, could be sufficient in one million records to have only one value that makes it impossible for a data set to be processed by a software package, and even if the data quality equates with a Six Sigma level, it’s not enough for this case. At the opposite side, a data set with a huge number of defects could be processed without problems and the outcome could be close to what it’s expected, now depends also on the nature of defects, which are the attributes affected and how they affect the result.

Poor data quality reduces considerably the value of data, impeding organizations to exploit the data at their full potential, a fact that limits their strategic competitive advantage. Poor data quality has a domino effect, often involving rework, delays in production, operations and decision making, a decrease in customers’ satisfaction reflected in revenues and lost opportunities, and so on. Unfortunately, the costs of poor data quality are difficult to quantify, they include the cost for finding and fixing the errors, the cost of wasted effort, lost production, and lost opportunity, plus the ones involved with wrong decision making. The same issue affects the costs involved in achieving high data quality levels, the easiest to quantify being the costs directly involved with the data quality initiatives. It's not in scope to detail these aspects, though is worth mentioning them to highlight why many organizations ignore the quality of their data.

How do Data Quality issues appear?

No system is bullet proof, no matter how good it was designed there will always be a chance for errors to appear, even if it’s the software itself, the people, the processes, standards, or procedures. A primary reason for data issues is caused by the weak validation implemented in the systems, either in the UI or in any other data entry points, by the errors that escaped the vigilant eyes of testers, by the (lack of) flexibility/robustness of applications, data models in general and data types in particular. On a secondary plane but not less important is the existence and adherence to processes, standards, or procedures, of whether they are known and respected by users, or whether they cover all the scenarios occurring in the daily business activity.

Who’s responsible for Data Quality?

As everybody in an organization is responsible for quality, the same applies also for data quality, from data entry to decision making everybody’s working with data directly or indirectly, and they are also theoretically responsible for it. From data life-cycle perspective there are the people who are entering the data in the various systems, the ones who are preparing the data for further usage, the ones using the data, and the ones managing the data quality initiatives (e.g. quality leaders, black belts, champions). The first category needs to take the ownership over the data, maintain them adequately, stick to the processes, standards and procedures defined, and together with the other users are responsible for identifying the issues in data, mitigate them and find solutions.

The data users are maybe the broadest category, ranging from simple employees using the data for daily activities to executives. Because often the data they are working with are aggregated at various levels, unless they have a good overview of the business or flexible reports that gives them access also to the row data it will be almost impossible for them to realize the actual data quality. Often the people preparing the data are the ones discovering the issues in data, though this happens by hazard, observing abnormalities in the data.

The people managing the data quality initiatives are hardly working with the data, though together with the other type of executives they are responsible for modeling an organization and creating a (data) quality culture.

How can be the Data Quality improved and by what means?

It takes time, resources, consolidated effort, adequate policies, and management’s involvement to reach an acceptable level of data quality. As already stressed data quality can be improved by creating robust processes, standards, or procedures, and sticking to them, creating a culture for quality in general and for data quality in particular, but rather they are the outcome of data quality initiatives.

The first step is recognizing that data quality in the organization as a problem and secondly identifying the nature of the issues eventually by conducting a data quality assessment, though this involve the definition of data quality (business) rules, rules against the quality of data will be checked, establishing an assessment, cleaning, and reporting framework for this purpose. Because improvement is the word of the day, could be conducted Six Sigma projects to identify issues’ root causes and the solutions to address them, monitor issues resolution and whether the found solutions fulfill the intended goals. The use of (Lean) Six Sigma methodology allows introducing quality concepts in the organization, making people aware of the premises and implications of data quality, and in addition it offers a receipt for success.

Depending on the volume of data and the number of issues found, data quality improvement might occur within several iterations, and when the volume of issues is more than organization can handle, then it makes sense to prioritize and fix the issues that impact the organization the most, following to address the remaining issues in time. There will also be situations in which is needed to redesign the existing processes, modify the legacy systems to limit the data entry issues or even start new projects in order fill the gaps.

When must be Data Quality considered?

Data quality comes in discussion especially when the data related issues impacts visibly the business, whether something is done to correct them or not that’s something dependent on whether an organization has in place a data quality vision, policy and adequate mechanisms for assessing and addressing the data quality on a regular basis, or whether the people or departments impacted by the issues take the problem in their hands and do something about it.

Another context in which data quality topic appears is during migration, conversion, and integration between systems, when the system(s) into which the data are supposed to be loaded, the destination system(s), come(s) with a set of rules/constraints requesting the data to be in a certain format/formatting, match specific data types and other data storage related constraints. Even if the data from the legacy system(s) are considered of high quality, in the mentioned activities the data quality is considered against the constraints imposed by the destination system.

What needs to be stressed is that data quality is not one time endeavor but a continuous and consolidated effort that could span along the life cycle of the systems storing the data. How often data quality assessments must be conducted? It depends on data growth and perceived quality, organizations’ data quality policy/vision and the degree they are respected, organization overall maturity and culture, availability of resources and the different usage of data.

Where is Data Quality considered?

Data quality is generally considered against the legacy systems the data following to be cleaned in them, though when data quality is judged against external requests, the data could be cleaned in Excel files or in any other environments designed for this kind of tasks, each scenario having its pluses and minuses, sometimes being preferred a hybrid solution between the two alternatives. Most of the time it makes sense to clean the data in the legacy system because in general it minimizes the overall effort, the data are “actual” and reflected in the other systems using the data.

Excel or any other similar tools could prove to be better environments for cleaning the data for one time requests, when is needed to do fast mass updates or is not intended/possible to modify the legacy data, however the probability of making mistakes is quite high because the validation implemented in the legacy systems are not available anymore or must be enforced, multiple copies of the same data could exist in case the data needs to be clean by multiple users, while the synchronization between legacy data and cleaned data could be problematic.

There are situations in which the data are not available in any of the systems available in place and there is no adequate possibility to store the new data in them, thus being requested eventually to extend one or more systems (aka data enrichment), solution that could be quite expensive, an alternative being to store the data in ad-hoc solutions (e.g. Excel, MS Access), following to merge the existing data islands (aka data integration) before loading them into the destination system(s). When dealing with a relatively small number of records the data could be entered manually in the destination system itself, though there are all kind of constraints that need to be considered, therefore it is safer and more appropriate to load the data as contiguous sets.

Previous Post <<||>> Next Post

Written: Dec-2009, Last Reviewed: Mar-2024

References:
[1] Joseph M Juran & A Blanton Godfrey (1999) Juran’s Quality Handbook, 5th Ed.
[2] Coral Calero (2008). Handbook of Research on Web Information Systems Quality
[3] US Census Bureau (2006) Definition of Data Quality: Census Bureau Principle, version 1.3. [Online] Available from: http://www.census.gov/quality/P01-0_v1.3_Definition_of_Quality.pdf (Accessed: 24 December 2009).
[4] OCDQ (Obsessive-Compulsive Data Quality) Blog (2009) The Data-Information Continuum. [Online] Available from: http://www.ocdqblog.com/home/the-data-information-continuum.html (Accessed: 24 December 2009)

🕋Data Warehousing: Dimension Table (Definitions)

"A table in a data warehouse whose entries describe data in a fact table. Dimension tables present business entities." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A table in a data warehouse whose entries describe data in a fact table. Dimension tables present business entities. A database object stored in a data warehouse containing information used to reference the data stored in a fact table." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"The relational database table that contains information about each member of a dimension, such as its name, as well as other specific characteristics of each member." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A table in a star schema design that contains dimensional attributes and a surrogate key." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A table that contains the data from which dimensions are created." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

"In the dimensional data model, each dimension table contains the attributes of a single business dimension. Product, store, salesperson, and promotional campaign are examples of business dimensions along which business measurements or facts are analyzed." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A table in a data warehouse whose entries describe data in a fact table." (SQL Server 2012 Glossary, "Microsoft", 2012)

"The representation of a dimension in a star schema. Each row in a dimension table represents all of the attributes for a particular member of the dimension. See also star join, star schema." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"The smaller tables used in a data warehouse to denote the attributes of a particular dimension, such as time, location, customer characteristics, product characteristics, etc." (Toby J Teorey, ", Database Modeling and Design 4th Ed", 2010)

10 January 2010

🕋Data Warehousing: Dimension Table (Definitions)

"A table in a star schema design that contains dimensional attributes and a surrogate key." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A table that contains the data from which dimensions are created." (Jim Joseph et al, "Microsoft® SQL Server™ 2008 Reporting Services Unleashed", 2009)

"A table in a data warehouse whose entries describe data in a fact table." (SQL Server 2012 Glossary, "Microsoft", 2012)

🕋Data Warehousing: Dimension (Definitions)

"A structural attribute of a cube, which is an organized hierarchy of categories (levels) that describe data in the fact table. These categories typically describe a similar set of members upon which the user wants to base an analysis. For example, a geography dimension might include levels for Country, Region, State or Province, and City. See also level; measure." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A list of labels that can be used to cross-tabulate values from other dimensions." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A shorthand term that is used to refer to a dimension table or a dimension attribute." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A group of related objects within a cube that's used to provide information about related data. For example, a product dimension could include a product name, a product category, a product size, product cost, and product price." (Robert D. Schneider and Darril Gibson, "Microsoft SQL Server 2008 All-In-One Desk Reference For Dummies", 2008)

"A structural attribute of a cube upon which the user wants to base an analysis (for example, geography dimension). Dimension describes data in a fact table." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

"Major business categories of information or groupings to describe business data. Dimensions contain information used for constraining queries, report headings, and defining drill paths. Within a dimension, specific attributes are the data elements that are used as row and column headers on reports. Dimensional attributes are also considered to be reference data. When describing the need to report information by region, by week, and by month, the attributes following ''by'' are dimensions. Each of these would be included in a dimension." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"An aspect of data that provides a way to divide it in an OLAP database (for example, a carmaker's OLAP database may organize product data by the dimensions of model, body style, engine type, and price point)." (Ken Withee, "Microsoft® Business Intelligence For Dummies®", 2010)

"A slice of data used in analysis and reporting. For example, in a report that shows sales by customer and product for the year ending December 2009, "customer," "product," and "time" would be the dimensions used." (Janice M Roehl-Anderson, "IT Best Practices for Financial Managers", 2010)

"In a data warehouse, a data element that categorizes each item in a data set into nonoverlapping regions." (Craig S Mullins, "Database Administration", 2012)

"In multidimensional data, a structural attribute of a cube that organizes data to enable in-depth business analysis." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

🕋Data Warehousing: Fact Table (Definitions)

"A central table in a data warehouse that contains numerical measures and keys relating facts to a dimension table. Fact tables contain data that describes a specific event within a business, such as a bank transaction or product sale. See also data warehouse; dimension table; snowflake schema; star join; star schema." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"The relational database table that contains values for one or more measures at the lowest level of detail for one or more dimensions." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A table in a star schema that contains facts and foreign keys that reference dimension tables. Fact tables may also contain degenerate dimensions." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A central table in a data warehouse schema that contains numeric measures and keys relating facts to dimension tables. Fact tables contain data that describes specific events within a business, such as bank transactions or product sales." (Jim Joseph et al, "Microsoft® SQL Server™ 2008 Reporting Services Unleashed", 2009)

"In the dimensional data model, the middle table that contains the facts or metrics of the business as attributes in the table. Sales units, sales dollars, costs, and profit margin are examples of business metrics that are analyzed." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"The dominating table in a data warehouse and its star schema, containing dimension attributes and data measures at the individual data level." (Toby J Teorey, ", Database Modeling and Design 4th Ed", 2010)

"A central table in a dimensional model that contains numerical measures and key facts relating to dimension tables." (Craig S Mullins, "Database Administration: The Complete Guide to DBA Practices and Procedures 2nd Ed", 2012)

"The data structure where basic facts in a star join are stored" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"A relational table that contains facts, such as units sold or cost of goods, and foreign keys that link the fact table to each dimension table. See also star join." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

31 December 2009

🛢DBMS: Database Design (Just the Quotes)

"The database is the heart of our information processing systems and not simply a by-product of processing. […] A database that has been properly structured and organized, and which actually represents and contains that data of an enterprise, will stabilize the information processing systems. [...] The use of data is reckoning, the generation of information which will in turn be used to manage the enterprise. The generation of information is the only reason for having a database." (John K Lyon, "What is a Data Base", Bulletin of ACM SIGFIDET 5(1) , 1973)

"A database has structure which exists in three forms. It has a perceived structure, which is the way that each of us looks at the database. […] The physical structure of the database relates to the manner in which data is physically represented within the database in terms of sort sequence, record and field sizes, and its existence on one or more master files. […] The logical structure describes the way things really exist within the environment. The entities, i.e., the things within the enterprise, really exist; they have properties which distinguish them from all the other things within the business; and the relationships which exist among the entities." (John K Lyon, "What is a Data Base", Bulletin of ACM SIGFIDET 5(1) , 1973)

"Two of the most difficult areas of data-base management are the design of an information structure and the reduction of that structure to a data structure which is compatible with and managed by the DBMS. […] Data-base management systems are tools to be applied by the users of these systems to build an accurate and useful model of their organization and its information needs. To accomplish this, the information structure must accurately define and characterize the items of data and the relations among them that are of interest to the users. This is no small task, for it demands a knowledge of the organization and the distribution of information among its various parts." (Robert W Taylor & Randall L Frank, "CODASYL Data-Base Management Systems", 1976)

"A variety of reasons make careful database design essential. These include data redundancy, application performance, data independence, data security, and ease of programming. All are important factors in the data processing environment, and all can be adversely affected by a poor database design." (Mark L Gillenson, "Database: Step by Step", 1985)

"Database design refers to the process of organizing the data fields needed by one or more applications into an organized structure. This structure must foster the required relationships among the fields while conforming to the physical constraints of the particular database management system in use. There are two parts to the process that people usually associate with the term 'database design'. One is 'logical database design' which has two components. The first involves organizing the data fields into nonredundant groupings based on the data relationships. The second involves an initial organizing of those logical groupings into structures based on the nature of the DBMS and of the applications that will use the data. […] The second part of database design is 'physical database design, which refers to refitting the previously described derived structures to conform to the performance and operational idiosyncrasies of the DBMS, again guided by the application's processing requirements." (Mark L Gillenson, "Database: Step by Step", 1985)

"The conceptual model is the focal point of the process of database design. All activities either converge upon or emanate from the conceptual model. All structures, through mappings either into or out of the conceptual model, must be to some extent compatible with it." (Andrew Pletch, "Conceptual Modeling in the Classroom", SIGMOD Record 18(1), 1989)

"Because one has to be an optimist to begin an ambitious project, it is not surprising that underestimation of completion time is the norm." (Fernando J Corbató, "On Building Systems That Will Fail", 1991)

"It is important to emphasize the value of simplicity and elegance, for complexity has a way of compounding difficulties and as we have seen, creating mistakes. My definition of elegance is the achievement of a given functionality with a minimum of mechanism and a maximum of clarity." (Fernando J Corbató, "On Building Systems That Will Fail", 1991)

"A database is basically a collection of related facts (e.g. a company's personnel records, or a bus timetable). Discovering the essential kinds of facts that underlie an application, and the conditions that apply to them, is both interesting and illuminating. The quality of the database design used for these facts and conditions is important. Just as a house built from a good architectural plan is more likely to be safe and convenient for living, a well-designed database simplifies the task of ensuring that its facts are correct and easy to get at." (Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"If an application requires maintenance and retrieval of large amounts of data, a DBMS offers many advantages over manual record-keeping systems. Operations on data may often be performed faster. Data may be stored compactly on disk, and redundancy may be reduced by data sharing. Many data errors can be avoided by automatic integrity checking. With multi-user systems, access rights to data can be enforced by the system. People can spend more time on creative design rather than on routine tasks more suited to computers."(Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"No matter how sophisticated the information system, if we give it the wrong picture of our UoD to start with, we can't expect to get much sense out of it. This is one aspect of the gigo (Garbage In Garbage Out) principle. Most of the problems with many database applications can be traced to bad database design." (Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"When we design a database for a particular application, we create a model of it. Technically, the application area that we are modeling is called the universe of discourse (UoD), since it is the world (or universe) that we are interested in talking (or discoursing) about. Typically the UoD is a 'part' of the 'real world'. To build a good model requires a good understanding of the world we are modeling, and hence is a task ideally suited to people rather than machines. The main challenge is to describe the UoD clearly and precisely. Great care is required here, since errors introduced at this stage filter through to later stages in software development, and the later the errors are detected the more expensive they are to remove." (Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"A bad logical design means that a good physical design cannot be performed. Good logical design is crucial to good database performance, and a bad logical design will result in a physical design that attempts to cover up the weaknesses in it. A bad logical design is hard to change, and once the system is implemented it will be almost impossible to do so." (Ken England, "Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook", 2001)

"[…] physical database design is not a static, one-off process. Once the database has gone into production, the user requirements are likely to change. Even if they do not, the database data is likely to be volatile, and tables are likely to grow." (Ken England, "Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook", 2001)

"The physical design process is a key phase in the overall design process. It is too often ignored until the last minute in the vain hope that performance will be satisfactory. Without a good physical design, performance is rarely satisfactory and throwing hardware at the problem is rarely completely effective. There is no substitute for a good physical design, and the time and effort spent in the physical design process will be rewarded with an efficient and well-tuned database, not to mention happy users!" (Ken England, "Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook", 2001)

"There is no right way to design a database; there are a number of possible approaches and all these may be perfectly valid. It is sometimes said that performance tuning is an art, not a science. This may be true, but it is important to undertake performance tuning experiments with the same kind of rigorous, controlled conditions under which scientific experiments are performed." (Ken England, "Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook", 2001)

"One problem area for refactoring is databases. Most business applications are tightly coupled to the database schema that supports them. That's one reason that the database is difficult to change. Another reason is data migration. Even if you have carefully layered your system to minimize the dependencies between the database schema and the object model, changing the database schema forces you to migrate the data, which can be a long and fraught task." (Martin Fowler et al, "Refactoring: Improving the Design of Existing Code", 2002)

30 December 2009

📉Graphical Representation: Idealization (Just the Quotes)

"Two important characteristics of maps should be noticed. A map is not the territory it represents, but, if correct, it has a similar structure to the territory, which accounts for its usefulness. If the map could be ideally correct, it would include, in a reduced scale, the map of the map; the map of the map, of the map [...]" (Alfred Korzybski, "Science and Sanity: An Introduction to Non-Aristotelian Systems and General Semantics", 1933)

"For most line charts the maximum number of plotted lines should not exceed five; three or fewer is the ideal number. When multiple plotted lines are shown each line should be differentiated by using (a) a different type of line and/or (b) different plotting marks, if shown, and (c) clearly differentiated labeling." (Robert Lefferts, "Elements of Graphics: How to prepare charts and graphs for effective reports", 1981)

"Good numeric representation is a key to effective thinking that is not limited to understanding risks. Natural languages show the traces of various attempts at finding a proper representation of numbers. [...] The key role of representation in thinking is often downplayed because of an ideal of rationality that dictates that whenever two statements are mathematically or logically the same, representing them in different forms should not matter. Evidence that it does matter is regarded as a sign of human irrationality. This view ignores the fact that finding a good representation is an indispensable part of problem solving and that playing with different representations is a tool of creative thinking." (Gerd Gigerenzer, "Calculated Risks: How to know when numbers deceive you", 2002)

"No other statistical graphic can hold so much information at a time than the parallel coordinate plot. Thus this plot is ideal to get an initial overview of a dataset, or at the very least a large subgroup of the variables." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Trellis displays use a lattice-like arrangement to place plots onto so-called panels. Each plot in a trellis display is conditioned upon at least one other variable. The same scales are used in all the panel plots in order to make them comparable across rows and columns. […] Trellis displays are an ideal tool to compare models for different subsets." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Two types of graphic organizers are commonly used for comparison: the Venn diagram and the comparison matrix [...] the Venn diagram provides students with a visual display of the similarities and differences between two items. The similarities between elements are listed in the intersection between the two circles. The differences are listed in the parts of each circle that do not intersect. Ideally, a new Venn diagram should be completed for each characteristic so that students can easily see how similar and different the elements are for each characteristic used in the comparison." (Robert J. Marzano et al, "Classroom Instruction that Works: Research-based strategies for increasing student achievement, 2001)

"Ideally, the charts are designed in a way that gives your audience clarity and lets them understand the key insights very quickly. Color choices, highlighting, annotations, and other ways of drawing attention to your findings help in the process. By leaving white or blank space around your charts, you are able to keep the focus of your audience on the key message rather than distracting or confusing them." (Andy Kriebel & Eva Murray, "#MakeoverMonday: Improving How We Visualize and Analyze Data, One Chart at a Time", 2018)

"Numbers are ideal vehicles for promulgating bullshit. They feel objective, but are easily manipulated to tell whatever story one desires. Words are clearly constructs of human minds, but numbers? Numbers seem to come directly from Nature herself. We know words are subjective. We know they are used to bend and blur the truth. Words suggest intuition, feeling, and expressivity. But not numbers. Numbers suggest precision and imply a scientific approach. Numbers appear to have an existence separate from the humans reporting them." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Cartographers employed a minimalist visual language and the simplicity of lines and geometry to lend an air of objectivity, universality and clarity. The sparse treatment of visuals suggests a more direct correspondence between the data and the representation, and less human involvement. It communicates that the image has been reduced to its bare minimum. It has been polished through successive passes to remove the unnecessary and the contingent. And in doing so, it indicates something essential and closer to a transcendent type, or perhaps an ideal." (Peter A Hall & Patricio Dávila, "Critical Visualization: Rethinking the Representation of Data", 2022)

29 December 2009

🛢DBMS: Database Management Systems (Just the Quotes)

Disclaimer: The following definitions are intended for tracing the long road of defining what Database Management Systems (DBMS) are about, therefore the value of the quotes is regarded from the historical perspective.

"The software that allows one or many persons to use and/or modify this data is a database management system (DBMS). A major role of the DBMS is to allow the user to deal with the data in abstract terms, rather than as the computer stores the data. In this sense, the DBMS acts as an interpreter for a (very) high-level language such as APL, ideally allowing the user to specify what must be done, with little or no attention on the user's part to the detailed algorithms or data representation used by the system. However, in the case of a DBMS, there may be even less relationship between the data as seen by the user and as stored in the computer, than between APL arrays and the representation of these arrays in memory." (Jeffrey D Ullman, "Principles of Database Systems", 1980)

"An integrated data management system, that is a true database system, is one in which data can be held nonredundantly while at the same time, a query which requires a mixture of different kinds of data (such as the query that we've been looking at) can be specified in a single command from the highest level programming interface. Any system that does not have this property really should not call itself a 'database' system; 'file management' system would be a more accurate term." (Mark L Gillenson, "Database: Step by Step", 1985)

"The DBMS is a program (or a set of programs) that allows stored data to be integrated, reduces data duplication, ensures data integrity, eliminates program dependency on file formats, and allows even complicated objects to be easily represented and retrieved. In short, a DBMS is the program that 'processes the database'." (David M Kroenke & Kathleen Dolan , Database Processing: Fundamentals, design, implementation” 3rd Ed., 1988)

"A database management system (DBMS) consists of a collection of interrelated data and a set of programs to access that data. The collection of data, usually referred to as the database, contains information about one particular enterprise. The primary goal of a DBMS is to provide an environment that is both convenient and efficient to use in retrieving and storing database information." (Henry F. Korth & Abraham Silberschatz, "Database System Concepts" 2nd Ed., 1991)

"A database management system is a collection of interrelated files and a set of programs that allow users to access and modify these files. A major purpose of a database system is to provide users with an abstract view of the data. That is, the system hides certain details of how the data is stored and maintained. However, in order for the system to be usable, data must be retrieved efficiently. This concern has lead to the design of complex data structures for the representation of data in the database. Since many database systems users are not computer-trained, the complexity is hidden from them through several levels of abstraction in order to simplify their interaction with the system." (Henry F. Korth & Abraham Silberschatz, "Database System Concepts" 2nd Ed., 1991)

"A DBMS is a large collection of computer processes that can be entirely software or firmware that operate in a generalized or specialized computer that provides languages or mechanisms to: Define the logical database component of a specific database either directly or through reference to the IRDS metadata, that is, the data record types, data record elements, and relationships. Define the physical component of a specific database, that is, the storage structure definition, and the techniques for access strategies, data loading and updates, and database backup. Specify interrogations that access data through one or more types of languages such as host languages like COBOL or FORTRAN, and/or natural languages like query-update, report writers, or procedure-oriented. Specify system control facilities, that is, audit trails, backup and recovery, concurrent operations, and the like." (Michael M Gorman, "Database Management Systems: Understanding and Applying Database Technology", 1991)

"Database packages, more correctly called Database Management Systems or DBMSs, such as Oracle and Microsoft Access provide us with all the facilities necessary to design and use an application database. A DBMS can be viewed as having three components: (1) The user interface is the software that talks' with the user in creating and accessing the database. […] (2) The database engine: this is the software that manages the storage and accessing of the physical data stored on disk. (3) The data dictionary, a repository of information about the application database." (Colin Ritchie, "Relational Database Principles" 2nd Ed., 2002)

"A Database Management System (DBMS) is an organised set of facilities for accessing and maintaining one or more databases. A DBMS is a shell which surrounds a database or series of databases and through which all interactions take place with the database." (Paul Beynon-Davies, "Database Systems" 3rd Ed. 2004)

"A software system that enables users to define, create, and maintain the database and also provides controlled access to this database." (Thomas M Connolly & Carolyn E Begg, "Database Solutions: A step-by-step guide to building databases", 2004)

"A product or tool that manages any kind of database, not just relational databases." (Rod Stephens, "Beginning Database Design Solutions", 2009)

🛢DBMS: Databases (Just the Quotes)

Disclaimer: The definitions below are intended for tracing the long road of defining what databases are about, therefore the value of the quotes should be regarded from the historical perspective.

"Historically, data base systems evolved as generalized access methods [...] As a result, most data base systems emphasize the question of how data may be stored or accessed, but they ignore the question of what the data means to the people who use it." (John F Sowa, "Conceptual Graphs for a Data Base Interface", 1976)

"A database is a set of data whose structure can be described by a connected graph, and whose instances can be accessed simultaneously." (Rob Gerritsen et al, "On Some Metrics for Databases or What is a Very Large Database?", 1977)

"Data […] that is stored more-or-less permanently in a computer we term a database." (Jeffrey D Ullman, "Principles of Database Systems", 1980)

"When we talk informally about a database, we refer to a collection of mutually related data, to the computer hardware that is used to store it, and to the programs used to manipulate it. […] A database is a collection of related data. The data storage for a database is accomplished by the use of one or more files. All files of one database are accessible from one computer or from any interconnected computer if the database is distributed over several computers." (Gio Wiederhold, "Database Design" 2nd Ed., 1983)

"[...] there is a class of methodologies - really more of an environment - for storing, accessing, and in general, managing data which can form the foundation for survival in a world which is going to become ever more data oriented as time goes on. That class of methodologies or environment has come to be known as database." (Mark L Gillenson, "Database: Step by Step", 1985)

"A database is a self-describing collection of integrated records. […] A database is self-describing in that it contains, in addition to application data, a description of its own structure. […] A database is more than a collection of files. A database includes not only files, but also a data dictionary and a description of the relationships among the records in the files. These relationship descriptions are stored and recalled during database. […] A database is a data model of an organization." (David M Kroenke & Kathleen Dolan, , "Database Processing: Fundamentals, design, implementation" 3rd Ed., 1988)

"A database is not a technology. Rather, it is an expression of organization, clarity, and precision. It may or may not be computerized. If it is, it may exist on a microcomputer, a minicomputer, or a large mainframe. Finally, a database may or may not be centralized. However a database is implemented and operated, success is impossible without the codification of and adherence to data semantics, which are the rules for meaning, validity, and usage." (Michael M Gorman, "Database Management Systems: Understanding and Applying Database Technology", 1991)

"What is a database? It's a file - that's an electronic version of information sheets and binders. It enables you to efficiently use your information. […] Just as a binder contains information sheets, a database holds records. […] First, a database enables you to store larger amounts of information than a binder. […] Second, a database enables you to retrieve your information more efficiently, more flexibly, and more creatively. […] Third, a database can manipulate the information it contains. […] Fourth, a database enables you to print your information on Information envelopes, labels, fax cover sheets, reports, and letters." (Guy Kawasaki, "Database 101", 1991)

"When a database is computerized, it represents the automation of the knowledge component of a business, which is manifest through the business's quality operation, planning, and management With a successful database, the managers of a business can research the past, organize the present, and plan for the future." (Michael M Gorman, "Database Management Systems: Understanding and Applying Database Technology", 1991)

"Conceptually, the database is a set of sentences expressing propositions asserted to be true of the UoD [Universe of Discourse]. Since sentences may be added to or deleted from the database, the database may undergo transitions from one state to another. However, at any particular time, the sentences populating the database must individually and collectively conform to the application specific grammar or design plan which is the conceptual schema." (Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"In using a data base, first look at the metadata, then look at the data. [...] The old computer acronym GIGO (Garbage In, Garbage Out) applies to the use of large databases. The issue is whether the data from the database will answer the research question. In order to determine this, the investigator must have some idea about the nature of the data in the database - that is, the metadata." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"To be worthy of being called a database, a system must have two essential properties. 1. It holds data as an integrated system of records. 2. It contains self-describing information." (Colin Ritchie, "Relational Database Principles" 2nd Ed., 2002)

"A database can be considered as an organised collection of data which is meant to represent some UoD [Universe of Discourse]. […] A database is made up of two parts: an intensional part and an extensional part. The intension of a database is a set of definitions which describe the structure or organisation of a given database. The extension of a database is the total set of data in the database. The intension of a database is also referred to as its schema." (Paul Beynon-Davies, "Database Systems" 3rd Ed. 2004)

"A database in manual terms is analogous to a filing cabinet, or more accurately to a series of filing cabinets. A database is a structured repository for data. The overall purpose of such a repository is to maintain data for some set of organisational objectives. Traditionally, such objectives fall within the domain of administration. Most database systems are built to retain the data required for the running of the day-to-day activities of an organisation. […] Structure normally implies some logical division, usually hierarchical. Hence we speak of a database as being a collection of data structures. […] Each data structure in a database is in turn also a hierarchical collection of data. In manual terms, such data structures would be folders hung in a filing cabinet." (Paul Beynon-Davies, "Database Systems" 3rd Ed. 2004)

"Databases rarely begin their life empty. More often the starting point in their lifecycle is a data conversion from some previously exiting data source. And by a cruel twist of fate, it is usually a rather violent beginning. Data conversion usually takes the better half of new system implementation effort and almost never goes smoothly." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"[...] data conversion is the most difficult part of any system implementation. The error rate in a freshly populated new database is often an order of magnitude above that of the old system from which the data is converted. As a major source of the data problems, data conversion must be treated with the utmost respect it deserves." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"The corporate data universe consists of numerous databases linked by countless real-time and batch data feeds. The data continuously move about and change. The databases are endlessly redesigned and upgraded, as are the programs responsible for data exchange. The typical result of this dynamic is that information systems get better, while data deteriorates. This is very unfortunate since it is the data quality that determines the intrinsic value of the data to the business and consumers. Information technology serves only as a magnifier for this intrinsic value. Thus, high quality data combined with effective technology is a great asset, but poor quality data combined with effective technology is an equally great liability." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"As we know, every action produces an equal and opposite reaction. Adding security reduces performance. Sharding and partitioning the database affords greater performance and distribution but creates complexity that is difficult to manage. Adding robust monitoring can generate huge volumes of log data to be stored, rotated, secured, and cleansed. Keeping the design “simple” often defers the interests of flexibility until later, where it becomes very expensive." (Eben Hewitt, "Technology Strategy Patterns: Architecture as strategy" 2nd Ed., 2019)

26 December 2009

📉Graphical Representation: Bivariate Data (Just the Quotes)

"[...] lack of interaction in analysis of variance and [...] lack of correlation in bivariate distributions - such nullities would be quite surprising phenomena in the usual interactive complexities of social life." (Louis Guttman, "What is not what in statistics", The Statistician, 1977)

"Fitting is essential to visualizing hypervariate data. The structure of data in many dimensions can be exceedingly complex. The visualization of a fit to hypervariate data, by reducing the amount of noise, can often lead to more insight. The fit is a hypervariate surface, a function of three or more variables. As with bivariate and trivariate data, our fitting tools are loess and parametric fitting by least-squares. And each tool can employ bisquare iterations to produce robust estimates when outliers or other forms of leptokurtosis are present." (William S Cleveland, "Visualizing Data", 1993)

"The scatterplot is a useful exploratory method for providing a first look at bivariate data to see how they are distributed throughout the plane, for example, to see clusters of points, outliers, and so forth." (William S Cleveland, "Visualizing Data", 1993)

"The scatterplot matrix shows all pairwise (bivariate marginal) views of a set of variables in a coherent display. One analog for categorical data is a matrix of mosaic displays showing some aspect of the bivariate relation between all pairs of variables. The simplest case shows the bivariate marginal relation for each pair of variables. Another case shows the conditional relation between each pair, with all other variables partialled out. For quantitative data this represents (a) a visualization of the conditional independence relations studied by graphical models, and (b) a generalization of partial residual plots. The conditioning plot, or coplot, shows a collection of partial views of several quantitative variables, conditioned by the values of one or more other variables. A direct analog of the coplot for categorical data is an array of mosaic plots of the dependence among two or more variables, stratified by the values of one or more given variables. Each such panel then shows the partial associations among the foreground variables; the collection of such plots shows how these associations change as the given variables vary." (Michael Friendly, "Extending Mosaic Displays: Marginal, Conditional, and Partial Views of Categorical Data", 199)

"A time-series plot (sometimes also called a time plot) is a simple graph of data collected over time that can be invaluable in identifying trends or patterns that might be of interest.A time-series plot can be constructed by thinking of the data set as a bivariate data set, where y is the variable observed and x is the time at which the observation was made. These (x, y) pairs are plotted as in a scatterplot. Consecutive observations are then connected by a line segment; this aids in spotting trends over time." (Roxy Peck et al, "Introduction to Statistics and Data Analysis" 4th Ed., 2012)

"Eye-catching data graphics tend to use designs that are unique (or nearly so) without being strongly focused on the data being displayed. In the world of Infovis, design goals can be pursued at the expense of statistical goals. In contrast, default statistical graphics are to a large extent determined by the structure of the data (line plots for time series, histograms for univariate data, scatterplots for bivariate nontime-series data, and so forth), with various conventions such as putting predictors on the horizontal axis and outcomes on the vertical axis. Most statistical graphs look like other graphs, and statisticians often think this is a good thing." (Andrew Gelman & Antony Unwin, "Infovis and Statistical Graphics: Different Goals, Different Looks" , Journal of Computational and Graphical Statistics Vol. 22(1), 2013)

"A correlation is simply a bivariate relationship - a fancy way of saying that there is a relationship between two ('bi') variables ('variate'). And a bivariate relationship doesn’t prove that one thing caused the other. Think of it this way: you can observe that two things appear to be related statistically, but that doesn’t tell you the answer to any of the questions you might really care about - why is there a relationship and what does it mean to us as a consumer of data?" (John H Johnson & Mike Gluck, "Everydata: The misinformation hidden in the little data you consume every day", 2016)

"One problem for visualizing multiple views is that of laying out the plots. Indeed, there are some plots, such as scatterplot matrixes and trellis displays, that are formed just by arranging simpler plots according to certain rules. Scatterplot matrices, for example, arrange scatterplots side by side so that each variable in a dataset is graphed against the other variables, with the graphs being displayed as a row or a column of the matrix. This lets the user rapidly inspect all of the bivariate relationships among the variables, permitting the detection of outliers, nonlinearities, and other features of the data." (Forrest W Young et al, "Visual Statistics: Seeing data with dynamic interactive graphics", 2016)

"Correlation does not imply causation: often some other missing third variable is influencing both of the variables you are correlating. […] The need for a scatterplot arose when scientists had to examine bivariate relations between distinct variables directly. As opposed to other graphic forms - pie charts, line graphs, and bar charts - the scatterplot offered a unique advantage: the possibility to discover regularity in empirical data (shown as points) by adding smoothed lines or curves designed to pass 'not through, but among them', so as to pass from raw data to a theory-based description, analysis, and understanding." (Michael Friendly & Howard Wainer, "A History of Data Visualization and Graphic Communication", 2021)

25 December 2009

📉Graphical Representation: Univariate vs Multivariate Data (Just the Quotes)

"Fitting data means finding mathematical descriptions of structure in the data. An additive shift is a structural property of univariate data in which distributions differ only in location and not in spread or shape. […] The process of identifying a structure in data and then fitting the structure to produce residuals that have the same distribution lies at the heart of statistical analysis. Such homogeneous residuals can be pooled, which increases the power of the description of the variation in the data." (William S Cleveland, "Visualizing Data", 1993)

"The logarithm is one of many transformations that we can apply to univariate measurements. The square root is another. Transformation is a critical tool for visualization or for any other mode of data analysis because it can substantially simplify the structure of a set of data. For example, transformation can remove skewness toward large values, and it can remove monotone increasing spread. And often, it is the logarithm that achieves this removal." (William S Cleveland, "Visualizing Data", 1993)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland, "Visualizing Data", 1993)

"The only thing that is 2-dimensional about evidence is the physical flatland of paper and computer screen. Flatlandy technologies of display encourage flatlandy thinking. Reasoning about evidence should not be stuck in 2 dimensions, for the world seek to understand is profoundly multivariate. Strategies of design should make multivariateness routine, nothing out of the ordinary. To think multivariate, show multivariate; the Third Principle for the analysis and presentation of data: 'Show multivariate data; that is, show more than 1 or 2 variables.'" (Edward R Tufte, "Beautiful Evidence", 2006)

"The simplest way to plot univariate continuous data is a dotplot. Because the points are distributed along only one axis, overplotting is a serious problem, no matter how small the sample is. The usual technique to avoid overplotting is jittering, i.e., the data are randomly spread along a virtual second axis." (Antony Unwin et al [in "Graphics of Large Datasets: Visualizing a Million"], 2006)

"Multivariate techniques often summarize or classify many variables to only a few groups or factors (e.g., cluster analysis or multi-dimensional scaling). Parallel coordinate plots can help to investigate the influence of a single variable or a group of variables on the result of a multivariate procedure. Plotting the input variables in a parallel coordinate plot and selecting the features of interest of the multivariate procedure will show the influence of different input variables." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Parallel coordinate plots are often overrated concerning their ability to depict multivariate features. Scatterplots are clearly superior in investigating the relationship between two continuous variables and multivariate outliers do not necessarily stick out in a parallel coordinate plot. Nonetheless, parallel coordinate plots can help to find and understand features such as groups/clusters, outliers and multivariate structures in their multivariate context. The key feature is the ability to select and highlight individual cases or groups in the data, and compare them to other groups or the rest of the data." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Multivariate analysis refers to incorporation of multiple exploratory variables to understand the behavior of a response variable. This seems to be the most feasible and realistic approach considering the fact that entities within this world are usually interconnected. Thus the variability in response variable might be affected by the variability in the interconnected exploratory variables." (Danish Haroon, "Python Machine Learning Case Studies", 2017)

"A heatmap is a visualization where values contained in a matrix are represented as colors or color saturation. Heatmaps are great for visualizing multivariate data (data in which analysis is based on more than two variables per observation), where categorical variables are placed in the rows and columns and a numerical or categorical variable is represented as colors or color saturation." (Mario Döbler & Tim Großmann, "The Data Visualization Workshop", 2nd Ed., 2020)

23 December 2009

📉Graphical Representation: Categorical Data (Just the Quotes)

"Mosaic displays represent the counts in a contingency table by tiles whose size is proportional to the cell count. This graphical display for categorical data generalizes readily to multiway tables." (Michael Friendly, "Mosaic Displays for Loglinear Models", Proceedings of the Statistical Graphics, 1992)

"Categorical data are most often modeled using loglinear models. For certain loglinear models, mosaic plots have unique shapes that do not depend on the actual data being modeled. These shapes reflect the structure of a model, defined by the presence and absence of particular model coefficients. Displaying the expected values of a loglinear model allows one to incorporate the residuals of the model graphically and to visually judge the adequacy of the loglinear fit. This procedure leads to stepwise interactive graphical modeling of loglinear models. We show that it often results in a deeper understanding of the structure of the data. Linking mosaic plots to other inter- active displays offers additional power that allows the investigation of more complex dependence models than provided by static displays." (Martin Theus & Stephan R W Lauer, "Visualizing Loglinear Models", Journal of Computational and Graphical Statistics Vol. 8 (3), 1999)

"A graphical display of a p-dimensional contingency table, the empirical distribution of p categorical variables, is a mosaic plot. Each tile (or bin) corresponds to one cell of the contingency table, its size to the number of the cell's entries. The shape of a tile is calculated during the (strictly hierarchical) construction." (Heike Hoffmann, "Generalized Odds Ratios for Visual Modeling", Journal of Computational and Graphical Statistics Vol. 10 (4), 2001)

"Mosaic plots […] are designed to show the dependencies and interactions between multiple categorical variables in one plot. […] . A spineplot can be regarded as a kind of one-dimensional mosaic plot. […] In contrast with a barchart, where the bars are aligned to an axis, the mosaic plot uses a rectangular region, which is subdivided into tiles according to the numbers of observations falling into the different classes. This subdivision is done recursively, or in statistical terms conditionally, as more variables are included." (Antony Unwin et al [in "Graphics of Large Datasets: Visualizing a Million"], 2006)

"Shingling is the process of dividing a continuous variable into - possibly overlapping - intervals in order to convert a continuous variable into a discrete variable. Shingling is quite different from conditioning on categorical variables. Overlapping shingles/intervals lead to multiple representation of data within a trellis display, which is not the case for categorical variables. Furthermore, it is challenging to judge which intervals/cases have been chosen to build a shingle. Trellis displays represent the shingle interval visually by an interval of the strip label. Although no plotting space is wasted, the information on the intervals is difficult to read from the strip label. Despite these drawbacks, there is a valid motivation for shingling […]." (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Trellis displays introduce the concept of shingling. Shingling is the process of dividing a continuous variable into - possibly overlapping - intervals in order to convert a continuous variable into a discrete variable. Shingling is quite different from conditioning on categorical variables. Overlapping shingles/intervals lead to multiple representation of data within a trellis display, which is not the case for categorical variables. Furthermore, it is challenging to judge which intervals/cases have been chosen to build a shingle. Trellis displays represent the shingle interval visually by an interval of the strip label. Although no plotting space is wasted, the information on the intervals is difficult to read from the strip label. Despite these drawbacks, there is a valid motivation for shingling," (Martin Theus & Simon Urbanek, "Interactive Graphics for Data Analysis: Principles and Examples", 2009)

"Histograms are often mistaken for bar charts but there are important differences. Histograms show distribution through the frequency of quantitative values (y axis) against defined intervals of quantitative values(x axis). By contrast, bar charts facilitate comparison of categorical values. One of the distinguishing features of a histogram is the lack of gaps between the bars [...]" (Andy Kirk, "Data Visualization: A successful design process", 2012)

"If I had to pick a single go-to graph for categorical data, it would be the horizontal bar chart, which flips the vertical version on its side. Why? Because it is extremely easy to read. The horizontal bar chart is especially useful if your category names are long, as the text is written from left to right, as most audiences read, making your graph legible for your audience." (Cole N Knaflic, "Storytelling with Data: A Data Visualization Guide for Business Professionals", 2015)

"A heatmap is a visualization where values contained in a matrix are represented as colors or color saturation. Heatmaps are great for visualizing multivariate data" (data in which analysis is based on more than two variables per observation), where categorical variables are placed in the rows and columns and a numerical or categorical variable is represented as colors or color saturation." (Mario Döbler & Tim Großmann, "The Data Visualization Workshop", 2nd Ed., 2020)

"Visualizations that use different lengths of rectangles to show quantities are called bar charts. The rectangles in bar charts are called bars, and each bar represents a single category from a categorical variable. [...] When the bars in a bar chart are standing up, these visualizations are sometimes called column charts. Column charts and bar charts work in exactly the same way, but you might choose one over the other to fit better on a page or because it suits the data better." (Nancy Organ, "Data Visualization for People of All Ages", 2024)

SQL Troubles

Pages

13 January 2010

🗄️Data Management: Data Quality Dimensions (Part III: Completeness)

🗄️Data Management: Data Quality Dimensions (Part II: Conformity)

🗄️Data Management: Data Quality Dimensions (Part I: Uniqueness)

🗄️Data Management: Data Quality (Part II: An Introduction)

🕋Data Warehousing: Dimension Table (Definitions)

10 January 2010

🕋Data Warehousing: Dimension Table (Definitions)

🕋Data Warehousing: Dimension (Definitions)

🕋Data Warehousing: Fact Table (Definitions)

31 December 2009

🛢DBMS: Database Design (Just the Quotes)

30 December 2009

📉Graphical Representation: Idealization (Just the Quotes)

29 December 2009

🛢DBMS: Database Management Systems (Just the Quotes)

🛢DBMS: Databases (Just the Quotes)

26 December 2009

📉Graphical Representation: Bivariate Data (Just the Quotes)

25 December 2009

📉Graphical Representation: Univariate vs Multivariate Data (Just the Quotes)

23 December 2009

📉Graphical Representation: Categorical Data (Just the Quotes)

About Me