SQL Troubles: Consistency

Showing posts with label Consistency. Show all posts

27 January 2025

🗄️🗒️Data Management: Data Quality Dimensions [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes.

Last updated: 27-Jan-2025

[Data Management] Data quality dimensions

{def} features of data that can be measured or assessed against defined standards to determine the quality of data

captures a specific aspect of general data quality

can refer to data values or to their schema

{type} hard dimensions

dimensions that can be measured

{type} soft dimensions

dimensions that can be measured only indirectly

⇐ through interviews with data users or through any other kind of communication with users

dimensions whose measurement depends on the perception of the users of the data

{dimension} uniqueness [post]

the degree to which a value or set of values is unique within a dataset

can be determined based on a set of values supposed to be unique across the whole dataset

some systems have a artificial, respectively natural unique identified

measured in terms of either

the percentage of unique values available in a dataset
the percentage of duplicate values available in a dataset

the impossibility of identifying whether a value is unique increases the chances for it to be duplicated
it can have broader implications

aggregated information is not shown correctly

⇐ split across different entities

can lead to further duplicates in other areas

{recommendation} enforce uniqueness by design, if possible
{recommendation} check the data regularly for duplicates and disable or delete the duplicated records

⇐ one should make sure that the records can't be further reused in business processes or analytics workloads

{dimension} completeness [post]

the extent to which there are missing data in a dataset

⇐ reflected in the number of the missing values

measured as percentage of the missing values compared to the total

determined by the presence of NULL values

{type} attribute completeness

the number of NULLs in a specific attribute

{type} tuple completeness

the number of unknown values of the attributes in a tuple

{type} relation completeness

the number of tuples with unknown attribute values in the relation

{type} value completeness

makes sense for complex, semi-structured columns such as XML data type columns

e.g. a complete element or attribute can be missing

considered in report to

mandatory attributes

attributes that need a not-Null value for each record

optional attributes

attributes that not necessarily need to be provided

inapplicable attributes

attributes not applicable (relevant) for certain scenarios by design

{dimension} conformity (aka format compliance) [post]

{def} the extent data are in the expected format

dependent on the data type and its definition

can be associated with a set of metadata

data type

e.g. text, numeric, alphanumeric, positive, date

length
precision
scale
formatting patterns

e.g. phone number, decimal and digit grouping symbols
different formatting might apply based on various business rules
can use delimiters

{recommendation} define the data type and further constraints to enforce the various characteristics of the element
{recommendation} make sure that the delimiters don't overlap with other uses

{dimension} accuracy [post]

{def} the extent data is correct, respectively match the reality with an acceptable level of approximation
stricter than just conforming to business rules
can be measured at column and table level

[discrete data values]

use frequency distribution of values

a value with very low frequency is probably incorrect

[alphanumeric values]

use string length distribution

a string with a very atypical length is potentially incorrect

try to find patterns and then create pattern distribution.

patterns with low frequency probably denote wrong values

[continuous attributes]

use descriptive statistics

just by looking at minimal and maximal values, you can easily spot potentially problematic data

{dimension} consistency [post]

{def} the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component

{type} notational consistency

the extent (data) values are consistent in notation

{type} semantic consistency

the degree to which data has unique meaning
is more restrictive than the notational consistency

measures the equivalence of information stored in various repositories
involves comparing values with a predefined set of possible values

from the same or from different systems

can be measured at column and table level
can have different scopes

cross-system consistencies

among systems or data repositories

cross-record consistency

within the same repository

temporal consistency

within the same record at different points in time

{dimension} timeliness [post]

tells the degree to which data is current and available when needed

there is always some delay between change in the real world and the moment when this change is entered into a system

stale data/obsolete data

{dimension} structuredness [post]

the degree to which a data structure or model possesses a definite pattern of organization of its interdependent parts
allows the categorization of data as

structured data [def]

refers to structures that can be easily perceived or known, that raises no doubt on structure’s delimitations

unstructured data [def]

refers to textual data and media content (video, sound, images), in which the structural patterns even if exist they are hard to discover or not predefined

semi-structured data [def]

refers to islands of structured data stored with unstructured data, or vice versa

⇐ the more structured the data, the easier it is to be processed

{dimension} referential integrity [post]

{def} the degree to which the values of a key in one table (aka reference value) match the values of a key in a related table (aka the referenced value)
it's an architectural concept of the database
{recommendation} keep the referential integrity of a system by design

some systems build logic for assuring the referential integrity in the applications and not in the database

{dimension} currency (aka actuality)

the extent to which data is actual
can be considered as a special type of accuracy

⇐ when the data is not actual then it doesn’t reflect reality

{dimension} ease of use

the extent to which data can be used for a given purpose

usually it refers to whether the data can be processed as needed
depends on the application or on the user interface

{dimension} fitness of use

the degree to which the data is fit for use

the data may have good quality for a given purposes but

not usable for other purposes
can be used as substitute for other data

e.g. use phone area codes instead of ZIP codes to locate customers approximately

{dimension} trustfulness [post]

the degree to which the data can be trusted

is a matter of perception

ask users whether they trust the data and which are the reasons

if the users don’t trust the data

they will create their own solutions
they will not use applications

{dimension} entropy

{def} the average amount of information conveyed

⇐ quantification of information in a system
⇐ the more dispersed the values and the more the frequency distribution of a discrete column is equally spread among the values, the more information is available [1]
⇐ can tell whether your data is suitable for analysis or not

can be measured at column and table level

{dimension} presentation quality

applicable to applications that presents data

format and appearance should support the appropriate use of data
depends on the UI used

{recommendation} have a dedicated system for maintaining the master data and broadcast the data to the subscribers as needed

the data should be exclusively managed though the management system
{anti-pattern} data is modified in the subscribers and the changes aren't always reflected back to the source system

Previous Post <<||>> Next Post

References:
[1] Dejan Sarka et al (2012) Exam 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012 (Training Kit)

01 February 2021

📦Data Migrations (DM): Quality Assurance (Part II: Quality Acceptance Criteria II)

Data Migrations Series

Auditability

Auditability is the degree to which the solution allows checking the data for their accuracy, or for their quality in general, respectively the degree to which the DM solution and processes allow to be audited regarding compliance, security and other types of requirements. All these aspects are important in case an external sign-off from an auditor is mandatory.

Automation

Automation is the degree to which the activities within a DM can be automated. Ideally all the processes or activities should be automated, though other requirements might be impacted negatively. Ideally, one needs to find the right balance between the various requirements.

Cohesion

Cohesion is the degree to which the tasks performed by the solution, respectively during the migration, are related to each other. Given the dependencies existing between data, their processing and further project-related activities, DM imply a high degree of cohesion that need to be addressed by design.

Complexity

Complexity is the degree to which a solution is difficult to understand given the various processing layers and dependencies existing within the data. The complexity of DM revolve mainly around the data structures and the transformations needed to translate the data between the various data models.

Compliance

Compliance is the degree to which a solution is compliant with internal or external regulations that apply. There should be differentiated between mandatory requirements, respectively recommendations and other requirements.

Consistency

Consistency is the degree to which data conform to an equivalent set of data, in this case the entities considered for the DM need to be consistent to each other. A record referenced in any entity of the migration need to be considered, respectively made available in the target system(s) either by parametrization or migration.

During each iteration, the data need to remain consistent, so it can facilitate the troubleshooting. The data are usually reimported between iterations or during same iteration, typically to reflect the changes occurred in the source systems or other purposes.

Coupling

Data coupling is the degree to which different processing areas within a DM share the same data, typically a reflection of the dependencies existing between the data. Ideally, the areas should be decoupled as much as possible.

Extensibility

Extensibility is the degree to which the solution or parts of the logic can be extended to accommodate further requirements. Typically, this involves changes that deviate from the standard functionality. Extensibility impacts positively the flexibility.

Flexibility

Flexibility is the degree to which a solution can handle new requirements or ad-hoc changes to the logic. No matter how good everything was planned there’s always something forgotten or new information is identified. Having the flexibility to change code or data on the fly can make an important difference.

Integrity

Integrity is the degree to which a solution prevents the changes to data besides the ones considered by design. Users and processes should not be able modifying the data besides the agreed procedures. This means that the data need to be processed in the sequence agreed. All aspects related to data integrity need to be documented accordingly.

Interoperability

Interoperability is the degree to which a solution’s components can exchange data and use the respective data. The various layers of a DM’s solutions must be able to process the data and this should be possible by design.

Maintainability

Maintainability is the degree to which a solution can be modified to or add minor features, change existing code, corrects issues, refactor code, improve performance or address changes in environment. The data required and the transformation rules are seldom known in advance. The data requirements are definitized during the various iterations, the changes needing to be implemented as the iterations progress. Thus, maintainability is a critical requirement.

Previous Post - Next Post

13 September 2020

🎓Knowledge Management: Definitions II (What's in a Name)

Browsing through the various books on databases and programming appeared over the past 20-30 years, it’s probably hard not to notice the differences between the definitions given even for straightforward and basic concepts like the ones of view, stored procedure or function. Quite often the definitions lack precision and rigor, are circular and barely differentiate the defined term (aka concept) from other terms. In addition, probably in the attempt of making the definitions concise, important definitory characteristics are omitted.

Unfortunately, the same can be said about other non-scientific books, where the lack of appropriate definitions make the understanding of the content and presented concepts more difficult. Even if the reader can arrive in time to an approximate understanding of what is meant, one might have the feeling that builds castles in the air as long there is no solid basis to build upon – and that should be the purpose of a definition – to offer the foundation on which the reader can build upon. Especially for the readers coming from the scientific areas this lack of appropriateness and moreover, the lack of definitions, feels maybe more important than for the professional who already mastered the respective areas.

In general, a definition of a term is a well-defined descriptive statement which serves to differentiate it from related concepts. A well-defined definition should be meaningful, explicit, concise, precise, non-circular, distinct, context-dependent, relevant, rigorous, and rooted in common sense. In addition, each definition needs to be consistent through all the content and when possible, consistent with the other definitions provided. Ideally the definitions should cover as much of possible from the needed foundation and provide a unitary consistent multilayered non-circular and hierarchical structure that facilitates the reading and understanding of the given material.

Thus, one can consider the following requirements for a definition:

Meaningful: the description should be worthwhile and convey the required meaning for understanding the concept.

Explicit: the description must state clearly and provide enough information/detail so it can leave no room for confusion or doubt.

Context-dependent: the description should provide upon case the context in which the term is defined.

Concise: the description should be as succinct as possible – obtaining the maximum of understanding from a minimum of words.

Precise: the description should be made using unambiguous words that provide the appropriate meaning individually and as a whole.

Intrinsic non-circularity: requires that the term defined should not be used as basis for definitions, leading thus to trivial definitions like “A is A”.

Distinct: the description should provide enough detail to differentiate the term from other similar others.

Relevant: the description should be closely connected or appropriate to what is being discussed or presented.

Rigorous: the descriptions should be the result of a thorough and careful thought process in which the multiple usages and forms are considered.

Extrinsic non-circularity: requires that the definitions of two distinct terms should not be circular (e.g. term A’s definition is based on B, while B’s definition is based on A), situation usually met occasionally in dictionaries.

Rooted in common sense: the description should not deviate from the common-sense acceptance of the terms used, typically resulted from socially constructed or dictionary-based definitions.

Unitary consistent multilayered hierarchical structure: the definitions should be given in an evolutive structure that facilitates learning, typically in the order in which the concepts need to be introduced without requiring big jumps in understanding. Even if concepts have in general a networked structure, hierarchies can be determined, especially based on the way concepts use other concepts in their definitions. In addition, the definitions must be consistent – hold together – respectively be unitary – form a whole.

21 May 2020

📦Data Migrations (DM): In-house Built Solutions (Part II: The Import Layer)

Data Migrations Series

A data migration involves the mapping between two data models at (data) entity level, where the entity is a data abstraction modelling a business entity (e.g. Products, Vendors, Customers, Sales Orders, Purchase Orders, etc.). Thus, the Products business entity from the source will be migrated to a similar entity into the target. Ideally, the work would be simplified if the two models would provide direct access to the data through entities. Unfortunately, this is seldom the case, the entities being normalized and thus broken into several tables, with important structural differences.

Therefore, the first step within a DM is identifying the business entities that make its scope from source and target, and providing a mapping between their attributes which will define how the data will flow between source and target.

In theory, the source entity could be defined directly into the source with the help of views, if they are not already available. The problem with this approach is that the base data change, fact that can easily lead to inconsistencies between the various steps of the migration. For example, records are added, deleted, inactivated, or certain attributes are changed, fact that can easily make troubleshooting and validation a nightmare.

The easiest way to address this is by assuring that the data will change only when actually needed. Is needed thus to create a snapshot of the data and work with it. Snapshots can become however costly for the performance of the source database, as they involve an additional maintenance overhead. Another solution is to make the snapshot via a backup or by copying the data via ETL functionality into another database (aka migration database). Considering that the data in scope make a small subset, a backup is usually costly as storage space and time, and is not always possible to take a backup when needed.

An ETL-based solution for this provides an acceptable performance and is flexible enough to address all important types of requirements. The data can be accessed directly from the source (pull mechanism) or, when the direct access is an issue, they could be pushed to the migration database (push mechanism) or made available for load to a given location, then imported it into the migration database (hybrid mechanism). There’s also the possibility to integrate the migration database when a publisher/subscriber mechanism is in place, however such solutions raise other types of issues.

One can import the tables 1:1 from the source for the entities in scope, attempt directly to model the entity within the ETL jobs or find a solution in between (e.g. import the base tables while considering joining also the dropdown tables). The latter seems to provide the best approach because it minimizes the numbers of tables to be imported while still reflecting the data structures from the source. An entity-based import addresses the first but not the second aspect, though depending on the requests it can work as well.

In Data Warehousing (DW) there’s the practice to load the data into staging tables with no constraints on them, and only when the load is complete to move the data into the base tables which will be used as source for the further processing. This approach assures that the data are loaded completely and that the unavailability of the base tables is limited. In contrast to DW solutions is ideally not to perform any transformations on the data, as they should reflect the quality characteristics from the source. It's ideal to keep the data extraction, respectively the ETL jobs as simple as possible and resist building the migration logic already into this layer.

04 May 2019

🧊Data Warehousing: Architecture (Part I: Push vs. Pull)

In data integrations, data migrations and data warehousing there is the need to move data between two or more systems. In the simplest scenario there are only two systems involved, a source and a target system, though there can be complex scenarios in which data from multiple sources need to be available in a common target system (as in the case of data warehouses/marts or data migrations), or data from one source (e.g. ERP systems) need to be available in other systems (e.g. Web shops, planning systems), or there can be complex cases in which there is a many-to-many relationship (e.g. data from two ERP systems are consolidated in other systems).

The data can flow in one direction from the source systems to the target systems (aka unidirectional flow), though there can be situations in which once the data are modified in the target system they need to flow back to the source system (aka bidirectional flow), as in the case of planning or product development systems. In complex scenarios the communication may occur multiple times within same process until a final state is reached.

Independently of the number of systems and the type of communication involved, data need to flow between the systems as smooth as possible, assuring that the data are consistent between the various systems and available when needed. The architectures responsible for moving data between the sources are based on two simple mechanisms - push vs pull – or combinations of them.

A push mechanism makes data to be pushed from the source system into the target system(s), the source system being responsible for the operation. Typically the push can happen as soon as an event occurs in the source system, event that leads to or follows a change in the data. There can be also cases when is preferred to push the data at regular points in time (e.g. hourly, daily), especially when the changes aren’t needed immediately. This later scenario allows to still make changes to the data in the source until they are sent to other system(s). When the ability to make changes is critical this can be controlled over specific business rules.

A pull mechanism makes the data to be pulled from the source system into the target system, the target systems being responsible for the operation. This usually happens at regular points in time or on demand, however the target system has to check whether the data have been changed.

Hybrid scenarios may involve a middleware that sits between the systems, being responsible for pulling the data from the source systems and pushing them into the targets system. Another hybrid scenario is when the source system pushes the data to an intermediary repository, the target system(s) pulling the data on a need basis. The repository can reside on the source, target on in-between. A variation of it is when the source informs the target that a change happened at it’s up to the target to decide whether it needs the data or not.

The main differentiators between the various methods is the timeliness, completeness and consistency of the data. Timeliness refers to the urgency with which data need to be available in the target system(s), completeness refers to the degree to which the data are ready to be sent, while consistency refers to the degree the data from the source are consistent with the data from the target systems.

Based on their characteristics integrations seem to favor push methods while data migrations and data warehousing the pull methods, though which method suits the best depends entirely on the business needs under consideration.

Previous Post <<||>> Next Post

24 December 2018

🔭Data Science: Models (Just the Quotes)

"A model, like a novel, may resonate with nature, but it is not a ‘real’ thing. Like a novel, a model may be convincing - it may ‘ring true’ if it is consistent with our experience of the natural world. But just as we may wonder how much the characters in a novel are drawn from real life and how much is artifice, we might ask the same of a model: How much is based on observation and measurement of accessible phenomena, how much is convenience? Fundamentally, the reason for modeling is a lack of full access, either in time or space, to the phenomena of interest." (Kenneth Belitz, Science, Vol. 263, 1944)

"The principle of complementarity states that no single model is possible which could provide a precise and rational analysis of the connections between these phenomena [before and after measurement]. In such a case, we are not supposed, for example, to attempt to describe in detail how future phenomena arise out of past phenomena. Instead, we should simply accept without further analysis the fact that future phenomena do in fact somehow manage to be produced, in a way that is, however, necessarily beyond the possibility of a detailed description. The only aim of a mathematical theory is then to predict the statistical relations, if any, connecting the phenomena." (David Bohm, "A Suggested Interpretation of the Quantum Theory in Terms of ‘Hidden’ Variables", 1952)

"Consistency and completeness can also be characterized in terms of models: a theory T is consistent if and only if it has at least one model; it is complete if and only if every sentence of T which is satified in one model is also satisfied in any other model of T. Two theories T1 and T2 are said to be compatible if they have a common consistent extension; this is equivalent to saying that the union of T1 and T2 is consistent." (Alfred Tarski et al, "Undecidable Theories", 1953)

"The sciences do not try to explain, they hardly even try to interpret, they mainly make models. By a model is meant a mathematical construct which, with the addition of certain verbal interpretations, describes observed phenomena. The justification of such a mathematical construct is solely and precisely that it is expected to work" (John Von Neumann, "Method in the Physical Sciences", 1955)

"[…] no models are [true] = not even the Newtonian laws. When you construct a model you leave out all the details which you, with the knowledge at your disposal, consider inessential. […] Models should not be true, but it is important that they are applicable, and whether they are applicable for any given purpose must of course be investigated. This also means that a model is never accepted finally, only on trial." (Georg Rasch, "Probabilistic Models for Some Intelligence and Attainment Tests", 1960)

"[...] the null-hypothesis models [...] share a crippling flaw: in the real world the null hypothesis is almost never true, and it is usually nonsensical to perform an experiment with the sole aim of rejecting the null hypothesis." (Jum Nunnally, "The place of statistics in psychology", Educational and Psychological Measurement 20, 1960)

"If one technique of data analysis were to be exalted above all others for its ability to be revealing to the mind in connection with each of many different models, there is little doubt which one would be chosen. The simple graph has brought more information to the data analyst’s mind than any other device. It specializes in providing indications of unexpected phenomena." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics Vol. 33 (1), 1962)

"A model is essentially a calculating engine designed to produce some output for a given input." (Richard C Lewontin, "Models, Mathematics and Metaphors", Synthese, Vol. 15, No. 2, 1963)

"The usefulness of the models in constructing a testable theory of the process is severely limited by the quickly increasing number of parameters which must be estimated in order to compare the predictions of the models with empirical results" (Anatol Rapoport, "Prisoner's Dilemma: A study in conflict and cooperation", 1965)

"The validation of a model is not that it is 'true' but that it generates good testable hypotheses relevant to important problems." (Richard Levins, "The Strategy of Model Building in Population Biology", 1966)

"Models are to be used, but not to be believed." (Henri Theil, "Principles of Econometrics", 1971)

"A theory has only the alternative of being right or wrong. A model has a third possibility: it may be right, but irrelevant." (Manfred Eigen, 1973)

"A model is an abstract description of the real world. It is a simple representation of more complex forms, processes and functions of physical phenomena and ideas." (Moshe F Rubinstein & Iris R Firstenberg, "Patterns of Problem Solving", 1975)

"A model is an attempt to represent some segment of reality and explain, in a simplified manner, the way the segment operates." (E Frank Harrison, "The managerial decision-making process", 1975)

"The value of a model lies in its substitutability for the real system for achieving an intended purpose." (David I Cleland & William R King, "Systems analysis and project management" , 1975)

"For the theory-practice iteration to work, the scientist must be, as it were, mentally ambidextrous; fascinated equally on the one hand by possible meanings, theories, and tentative models to be induced from data and the practical reality of the real world, and on the other with the factual implications deducible from tentative theories, models and hypotheses." (George E P Box, "Science and Statistics", Journal of the American Statistical Association 71, 1976)

"Mathematical models are more precise and less ambiguous than quantitative models and are therefore of greater value in obtaining specific answers to certain managerial questions." (Henry L Tosi & Stephen J Carrol, "Management", 1976)

"The aim of the model is of course not to reproduce reality in all its complexity. It is rather to capture in a vivid, often formal, way what is essential to understanding some aspect of its structure or behavior." (Joseph Weizenbaum, "Computer power and human reason: From judgment to calculation" , 1976)

"Models, of course, are never true, but fortunately it is only necessary that they be useful. For this it is usually needful only that they not be grossly wrong. I think rather simple modifications of our present models will prove adequate to take account of most realities of the outside world. The difficulties of computation which would have been a barrier in the past need not deter us now." (George E P Box, "Some Problems of Statistics and Everyday Life", Journal of the American Statistical Association, Vol. 74 (365), 1979)

"The purpose of models is not to fit the data but to sharpen the questions." (Samuel Karlin, 1983)

"The connection between a model and a theory is that a model satisfies a theory; that is, a model obeys those laws of behavior that a corresponding theory explicity states or which may be derived from it. [...] Computers make possible an entirely new relationship between theories and models. [...] A theory written in the form of a computer program is [...] both a theory and, when placed on a computer and run, a model to which the theory applies." (Joseph Weizenbaum, "Computer Power and Human Reason", 1984)

“There are those who try to generalize, synthesize, and build models, and there are those who believe nothing and constantly call for more data. The tension between these two groups is a healthy one; science develops mainly because of the model builders, yet they need the second group to keep them honest.” (Andrew Miall, “Principles of Sedimentary Basin Analysis”, 1984)

"Competent scientists do not believe their own models or theories, but rather treat them as convenient fictions. [...] The issue to a scientist is not whether a model is true, but rather whether there is another whose predictive power is enough better to justify movement from today’s fiction to a new one." (Steve Vardeman, "Comment", Journal of the American Statistical Association 82, 1987)

"Models are often used to decide issues in situations marked by uncertainty. However statistical differences from data depend on assumptions about the process which generated these data. If the assumptions do not hold, the inferences may not be reliable either. This limitation is often ignored by applied workers who fail to identify crucial assumptions or subject them to any kind of empirical testing. In such circumstances, using statistical procedures may only compound the uncertainty." (David A Greedman & William C Navidi, "Regression Models for Adjusting the 1980 Census", Statistical Science Vol. 1 (1), 1986)

"The fact that [the model] is an approximation does not necessarily detract from its usefulness because models are approximations. All models are wrong, but some are useful." (George Box, 1987)

"A theory is a good theory if it satisfies two requirements: it must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations." (Stephen Hawking, "A Brief History of Time: From Big Bang To Black Holes", 1988)

"[…] no good model ever accounted for all the facts, since some data was bound to be misleading if not plain wrong. A theory that did fit all the data would have been ‘carpentered’ to do this and would thus be open to suspicion." (Francis H C Crick, "What Mad Pursuit: A Personal View of Scientific Discovery", 1988)

"A model is generally more believable if it can predict what will happen, rather than 'explain' something that has already occurred. […] Model building is not so much the safe and cozy codification of what we are confident about as it is a means of orderly speculation." (James R Thompson, "Empirical Model Building", 1989)

"Model is used as a theory. It becomes theory when the purpose of building a model is to understand the mechanisms involved in the developmental process. Hence as theory, model does not carve up or change the world, but it explains how change takes place and in what way or manner. This leads to build change in the structures." (Laxmi K Patnaik, "Model Building in Political Science", The Indian Journal of Political Science Vol. 50 (2), 1989)

"When evaluating a model, at least two broad standards are relevant. One is whether the model is consistent with the data. The other is whether the model is consistent with the ‘real world’." (Kenneth A Bollen, "Structural Equations with Latent Variables", 1989)

"Statistical models are sometimes misunderstood in epidemiology. Statistical models for data are never true. The question whether a model is true is irrelevant. A more appropriate question is whether we obtain the correct scientific conclusion if we pretend that the process under study behaves according to a particular statistical model." (Scott Zeger, "Statistical reasoning in epidemiology", American Journal of Epidemiology, 1991)

"No one has ever shown that he or she had a free lunch. Here, of course, 'free lunch' means 'usefulness of a model that is locally easy to make inferences from'. (John Tukey, "Issues relevant to an honest account of data-based inference, partially in the light of Laurie Davies’ paper", 1993)

"Model building is the art of selecting those aspects of a process that are relevant to the question being asked. As with any art, this selection is guided by taste, elegance, and metaphor; it is a matter of induction, rather than deduction. High science depends on this art." (John H Holland, "Hidden Order: How Adaptation Builds Complexity", 1995)

"So we pour in data from the past to fuel the decision-making mechanisms created by our models, be they linear or nonlinear. But therein lies the logician's trap: past data from real life constitute a sequence of events rather than a set of independent observations, which is what the laws of probability demand. [...] It is in those outliers and imperfections that the wildness lurks." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"A good model makes the right strategic simplifications. In fact, a really good model is one that generates a lot of understanding from focusing on a very small number of causal arrows." (Robert M Solow, "How Did Economics Get That Way and What Way Did It Get?", Daedalus, Vol. 126, No. 1, 1997)

"A model is a deliberately simplified representation of a much more complicated situation. […] The idea is to focus on one or two causal or conditioning factors, exclude everything else, and hope to understand how just these aspects of reality work and interact." (Robert M Solow, "How Did Economics Get That Way and What Way Did It Get?", Daedalus, Vol. 126, No. 1, 1997)

"We do not learn much from looking at a model - we learn more from building the model and manipulating it. Just as one needs to use or observe the use of a hammer in order to really understand its function, similarly, models have to be used before they will give up their secrets. In this sense, they have the quality of a technology - the power of the model only becomes apparent in the context of its use." (Margaret Morrison & Mary S Morgan, "Models as mediating instruments", 1999)

"Building statistical models is just like this. You take a real situation with real data, messy as this is, and build a model that works to explain the behavior of real data." (Martha Stocking, New York Times, 2000)

"As I left consulting to go back to the university, these were the perceptions I had about working with data to find answers to problems: (a) Focus on finding a good solution–that’s what consultants get paid for. (b) Live with the data before you plunge into modelling. (c) Search for a model that gives a good solution, either algorithmic or data. (d) Predictive accuracy on test sets is the criterion for how good the model is. (e) Computers are an indispensable partner." (
Leo Breiman, "Statistical Modeling: The Two Cultures", Statistical Science Vol. 16(3), 2001)

"The goals in statistics are to use data to predict and to get information about the underlying data mechanism. Nowhere is it written on a stone tablet what kind of model should be used to solve problems involving data. To make my position clear, I am not against models per se. In some situations they are the most appropriate way to solve the problem. But the emphasis needs to be on the problem and on the data. Unfortunately, our field has a vested interest in models, come hell or high water." (Leo Breiman, "Statistical Modeling: The Two Cultures, Statistical Science" Vol. 16(3), 2001)

"The point of a model is to get useful information about the relation between the response and predictor variables. Interpretability is a way of getting information. But a model does not have to be simple to provide reliable information about the relation between predictor and response variables; neither does it have to be a data model. The goal is not interpretability, but accurate information." (Leo Breiman, "Statistical Modeling: The Two Cultures, Statistical Science" Vol. 16(3), 2001)

"A good way to evaluate a model is to look at a visual representation of it. After all, what is easier to understand - a table full of mathematical relationships or a graphic displaying a decision tree with all of its splits and branches?" (Seth Paul et al. "Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis", 2002)

"Models can be viewed and used at three levels. The first is a model that fits the data. A test of goodness-of-fit operates at this level. This level is the least useful but is frequently the one at which statisticians and researchers stop. For example, a test of a linear model is judged good when a quadratic term is not significant. A second level of usefulness is that the model predicts future observations. Such a model has been called a forecast model. This level is often required in screening studies or studies predicting outcomes such as growth rate. A third level is that a model reveals unexpected features of the situation being described, a structural model, [...] However, it does not explain the data." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"Ockham's Razor in statistical analysis is used implicitly when models are embedded in richer models -for example, when testing the adequacy of a linear model by incorporating a quadratic term. If the coefficient of the quadratic term is not significant, it is dropped and the linear model is assumed to summarize the data adequately." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"A smaller model with fewer covariates has two advantages: it might give better predictions than a big model and it is more parsimonious (simpler). Generally, as you add more variables to a regression, the bias of the predictions decreases and the variance increases. Too few covariates yields high bias; this called underfitting. Too many covariates yields high variance; this called overfitting. Good predictions result from achieving a good balance between bias and variance. […] finding a good model involves trading of fit and complexity." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"[…] studying methods for parametric models is useful for two reasons. First, there are some cases where background knowledge suggests that a parametric model provides a reasonable approximation. […] Second, the inferential concepts for parametric models provide background for understanding certain nonparametric methods." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"I have often thought that outliers contain more information than the model." (Arnold Goodman, [Joint Statistical Meetings] 2005)

"Sometimes the most important fit statistic you can get is ‘convergence not met’ - it can tell you something is wrong with your model." (Oliver Schabenberger, "Applied Statistics in Agriculture Conference", 2006)

"Effective models require a real world that has enough structure so that some of the details can be ignored. This implies the existence of solid and stable building blocks that encapsulate key parts of the real system’s behavior. Such building blocks provide enough separation from details to allow modeling to proceed."(John H. Miller & Scott E. Page, "Complex Adaptive Systems: An Introduction to Computational Models of Social Life", 2007)

"In science we try to explain reality by using models (theories). This is necessary because reality itself is too complex. So we need to come up with a model for that aspect of reality we want to understand – usually with the help of mathematics. Of course, these models or theories can only be simplifications of that part of reality we are looking at. A model can never be a perfect description of reality, and there can never be a part of reality perfectly mirroring a model." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is also inevitable for any model or theory to have an uncertainty (a difference between model and reality). Such uncertainties apply both to the numerical parameters of the model and to the inadequacy of the model as well. Because it is much harder to get a grip on these types of uncertainties, they are disregarded, usually." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Outliers or flyers are those data points in a set that do not quite fit within the rest of the data, that agree with the model in use. The uncertainty of such an outlier is seemingly too small. The discrepancy between outliers and the model should be subject to thorough examination and should be given much thought. Isolated data points, i.e., data points that are at some distance from the bulk of the data are not outliers if their values are in agreement with the model in use." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"What should be the distribution of random effects in a mixed model? I think Gaussian is just fine, unless you are trying to write a journal paper." (Terry Therneau, "Speaking at useR", 2007)

"You might say that there’s no reason to bother with model checking since all models are false anyway. I do believe that all models are false, but for me the purpose of model checking is not to accept or reject a model, but to reveal aspects of the data that are not captured by the fitted model." (Andrew Gelman, "Some thoughts on the sociology of statistics", 2007)

"A model is a good model if it:1. Is elegant 2. Contains few arbitrary or adjustable elements 3. Agrees with and explains all existing observations 4. Makes detailed predictions about future observations that can disprove or falsify the model if they are not borne out." (Stephen Hawking & Leonard Mlodinow, "The Grand Design", 2010)

"In other words, the model is terrific in all ways other than the fact that it is totally useless. So why did we create it? In short, because we could: we have a data set, and a statistical package, and add the former to the latter, hit a few buttons and voila, we have another paper." (Andew J Vickers & Angel M Cronin, "Everything you always wanted to know about evaluating prediction models (but were too afraid to ask)", Urology 76(6), 2010)

"Darn right, graphs are not serious. Any untrained, unsophisticated, non-degree-holding civilian can display data. Relying on plots is like admitting you do not need a statistician. Show pictures of the numbers and let people make their own judgments? That can be no better than airing your statistical dirty laundry. People need guidance; they need to be shown what the data are supposed to say. Graphics cannot do that; models can." (William M Briggs, Comment, Journal of Computational and Graphical Statistics Vol. 20(1), 2011)

"In general, when building statistical models, we must not forget that the aim is to understand something about the real world. Or predict, choose an action, make a decision, summarize evidence, and so on, but always about the real world, not an abstract mathematical world: our models are not the reality - a point well made by George Box in his oft-cited remark that "all models are wrong, but some are useful". (David Hand, "Wonderful examples, but let's not close our eyes", Statistical Science 29, 2014)

"Things which ought to be expected can seem quite extraordinary if you’ve got the wrong model." (David Hand, "Significance", 2014)

"It is important to remember that predictive data analytics models built using machine learning techniques are tools that we can use to help make better decisions within an organization and are not an end in themselves. It is paramount that, when tasked with creating a predictive model, we fully understand the business problem that this model is being constructed to address and ensure that it does address it." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"The crucial concept that brings all of this together is one that is perhaps as rich and suggestive as that of a paradigm: the concept of a model. Some models are concrete, others are abstract. Certain models are fairly rigid; others are left somewhat unspecified. Some models are fully integrated into larger theories; others, or so the story goes, have a life of their own. Models of experiment, models of data, models in simulations, archeological modeling, diagrammatic reasoning, abductive inferences; it is difficult to imagine an area of scientific investigation, or established strategies of research, in which models are not present in some form or another. However, models are ultimately understood, there is no doubt that they play key roles in multiple areas of the sciences, engineering, and mathematics, just as models are central to our understanding of the practices of these fields, their history and the plethora of philosophical, conceptual, logical, and cognitive issues they raise." (Otávio Bueno, [in" Springer Handbook of Model-Based Science", Ed. by Lorenzo Magnani & Tommaso Bertolotti, 2017])

"The different classes of models have a lot to learn from each other, but the goal of full integration has proven counterproductive. No model can be all things to all people." (Olivier Blanchard, "On the future of macroeconomic models", Oxford Review of Economic Policy Vol. 34 (1–2), 2018)

"Bad data makes bad models. Bad models instruct people to make ineffective or harmful interventions. Those bad interventions produce more bad data, which is fed into more bad models." (Cory Doctorow, "Machine Learning’s Crumbling Foundations", 2021)

"On a final note, we would like to stress the importance of design, which often does not receive the attention it deserves. Sometimes, the large number of modeling options for spatial analysis may raise the false impression that design does not matter, and that a sophisticated analysis takes care of everything. Nothing could be further from the truth." (Hans-Peter Piepho et al, "Two-dimensional P-spline smoothing for spatial analysis of plant breeding trials", “Biometrical Journal”, 2022)

08 November 2018

🔭Data Science - Consistency (Just the Quotes)

"To be useful data must be consistent - they must reflect periodic recordings of the value of the variable or at least possess logical internal connections. The definition of the variable under consideration cannot change during the period of measurement or enumeration. Also, if the data are to be valuable, they must be relevant to the question to be answered." (Cecil H Meyers, "Handbook of Basic Graphs: A modern approach", 1970)

"The word theory, as used in the natural sciences, doesn’t mean an idea tentatively held for purposes of argument - that we call a hypothesis. Rather, a theory is a set of logically consistent abstract principles that explain a body of concrete facts. It is the logical connections among the principles and the facts that characterize a theory as truth. No one element of a theory [...] can be changed without creating a logical contradiction that invalidates the entire system. Thus, although it may not be possible to substantiate directly a particular principle in the theory, the principle is validated by the consistency of the entire logical structure." (Alan Cromer, "Uncommon Sense: The Heretical Nature of Science", 1993)

"For a given dataset there is not a great deal of advice which can be given on content and context. hose who know their own data should know best for their specific purposes. It is advisable to think hard about what should be shown and to check with others if the graphic makes the desired impression. Design should be let to designers, though some basic guidelines should be followed: consistency is important (sets of graphics should be in similar style and use equivalent scaling); proximity is helpful (place graphics on the same page, or on the facing page, of any text that refers to them); and layout should be checked (graphics should be neither too small nor too large and be attractively positioned relative to the whole page or display)."(Antony Unwin, "Good Graphics?" [in "Handbook of Data Visualization"], 2008)

"It is the consistency of the information that matters for a good story, not its completeness. Indeed, you will often find that knowing little makes it easier to fit everything you know into a coherent pattern." (Daniel Kahneman, "Thinking, Fast and Slow", 2011)

"Accuracy and coherence are related concepts pertaining to data quality. Accuracy refers to the comprehensiveness or extent of missing data, performance of error edits, and other quality assurance strategies. Coherence is the degree to which data - item value and meaning are consistent over time and are comparable to similar variables from other routinely used data sources." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"How good the data quality is can be looked at both subjectively and objectively. The subjective component is based on the experience and needs of the stakeholders and can differ by who is being asked to judge it. For example, the data managers may see the data quality as excellent, but consumers may disagree. One way to assess it is to construct a survey for stakeholders and ask them about their perception of the data via a questionnaire. The other component of data quality is objective. Measuring the percentage of missing data elements, the degree of consistency between records, how quickly data can be retrieved on request, and the percentage of incorrect matches on identifiers (same identifier, different social security number, gender, date of birth) are some examples." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"The danger of overfitting is particularly severe when the training data is not a perfect gold standard. Human class annotations are often subjective and inconsistent, leading boosting to amplify the noise at the expense of the signal. The best boosting algorithms will deal with overfitting though regularization. The goal will be to minimize the number of non-zero coefficients, and avoid large coefficients that place too much faith in any one classifier in the ensemble." (Steven S Skiena, "The Data Science Design Manual", 2017)

"There are other problems with Big Data. In any large data set, there are bound to be inconsistencies, misclassifications, missing data - in other words, errors, blunders, and possibly lies. These problems with individual items occur in any data set, but they are often hidden in a large mass of numbers even when these numbers are generated out of computer interactions." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

19 January 2017

⛏️Data Management: Consistency (Definitions)

"The degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component." (IEEE," IEEE Standard Glossary of Software Engineering Terminology", 1990)

"Describes whether or not master data is defined and used across all IT systems in a consistent manner." (Allen Dreibelbis et al, "Enterprise Master Data Management", 2008)

"The requirement that a transaction should leave the database in a consistent state. If a transaction would put the database in an inconsistent state, the transaction is canceled." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"The degree to which one set of attribute values match another attribute set within the same row or record (record-level consistency), within another attribute set in a different record (cross-record consistency), or within the same record at different points in time (temporal consistency)." (DAMA International, "The DAMA Dictionary of Data Management" 1st Ed., 2010)

"Consistency is a dimension of data quality. As used in the DQAF, consistency can be thought of as the absence of variety or change. Consistency is the degree to which data conform to an equivalent set of data, usually a set produced under similar conditions or a set produced by the same process over time." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"The degree to which data values are equivalent across redundant databases. With regard to transactions, consistency refers to the state of the data both before and after the transaction is executed. A transaction maintains the consistency of the state of the data. In other words, after a transaction is run, all data in the database is 'correct' (the C in ACID)." (Craig S Mullins, "Database Administration", 2012)

"Agreement of several versions of the data related to the same real objects, which are stored in various information systems." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"Consistency: agreement of several versions of the data related to the same real objects, which are stored in various information systems." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"The degree to which the data reflects the definition of the data. An example is the person name field, which represents either a first name, last name, or a combination of first name and last name." (Piethein Strengholt, "Data Management at Scale", 2020)

"The degree to which the model is free of logical or semantic contradictions." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)

"The degree of data being free of contradictions." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"The degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a component or system." [IEEE 610]

17 December 2014

🕸Systems Engineering: Coherence (Just the Quotes)

"Principles taken upon trust, consequences lamely deduced from them, want of coherence in the parts, and of evidence in the whole, these are every where to be met with in the systems of the most eminent philosophers, and seem to have drawn disgrace upon philosophy itself." (David Hume, "A Treatise of Human Nature", 1739-40)

"A system is said to be coherent if every fact in the system is related every other fact in the system by relations that are not merely conjunctive. A deductive system affords a good example of a coherent system." (Lizzie S Stebbing, "A modern introduction to logic", 1930)

"Even these humble objects reveal that our reality is not a mere collocation of elemental facts, but consists of units in which no part exists by itself, where each part points beyond itself and implies a larger whole. Facts and significance cease to be two concepts belonging to different realms, since a fact is always a fact in an intrinsically coherent whole. We could solve no problem of organization by solving it for each point separately, one after the other; the solution had to come for the whole. Thus we see how the problem of significance is closely bound up with the problem of the relation between the whole and its parts. It has been said: The whole is more than the sum of its parts. It is more correct to say that the whole is something else than the sum of its parts, because summing is a meaningless procedure, whereas the whole-part relationship is meaningful." (Kurt Koffka, "Principles of Gestalt Psychology", 1935)

"[…] reality is a system, completely ordered and fully intelligible, with which thought in its advance is more and more identifying itself. We may look at the growth of knowledge […] as an attempt by our mind to return to union with things as they are in their ordered wholeness. […] and if we take this view, our notion of truth is marked out for us. Truth is the approximation of thought to reality […] Its measure is the distance thought has travelled […] toward that intelligible system […] The degree of truth of a particular proposition is to be judged in the first instance by its coherence with experience as a whole, ultimately by its coherence with that further whole, all comprehensive and fully articulated, in which thought can come to rest." (Brand Blanshard, "The Nature of Thought" Vol. II, 1939)

"We cannot define truth in science until we move from fact to law. And within the body of laws in turn, what impresses us as truth is the orderly coherence of the pieces. They fit together like the characters of a great novel, or like the words of a poem. Indeed, we should keep that last analogy by us always, for science is a language, and like a language it defines its parts by the way they make up a meaning. Every word in a sentence has some uncertainty of definition, and yet the sentence defines its own meaning and that of its words conclusively. It is the internal unity and coherence of science which gives it truth, and which makes it a better system of prediction than any less orderly language." (Jacob Bronowski, "The Common Sense of Science", 1953)

"In our definition of system we noted that all systems have interrelationships between objects and between their attributes. If every part of the system is so related to every other part that any change in one aspect results in dynamic changes in all other parts of the total system, the system is said to behave as a whole or coherently. At the other extreme is a set of parts that are completely unrelated: that is, a change in each part depends only on that part alone. The variation in the set is the physical sum of the variations of the parts. Such behavior is called independent or physical summativity." (Arthur D Hall & Robert E Fagen, "Definition of System", General Systems Vol. 1, 1956)

"The essential vision of reality presents us not with fugitive appearances but with felt patterns of order which have coherence and meaning for the eye and for the mind. Symmetry, balance and rhythmic sequences express characteristics of natural phenomena: the connectedness of nature - the order, the logic, the living process. Here art and science meet on common ground." (Gyorgy Kepes, "The New Landscape: In Art and Science", 1956)

"Within the confines of my abstraction, for instance, it is clear that the problem of truth and validity cannot be solved completely, if what we mean by the truth of an image is its correspondence with some reality in the world outside it. The difficulty with any correspondence theory of truth is that images can only be compared with images. They can never be compared with any outside reality. The difficulty with the coherence theory of truth, on the other hand, is that the coherence or consistency of the image is simply not what we mean by its truth." (Kenneth E Boulding, "The Image: Knowledge in life and society", 1956)

"Self-organization can be defined as the spontaneous creation of a globally coherent pattern out of local interactions. Because of its distributed character, this organization tends to be robust, resisting perturbations. The dynamics of a self-organizing system is typically non-linear, because of circular or feedback relations between the components. Positive feedback leads to an explosive growth, which ends when all components have been absorbed into the new configuration, leaving the system in a stable, negative feedback state. Non-linear systems have in general several stable states, and this number tends to increase (bifurcate) as an increasing input of energy pushes the system farther from its thermodynamic equilibrium." (Francis Heylighen, "The Science Of Self-Organization And Adaptivity", 1970)

"To adapt to a changing environment, the system needs a variety of stable states that is large enough to react to all perturbations but not so large as to make its evolution uncontrollably chaotic. The most adequate states are selected according to their fitness, either directly by the environment, or by subsystems that have adapted to the environment at an earlier stage. Formally, the basic mechanism underlying self-organization is the (often noise-driven) variation which explores different regions in the system’s state space until it enters an attractor. This precludes further variation outside the attractor, and thus restricts the freedom of the system’s components to behave independently. This is equivalent to the increase of coherence, or decrease of statistical entropy, that defines self-organization." (Francis Heylighen, "The Science Of Self-Organization And Adaptivity", 1970)

"Early scientific thinking was holistic, but speculative - the modern scientific temper reacted by being empirical, but atomistic. Neither is free from error, the former because it replaces factual inquiry with faith and insight, and the latter because it sacrifices coherence at the altar of facticity. We witness today another shift in ways of thinking: the shift toward rigorous but holistic theories. This means thinking in terms of facts and events in the context of wholes, forming integrated sets with their own properties and relationships."(Ervin László, "Introduction to Systems Philosophy", 1972)

"When loops are present, the network is no longer singly connected and local propagation schemes will invariably run into trouble. [...] If we ignore the existence of loops and permit the nodes to continue communicating with each other as if the network were singly connected, messages may circulate indefinitely around the loops and process may not converges to a stable equilibrium. […] Such oscillations do not normally occur in probabilistic networks […] which tend to bring all messages to some stable equilibrium as time goes on. However, this asymptotic equilibrium is not coherent, in the sense that it does not represent the posterior probabilities of all nodes of the network." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference", 1988)

"There are a variety of swarm topologies, but the only organization that holds a genuine plurality of shapes is the grand mesh. In fact, a plurality of truly divergent components can only remain coherent in a network. No other arrangement-chain, pyramid, tree, circle, hub-can contain true diversity working as a whole. This is why the network is nearly synonymous with democracy or the market." (Kevin Kelly, "Out of Control: The New Biology of Machines, Social Systems and the Economic World", 1995)

"Falling between order and chaos, the moment of complexity is the point at which self-organizing systems emerge to create new patterns of coherence and structures of behaviour." (Mark C Taylor, "The Moment of Complexity: Emerging Network Culture", 2001)

"The word 'coherence' literally means holding or sticking together, but it is usually used to refer to a system, an idea, or a worldview whose parts fit together in a consistent and efficient way. Coherent things work well: A coherent worldview can explain almost anything, while an incoherent worldview is hobbled by internal contradictions. [...] Whenever a system can be analyzed at multiple levels, a special kind of coherence occurs when the levels mesh and mutually interlock." (Jonathan Haidt,"The Happiness Hypothesis: Finding Modern Truth in Ancient Wisdom", 2006)

"A system is an interconnected set of elements that is coherently organized in a way that achieves something." (Donella H Meadows, "Thinking in Systems: A Primer", 2008)

"A worldview must be coherent, logical and adequate. Coherence means that the fundamental ideas constituting the worldview must be seen as proceeding from a single, unifying, overarching concept. A logical worldview means simply that the various ideas constituting it should not be contradictory. Adequate means that it is capable of explaining, logically and coherently, every element of contemporary experience." (M G Jackson, "Transformative Learning for a New Worldview: Learning to Think Differently", 2008)

"Each systems archetype embodies a particular theory about dynamic behavior that can serve as a starting point for selecting and formulating raw data into a coherent set of interrelationships. Once those relationships are made explicit and precise, the 'theory' of the archetype can then further guide us in our data-gathering process to test the causal relationships through direct observation, data analysis, or group deliberation." (Daniel H Kim, "Systems Archetypes as Dynamic Theories", The Systems Thinker Vol. 24 (1), 2013)

"Even more important is the way complex systems seem to strike a balance between the need for order and the imperative for change. Complex systems tend to locate themselves at a place we call 'the edge of chaos'. We imagine the edge of chaos as a place where there is enough innovation to keep a living system vibrant, and enough stability to keep it from collapsing into anarchy. It is a zone of conflict and upheaval, where the old and new are constantly at war. Finding the balance point must be a delicate matter - if a living system drifts too close, it risks falling over into incoherence and dissolution; but if the system moves too far away from the edge, it becomes rigid, frozen, totalitarian. Both conditions lead to extinction. […] Only at the edge of chaos can complex systems flourish. This threshold line, that edge between anarchy and frozen rigidity, is not a like a fence line, it is a fractal line; it possesses nonlinearity. (Stephen H Buhner, "Plant Intelligence and the Imaginal Realm: Beyond the Doors of Perception into the Dreaming of Earth", 2014)

"The work around the complex systems map supported a concentration on causal mechanisms. This enabled poor system responses to be diagnosed as the unanticipated effects of previous policies as well as identification of the drivers of the sector. Understanding the feedback mechanisms in play then allowed experimentation with possible future policies and the creation of a coherent and mutually supporting package of recommendations for change." (David C Lane et al, "Blending systems thinking approaches for organisational analysis: reviewing child protection", 2015)

18 January 2010

🗄️Data Management: Data Quality Dimensions (Part V: Consistency)

Data Management Series

IEEE defines consistency in general as "the degree of uniformity, standardization, and freedom from contradiction among the documents or parts of a system or component" [4]. In respect to data, consistency can be defined thus as the degree of uniformity and standardization of data values among systems or data repositories (aka cross-system consistencies), records within the same repository (aka cross-record consistency), or within the same record at different points in time (temporal consistency) (see [2], [3]).

Unfortunately, uniformity, standardization and freedom can be considered as data quality dimensions as well and they might even have broader scope than the one provided by consistency. Moreover, the definition requires further definitions for the concept to be understood, which is not ideal.

Simply put, consistency refers to the extent (data) values are consistent in notation, respectively the degree to which the data values across different contexts match. For example, one system uses "Free on Board" while another systems uses "FOB" to refer to the point obligations, costs, and risk involved in the delivery of goods shift from the seller to the buyer. The two systems refer to the same value by two different notations. When the two systems are integrated, because of the different values used, the rules defined in the target system might make the record fail because it is expecting another value. Conversely, other systems may import the value as it is, leading thus to two values used in parallel for the same meaning. This can happen not only to reference data, but also to master data, for example when a value deviates slightly from the expected value (e.g. misspelled). A more special case is when one of the systems uses case sensitive values (usually the target system, though there can be also bidirectional data integrations).

One solution for such situations would be to "standardize" the values across systems, though not all systems allow to easily change the values once they have been set. Another solution would be to create a mapping as part the integration, though to maintain such mappings for many cases is suboptimal, but in the end, it might be the only solution. Further systems can be impacted by these issues as well (e.g. data warehouses, data marts).

It's recommended to use a predefined list of values (LOV) - a data dictionary, an ontology or any other type of knowledge representation form that can be used to 'enforce' data consistency. ‘Enforce’ is maybe not the best term because the two data sets could be disconnected from each other, being in Users’ responsibility to ensure the overall consistency, or the two data sets could be integrated using specific techniques. In many cases is checked the consistency of the values taken by one attribute against an existing LOV, though for example for data formed from multiple segments (e.g. accounts) each segment might need to be checked against a specific data set or rule generator, such mechanisms implying multi-attribute mappings or associational rules that specify the possible values.

As highlighted also by [1], there are two aspects of consistency: the structural consistency in which two or more values can be distinct in notation but have the same meaning (e.g. missing vs. n/a), and semantic consistency in which each value has a unique meaning (e.g. only n/a is allowed to highlight missing values). It should be targeted to have the data semantically consistent, to avoid confusion, accidental exclusion of data during filtering or reporting. More and more organizations are investing in ontologies, they allow ensuring the semantic consistency of concepts/entities, though for most of the cases simple single or multi-attribute lists of values are enough.

Previous Post <<||>> Next Post

Written: Jan-2010, Last Reviewed: Mar-2024

References:
[1] Chapman A.D. (2005) "Principles of Data Quality", version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen
[2] David Loshin (2009) Master Data Management

[3] IEEE (1990) "IEEE Standard Glossary of Software Engineering Terminology"

[4] DAMA International (2010) "The DAMA Dictionary of Data Management" 1st Ed.

16 July 2009

🛢DBMS: Referential Integrity (Definitions)

"The rules governing data consistency, specifically the relationships among the primary keys and foreign keys of different tables. SQL Server addresses referential integrity with user-defined triggers." (Karen Paulsell et al, "Sybase SQL Server: Performance and Tuning Guide", 1996)

"When a table has relationships with other tables, they are linked on a field (or group of fields). Referential integrity ensures that the copy of the key field kept in one table matches the key field in the other." (Owen Williams, "MCSE TestPrep: SQL Server 6.5 Design and Implementation", 1998)

"An integrity mechanism that ensures that vital data in a database, such as the unique identifier for a given piece of data, remains accurate and usable as the database changes. Referential integrity involves managing corresponding data values between tables when the foreign key of a table contains the same values as the primary key of another table." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Mandatory condition in a data warehouse where all the keys in the fact tables are legitimate foreign keys relative to the dimension tables. In other words, all the fact key components are subsets of the primary keys found in the dimension tables at all times." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit 2nd Ed ", 2002)

"A state in which all foreign key values in a database are valid." (Anthony Sequeira & Brian Alderman, "The SQL Server 2000 Book", 2003)

"A method employed by a relational database system that enforces one-to-many relationships between tables." (Bob Bryla, "Oracle Database Foundations", 2004)

"A feature of some database systems that ensures that any record stored in the database is supported by accurate primary and foreign keys." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"The facility of a DBMS to ensure the validity of predefined relationships." (William H Inmon, "Building the Data Warehouse", 2005)

"A process (usually contained within a relational database model) of validation between related primary and foreign key field values. For example, a foreign key value cannot be added to a table unless the related primary key value exists in the parent table. Similarly, deleting a primary key value necessitates removing all records in subsidiary tables, containing that primary key value in foreign key fields. Additionally, it follows that preventing the deletion of a primary key record is not allowed if a foreign key exists elsewhere." (Gavin Powell, "Beginning Database Design", 2006)

"The assurance that a reference from one entity to another entity is valid. If entity A references entity B, entity B exists. If entity B is removed, all references to entity B must also be removed." (Pramod J Sadalage & Scott W Ambler, "Refactoring Databases: Evolutionary Database Design", 2006)

"Relational database integrity that dictates that all foreign key values in a child table must have a corresponding matching primary key value in the parent table." (Marilyn Miller-White et al, "MCITP Administrator: Microsoft SQL Server 2005 Optimization and Maintenance 70-444", 2007)

"The referential integrity imposes the constraint that if a foreign key exists in a relation, either the foreign key value must match a candidate key value of some tuple in its home relation or the foreign key value must be wholly null." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"A set of rules, enforced by the database server, the user’s application, or both, that protects the quality and consistency of information stored in the database." (Robert D Schneider & Darril Gibson, "Microsoft SQL Server 2008 All-in-One Desk Reference For Dummies", 2008)

"Requires that relationships among tables be consistent. For example, foreign key constraints must be satisfied. You cannot accept a transaction until referential integrity is satisfied." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"A constraint on a relation that states that every non-null foreign key value must match an existing primary key value." (Jan L Harrington, "Relational Database Design and Implementation" 3rd Ed., 2009)

"A constraint in a SQL database that requires, for every foreign key instance that exists in a table, that the row (and thus the primary key instance) of the parent table associated with that foreign key instance must also exist in the database." (Toby J Teorey, ", Database Modeling and Design" 4th Ed, 2010)

"A constraint on a relation that states that every non-null foreign key value must reference an existing primary key value." (Jan L Harrington, "SQL Clearly Explained" 3rd Ed., 2010)

"In a relational database, the quality of a table that all its associations are with real instances of other tables." (David C Hay, "Data Model Patterns: A Metadata Map", 2010)

"Refers to two relational tables that are directly related. Referential integrity between related tables is established if non-null values in the foreign key field of the child table are primary key values in the parent table." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A condition by which a dependent table’s foreign key must have either a null entry or a matching entry in the related table. Even though an attribute may not have a corresponding attribute, it is impossible to have an invalid entry." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed, 2011)

"In data management, constraints that govern the relationship of an occurrence of one entity to one or more occurrences of another entity. These constraints may be automatically enforced by the DBMS. For instance, every purchase order must have one and only one customer. If the relationship is represented using a foreign key, then the foreign key is said to reference a file or entity table where the identifier is from the same domain. Having referential integrity means that IF a value exists in the foreign key of the referencing file, then it must exist as a valid identifier in the referenced file or table." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Through the specification of appropriate referential constraints, RI guarantees that an acceptable value is always in each foreign key column." (Craig S Mullins, "Database Administration", 2012)

"Refers to the accuracy and consistency of records, and the assurance that they are genuine and unaltered." (Robert F Smallwood, "Information Governance: Concepts, Strategies, and Best Practices", 2014)

"The process of relating data together in a disciplined manner" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"A requirement that the data in related tables be matched, so that an entry in the 'many' side of the relationship (the foreign key) must have a corresponding entry in the “one” side of the relationship (the primary key)." (Faithe Wempen, "Computing Fundamentals: Introduction to Computers", 2015)

"Refers to the accuracy and consistency of records, and the assurance that they are genuine and unaltered." (Robert F Smallwood, "Information Governance for Healthcare Professionals", 2018)

"The state of a database in which all values of all foreign keys are valid. Maintaining referential integrity requires the enforcement of a referential constraint on all operations that change the data in a table where the referential constraints are defined." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"A rule defined on a key in one table that guarantees that the values in that key match the values in a key in a related table (the referenced value)." (Oracle, "Oracle Database Concepts")

"A state in which all foreign key values in a database are valid. For a foreign key to be valid, it must contain either the value NULL, or an existing key value from the primary or unique key columns referenced by the foreign key." (Microsoft Technet)

"The technique of maintaining data always in a consistent format, part of the ACID philosophy. In particular, data in different tables is kept consistent through the use of foreign key constraints, which can prevent changes from happening or automatically propagate those changes to all related tables. Related mechanisms include the unique constraint, which prevents duplicate values from being inserted by mistake, and the NOT NULL constraint, which prevents blank values from being inserted by mistake." (MySQL, "MySQL 8.0 Reference Manual Glossary")

SQL Troubles

Pages

27 January 2025

🗄️🗒️Data Management: Data Quality Dimensions [Notes]

01 February 2021

📦Data Migrations (DM): Quality Assurance (Part II: Quality Acceptance Criteria II)

13 September 2020

🎓Knowledge Management: Definitions II (What's in a Name)

21 May 2020

📦Data Migrations (DM): In-house Built Solutions (Part II: The Import Layer)

04 May 2019

🧊Data Warehousing: Architecture (Part I: Push vs. Pull)

24 December 2018

🔭Data Science: Models (Just the Quotes)

08 November 2018

🔭Data Science - Consistency (Just the Quotes)

19 January 2017

⛏️Data Management: Consistency (Definitions)

17 December 2014

🕸Systems Engineering: Coherence (Just the Quotes)

18 January 2010

🗄️Data Management: Data Quality Dimensions (Part V: Consistency)

16 July 2009

🛢DBMS: Referential Integrity (Definitions)

About Me