Showing posts with label accuracy. Show all posts
Showing posts with label accuracy. Show all posts

03 October 2023

ERP Implementations: Simplifying the Implementation Project

 

ERP Implementation

ERPimplementations are complex projects and a way to manage their complexity is to attempt reducing their complexity (instead of answering to complexity by complexity). A project implementation’s methodology is probably the most important area that allows project’s simplification, though none of the available methodologies seems to work well with such projects.

The point that differentiates the various methodologies is solution’s conceptualization. In general, the expectation is to have a set of functional design documents (FDDs) that describe how the system operates and that can be used for programming the customizations, if any. The customer must review and sign-off the FDDs before the setup is done, respectively the development starts. Moreover, given the dependencies between documents, they often need to be signed off together.

Unfortunately, FDDs reflect the degree of understanding of the target system and business requirements, gaps that can prove to be a challenge for the parties involved, requiring many iterations until they are brought to the expected quality level. The higher the accuracy considered; the more iterations are needed. FDDs tend to consume a considerable percent of the available financial resources, in extremis the whole budget being exhausted just for 'printed paper'. Moreover, the key users see late in the project the working functionality.

In agile methodologies, FDDs are replaced by user stories, and, if still needed, can be written as part of the sprints or later. Unfortunately, agile methodologies have their own challenges and constraints in ERP implementations. As functionality is explored, understood, and negotiated with the customer during the implementation, it’s seldom possible to provide a realistic cost estimation upfront. Given that most ERP implementations exceed their budget, starting a journey without having an idea how much the project costs seems to be a prohibitive approach for many customers. Moreover, the negotiations have the character of Change Requests, which can easily become a bottleneck for the project.

On the other hand, agile methodologies involve the customer earlier and the development could start earlier as well. The earlier the customer is involved, the earlier the key users understand how the system works, and thus they can be more efficient in performing their activities, respectively in identifying the gaps in understanding, trapping functional issues early in the process, at least in theory. Some projects address this need by having the key user trained, though the training environment usually has a different setup and data than needed by the customer. Wouldn’t be a good idea to have the key users trained in an environment that reflects to a higher or lower degree the customer’s data and setup requirements?

In theory the setup for such an environment can be done upfront based on one standard configuration frequently met in customer’s industry. With this the functional consultants can start to configure the system together with the key users exploring the data and setup existing in the legacy system(s). This would allow increasing on both sides the depth of understanding and has the potential of speeding up the implementation. This can be started in the early phases, during the time in which the requirements are gathered. Ideally, a basic setup can exist already when the requirements are signed off. It’s true that this approach would mean a higher investment upfront, though the impact could be considerable. Excepting Data Migration and customizations the customer already has a good basis for Go-Live.

Of course, there can be further challenges, though the customer can make thus sure that the financial resources are well spent – having a usable system, respectively a good system understanding outweighs by far the extreme alternative of having high-quality unimplemented FDDs!

Previous <<||>> Next

20 December 2018

Data Science: Accuracy (Just the Quotes)

"Accurate and minute measurement seems to the nonscientific imagination a less lofty and dignified work than looking for something new. But nearly all the grandest discoveries of science have been but the rewards of accurate measurement and patient long contained labor in the minute sifting of numerical results." (William T Kelvin, "Report of the British Association For the Advancement of Science" Vol. 41, 1871)

"It is surprising to learn the number of causes of error which enter into the simplest experiment, when we strive to attain rigid accuracy." (William S Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1874)

"The test of the accuracy and completeness of a description is, not that it may assist, but that it cannot mislead." (Burt G Wilder, "A Partial Revision of Anatomical Nomenclature", Science, 1881)

"Accuracy of statement is one of the first elements of truth; inaccuracy is a near kin to falsehood." (Tyron Edwards, "A Dictionary of Thoughts", 1891)

"A statistical estimate may be good or bad, accurate or the reverse; but in almost all cases it is likely to be more accurate than a casual observer’s impression, and the nature of things can only be disproved by statistical methods." (Arthur L Bowley, "Elements of Statistics", 1901)

"Great numbers are not counted correctly to a unit, they are estimated; and we might perhaps point to this as a division between arithmetic and statistics, that whereas arithmetic attains exactness, statistics deals with estimates, sometimes very accurate, and very often sufficiently so for their purpose, but never mathematically exact." (Arthur L Bowley, "Elements of Statistics", 1901)

"Statistics may, for instance, be called the science of counting. Counting appears at first sight to be a very simple operation, which any one can perform or which can be done automatically; but, as a matter of fact, when we come to large numbers, e.g., the population of the United Kingdom, counting is by no means easy, or within the power of an individual; limits of time and place alone prevent it being so carried out, and in no way can absolute accuracy be obtained when the numbers surpass certain limits." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"Accuracy is the foundation of everything else." (Thomas H Huxley, "Method and Results", 1893)

"An experiment is an observation that can be repeated, isolated and varied. The more frequently you can repeat an observation, the more likely are you to see clearly what is there and to describe accurately what you have seen. The more strictly you can isolate an observation, the easier does your task of observation become, and the less danger is there of your being led astray by irrelevant circumstances, or of placing emphasis on the wrong point. The more widely you can vary an observation, the more clearly will be the uniformity of experience stand out, and the better is your chance of discovering laws." (Edward B Titchener, "A Text-Book of Psychology", 1909)

"Science begins with measurement and there are some people who cannot be measurers; and just as we distinguish carpenters who can work to this or that traction of an inch of accuracy, so we must distinguish ourselves and our acquaintances as able to observe and record to this or that degree of truthfulness." (John A Thomson, "Introduction to Science", 1911)

"The ordinary mathematical treatment of any applied science substitutes exact axioms for the approximate results of experience, and deduces from these axioms the rigid mathematical conclusions. In applying this method it must not be forgotten that the mathematical developments transcending the limits of exactness of the science are of no practical value. It follows that a large portion of abstract mathematics remains without finding any practical application, the amount of mathematics that can be usefully employed in any science being in proportion to the degree of accuracy attained in the science. Thus, while the astronomer can put to use a wide range of mathematical theory, the chemist is only just beginning to apply the first derivative, i. e. the rate of change at which certain processes are going on; for second derivatives he does not seem to have found any use as yet." (Felix Klein, "Lectures on Mathematics", 1911)

"It [science] involves an intelligent and persistent endeavor to revise current beliefs so as to weed out what is erroneous, to add to their accuracy, and, above all, to give them such shape that the dependencies of the various facts upon one another may be as obvious as possible." (John Dewey, "Democracy and Education", 1916)

"The man of science, by virtue of his training, is alone capable of realising the difficulties - often enormous - of obtaining accurate data upon which just judgment may be based." (Sir Richard Gregory, "Discovery; or, The Spirit and Service of Science", 1918)

"The complexity of a system is no guarantee of its accuracy." (John P Jordan, "Cost accounting; principles and practice", 1920)

"Science does not aim at establishing immutable truths and eternal dogmas; its aim is to approach the truth by successive approximations, without claiming that at any stage final and complete accuracy has been achieved." (Bertrand Russell, "The ABC of Relativity", 1925)

"Science is but a method. Whatever its material, an observation accurately made and free of compromise to bias and desire, and undeterred by consequence, is science." (Hans Zinsser, "Untheological Reflections", The Atlantic Monthly, 1929)

"The structure of a theoretical system tells us what alternatives are open in the possible answers to a given question. If observed facts of undoubted accuracy will not fit any of the alternatives it leaves open, the system itself is in need of reconstruction." (Talcott Parsons, "The structure of social action", 1937)

"Science, in the broadest sense, is the entire body of the most accurately tested, critically established, systematized knowledge available about that part of the universe which has come under human observation. For the most part this knowledge concerns the forces impinging upon human beings in the serious business of living and thus affecting man’s adjustment to and of the physical and the social world. […] Pure science is more interested in understanding, and applied science is more interested in control […]" (Austin L Porterfield, "Creative Factors in Scientific Research", 1941)

"The enthusiastic use of statistics to prove one side of a case is not open to criticism providing the work is honestly and accurately done, and providing the conclusions are not broader than indicated by the data. This type of work must not be confused with the unfair and dishonest use of both accurate and inaccurate data, which too commonly occurs in business. Dishonest statistical work usually takes the form of: (1) deliberate misinterpretation of data; (2) intentional making of overestimates or underestimates; and (3) biasing results by using partial data, making biased surveys, or using wrong statistical methods." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1951)

"Being built on concepts, hypotheses, and experiments, laws are no more accurate or trustworthy than the wording of the definitions and the accuracy and extent of the supporting experiments." (Gerald Holton, "Introduction to Concepts and Theories in Physical Science", 1952)

"Scientists whose work has no clear, practical implications would want to make their decisions considering such things as: the relative worth of (1) more observations, (2) greater scope of his conceptual model, (3) simplicity, (4) precision of language, (5) accuracy of the probability assignment." (C West Churchman, "Costs, Utilities, and Values", 1956)

"The precision of a number is the degree of exactness with which it is stated, while the accuracy of a number is the degree of exactness with which it is known or observed. The precision of a quantity is reported by the number of significant figures in it." (Edmund C Berkeley & Lawrence Wainwright, Computers: Their Operation and Applications", 1956)

"The art of using the language of figures correctly is not to be over-impressed by the apparent air of accuracy, and yet to be able to take account of error and inaccuracy in such a way as to know when, and when not, to use the figures. This is a matter of skill, judgment, and experience, and there are no rules and short cuts in acquiring this expertness." (Ely Devons, "Essays in Economics", 1961)

"The two most important characteristics of the language of statistics are first, that it describes things in quantitative terms, and second, that it gives this description an air of accuracy and precision." (Ely Devons, "Essays in Economics", 1961)

"Relativity is inherently convergent, though convergent toward a plurality of centers of abstract truths. Degrees of accuracy are only degrees of refinement and magnitude in no way affects the fundamental reliability, which refers, as directional or angular sense, toward centralized truths. Truth is a relationship." (R Buckminster Fuller, "The Designers and the Politicians", 1962)

"Theories are usually introduced when previous study of a class of phenomena has revealed a system of uniformities. […] Theories then seek to explain those regularities and, generally, to afford a deeper and more accurate understanding of the phenomena in question. To this end, a theory construes those phenomena as manifestations of entities and processes that lie behind or beneath them, as it were." (Carl G Hempel, "Philosophy of Natural Science", 1966)

"Numbers are the product of counting. Quantities are the product of measurement. This means that numbers can conceivably be accurate because there is a discontinuity between each integer and the next. Between two and three there is a jump. In the case of quantity there is no such jump, and because jump is missing in the world of quantity it is impossible for any quantity to be exact. You can have exactly three tomatoes. You can never have exactly three gallons of water. Always quantity is approximate." (Gregory Bateson, "Number is Different from Quantity", CoEvolution Quarterly, 1978)

"Science has become a social method of inquiring into natural phenomena, making intuitive and systematic explorations of laws which are formulated by observing nature, and then rigorously testing their accuracy in the form of predictions. The results are then stored as written or mathematical records which are copied and disseminated to others, both within and beyond any given generation. As a sort of synergetic, rigorously regulated group perception, the collective enterprise of science far transcends the activity within an individual brain." (Lynn Margulis & Dorion Sagan, "Microcosmos", 1986)

"A theory is a good theory if it satisfies two requirements: it must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations." (Stephen Hawking, "A Brief History of Time: From Big Bang To Black Holes", 1988)

"Science is (or should be) a precise art. Precise, because data may be taken or theories formulated with a certain amount of accuracy; an art, because putting the information into the most useful form for investigation or for presentation requires a certain amount of creativity and insight." (Patricia H Reiff, "The Use and Misuse of Statistics in Space Physics", Journal of Geomagnetism and Geoelectricity 42, 1990)

"There is no sharp dividing line between scientific theories and models, and mathematics is used similarly in both. The important thing is to possess a delicate judgement of the accuracy of your model or theory. An apparently crude model can often be surprisingly effective, in which case its plain dress should not mislead. In contrast, some apparently very good models can be hiding dangerous weaknesses." (David Wells, "You Are a Mathematician: A wise and witty introduction to the joy of numbers", 1995)

"Science is more than a mere attempt to describe nature as accurately as possible. Frequently the real message is well hidden, and a law that gives a poor approximation to nature has more significance than one which works fairly well but is poisoned at the root." (Robert H March, "Physics for Poets", 1996)

"Accuracy of observation is the equivalent of accuracy of thinking." (Wallace Stevens, "Collected Poetry and Prose", 1997)

“Accurate estimates depend at least as much upon the mental model used in forming the picture as upon the number of pieces of the puzzle that have been collected.” (Richards J. Heuer Jr, “Psychology of Intelligence Analysis”, 1999)

"To be numerate means to be competent, confident, and comfortable with one’s judgements on whether to use mathematics in a particular situation and if so, what mathematics to use, how to do it, what degree of accuracy is appropriate, and what the answer means in relation to the context." (Diana Coben, "Numeracy, mathematics and adult learning", 2000)

"Innumeracy - widespread confusion about basic mathematical ideas - means that many statistical claims about social problems don't get the critical attention they deserve. This is not simply because an innumerate public is being manipulated by advocates who cynically promote inaccurate statistics. Often, statistics about social problems originate with sincere, well-meaning people who are themselves innumerate; they may not grasp the full implications of what they are saying. Similarly, the media are not immune to innumeracy; reporters commonly repeat the figures their sources give them without bothering to think critically about them." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"Most physical systems, particularly those complex ones, are extremely difficult to model by an accurate and precise mathematical formula or equation due to the complexity of the system structure, nonlinearity, uncertainty, randomness, etc. Therefore, approximate modeling is often necessary and practical in real-world applications. Intuitively, approximate modeling is always possible. However, the key questions are what kind of approximation is good, where the sense of 'goodness' has to be first defined, of course, and how to formulate such a good approximation in modeling a system such that it is mathematically rigorous and can produce satisfactory results in both theory and applications." (Guanrong Chen & Trung Tat Pham, "Introduction to Fuzzy Sets, Fuzzy Logic, and Fuzzy Control Systems", 2001)

"There are two problems with sampling - one obvious, and  the other more subtle. The obvious problem is sample size. Samples tend to be much smaller than their populations. [...] Obviously, it is possible to question results based on small samples. The smaller the sample, the less confidence we have that the sample accurately reflects the population. However, large samples aren't necessarily good samples. This leads to the second issue: the representativeness of a sample is actually far more important than sample size. A good sample accurately reflects (or 'represents') the population." (Joel Best, "Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists", 2001)

"[…] most earlier attempts to construct a theory of complexity have overlooked the deep link between it and networks. In most systems, complexity starts where networks turn nontrivial. No matter how puzzled we are by the behavior of an electron or an atom, we rarely call it complex, as quantum mechanics offers us the tools to describe them with remarkable accuracy. The demystification of crystals-highly regular networks of atoms and molecules-is one of the major success stories of twentieth-century physics, resulting in the development of the transistor and the discovery of superconductivity. Yet, we continue to struggle with systems for which the interaction map between the components is less ordered and rigid, hoping to give self-organization a chance." (Albert-László Barabási, "Linked: How Everything Is Connected to Everything Else and What It Means for Business, Science, and Everyday Life", 2002)

"Blissful data consist of information that is accurate, meaningful, useful, and easily accessible to many people in an organization. These data are used by the organization’s employees to analyze information and support their decision-making processes to strategic action. It is easy to see that organizations that have reached their goal of maximum productivity with blissful data can triumph over their competition. Thus, blissful data provide a competitive advantage.". (Margaret Y Chu, "Blissful Data", 2004)

"[…] we would like to observe that the butterfly effect lies at the root of many events which we call random. The final result of throwing a dice depends on the position of the hand throwing it, on the air resistance, on the base that the die falls on, and on many other factors. The result appears random because we are not able to take into account all of these factors with sufficient accuracy. Even the tiniest bump on the table and the most imperceptible move of the wrist affect the position in which the die finally lands. It would be reasonable to assume that chaos lies at the root of all random phenomena." (Iwo Bialynicki-Birula & Iwona Bialynicka-Birula, "Modeling Reality: How Computers Mirror Life", 2004)

"A scientific theory is a concise and coherent set of concepts, claims, and laws (frequently expressed mathematically) that can be used to precisely and accurately explain and predict natural phenomena." (Mordechai Ben-Ari, "Just a Theory: Exploring the Nature of Science", 2005)

"Coincidence surprises us because our intuition about the likelihood of an event is often wildly inaccurate." (Michael Starbird, "Coincidences, Chaos, and All That Math Jazz", 2005)

"[myth:] Accuracy is more important than precision. For single best estimates, be it a mean value or a single data value, this question does not arise because in that case there is no difference between accuracy and precision. (Think of a single shot aimed at a target.) Generally, it is good practice to balance precision and accuracy. The actual requirements will differ from case to case." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Humans have difficulty perceiving variables accurately […]. However, in general, they tend to have inaccurate perceptions of system states, including past, current, and future states. This is due, in part, to limited ‘mental models’ of the phenomena of interest in terms of both how things work and how to influence things. Consequently, people have difficulty determining the full implications of what is known, as well as considering future contingencies for potential systems states and the long-term value of addressing these contingencies. " (William B. Rouse, "People and Organizations: Explorations of Human-Centered Design", 2007) 

"Perception requires imagination because the data people encounter in their lives are never complete and always equivocal. [...] We also use our imagination and take shortcuts to fill gaps in patterns of nonvisual data. As with visual input, we draw conclusions and make judgments based on uncertain and incomplete information, and we conclude, when we are done analyzing the patterns, that out picture is clear and accurate. But is it?" (Leonard Mlodinow, "The Drunkard’s Walk: How Randomness Rules Our Lives", 2008)

"Prior to the discovery of the butterfly effect it was generally believed that small differences averaged out and were of no real significance. The butterfly effect showed that small things do matter. This has major implications for our notions of predictability, as over time these small differences can lead to quite unpredictable outcomes. For example, first of all, can we be sure that we are aware of all the small things that affect any given system or situation? Second, how do we know how these will affect the long-term outcome of the system or situation under study? The butterfly effect demonstrates the near impossibility of determining with any real degree of accuracy the long term outcomes of a series of events." (Elizabeth McMillan, Complexity, "Management and the Dynamics of Change: Challenges for practice", 2008)

"In the predictive modeling disciplines an ensemble is a group of algorithms that is used to solve a common problem [...] Each modeling algorithm has specific strengths and weaknesses and each provides a different mathematical perspective on the relationships modeled, just like each instrument in a musical ensemble provides a different voice in the composition. Predictive modeling ensembles use several algorithms to contribute their perspectives on the prediction problem and then combine them together in some way. Usually ensembles will provide more accurate models than individual algorithms which are also more general in their ability to work well on different data sets [...] the approach has proven to yield the best results in many situations." (Gary Miner et al, "Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications", 2012)

"The problem of complexity is at the heart of mankind’s inability to predict future events with any accuracy. Complexity science has demonstrated that the more factors found within a complex system, the more chances of unpredictable behavior. And without predictability, any meaningful control is nearly impossible. Obviously, this means that you cannot control what you cannot predict. The ability ever to predict long-term events is a pipedream. Mankind has little to do with changing climate; complexity does." (Lawrence K Samuels, "The Real Science Behind Changing Climate", 2014)

“A mathematical model is a mathematical description (often by means of a function or an equation) of a real-world phenomenon such as the size of a population, the demand for a product, the speed of a falling object, the concentration of a product in a chemical reaction, the life expectancy of a person at birth, or the cost of emission reductions. The purpose of the model is to understand the phenomenon and perhaps to make predictions about future behavior. [...] A mathematical model is never a completely accurate representation of a physical situation - it is an idealization." (James Stewart, “Calculus: Early Transcedentals” 8th Ed., 2016)

"Validity of a theory is also known as construct validity. Most theories in science present broad conceptual explanations of relationship between variables and make many different predictions about the relationships between particular variables in certain situations. Construct validity is established by verifying the accuracy of each possible prediction that might be made from the theory. Because the number of predictions is usually infinite, construct validity can never be fully established. However, the more independent predictions for the theory verified as accurate, the stronger the construct validity of the theory." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"The margin of error is how accurate the results are, and the confidence interval is how confident you are that your estimate falls within the margin of error." (Daniel J Levitin, "Weaponized Lies", 2017)

"Are your insights based on data that is accurate and reliable? Trustworthy data is correct or valid, free from significant defects and gaps. The trustworthiness of your data begins with the proper collection, processing, and maintenance of the data at its source. However, the reliability of your numbers can also be influenced by how they are handled during the analysis process. Clean data can inadvertently lose its integrity and true meaning depending on how it is analyzed and interpreted." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"The only way to achieve any accuracy is to ignore most of the information available." (Preston C Hammer) 

13 December 2018

Data Science: Bayesian Networks (Just the Quotes)

"The best way to convey to the experimenter what the data tell him about theta is to show him a picture of the posterior distribution." (George E P Box & George C Tiao, "Bayesian Inference in Statistical Analysis", 1973)

"In the design of experiments, one has to use some informal prior knowledge. How does one construct blocks in a block design problem for instance? It is stupid to think that use is not made of a prior. But knowing that this prior is utterly casual, it seems ludicrous to go through a lot of integration, etc., to obtain 'exact' posterior probabilities resulting from this prior. So, I believe the situation with respect to Bayesian inference and with respect to inference, in general, has not made progress. Well, Bayesian statistics has led to a great deal of theoretical research. But I don't see any real utilizations in applications, you know. Now no one, as far as I know, has examined the question of whether the inferences that are obtained are, in fact, realized in the predictions that they are used to make." (Oscar Kempthorne, "A conversation with Oscar Kempthorne", Statistical Science, 1995)

"Bayesian methods are complicated enough, that giving researchers user-friendly software could be like handing a loaded gun to a toddler; if the data is crap, you won't get anything out of it regardless of your political bent." (Brad Carlin, "Bayes offers a new way to make sense of numbers", Science, 1999)

"Bayesian inference is a controversial approach because it inherently embraces a subjective notion of probability. In general, Bayesian methods provide no guarantees on long run performance." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"Bayesian inference is appealing when prior information is available since Bayes’ theorem is a natural way to combine prior information with data. Some people find Bayesian inference psychologically appealing because it allows us to make probability statements about parameters. […] In parametric models, with large samples, Bayesian and frequentist methods give approximately the same inferences. In general, they need not agree." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"The Bayesian approach is based on the following postulates: (B1) Probability describes degree of belief, not limiting frequency. As such, we can make probability statements about lots of things, not just data which are subject to random variation. […] (B2) We can make probability statements about parameters, even though they are fixed constants. (B3) We make inferences about a parameter θ by producing a probability distribution for θ. Inferences, such as point estimates and interval estimates, may then be extracted from this distribution." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"The important thing is to understand that frequentist and Bayesian methods are answering different questions. To combine prior beliefs with data in a principled way, use Bayesian inference. To construct procedures with guaranteed long run performance, such as confidence intervals, use frequentist methods. Generally, Bayesian methods run into problems when the parameter space is high dimensional." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004) 

"Bayesian networks can be constructed by hand or learned from data. Learning both the topology of a Bayesian network and the parameters in the CPTs in the network is a difficult computational task. One of the things that makes learning the structure of a Bayesian network so difficult is that it is possible to define several different Bayesian networks as representations for the same full joint probability distribution." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015) 

"Bayesian networks provide a more flexible representation for encoding the conditional independence assumptions between the features in a domain. Ideally, the topology of a network should reflect the causal relationships between the entities in a domain. Properly constructed Bayesian networks are relatively powerful models that can capture the interactions between descriptive features in determining a prediction." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015) 

"Bayesian networks use a graph-based representation to encode the structural relationships - such as direct influence and conditional independence - between subsets of features in a domain. Consequently, a Bayesian network representation is generally more compact than a full joint distribution (because it can encode conditional independence relationships), yet it is not forced to assert a global conditional independence between all descriptive features. As such, Bayesian network models are an intermediary between full joint distributions and naive Bayes models and offer a useful compromise between model compactness and predictive accuracy." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"Bayesian networks inhabit a world where all questions are reducible to probabilities, or (in the terminology of this chapter) degrees of association between variables; they could not ascend to the second or third rungs of the Ladder of Causation. Fortunately, they required only two slight twists to climb to the top." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"The main differences between Bayesian networks and causal diagrams lie in how they are constructed and the uses to which they are put. A Bayesian network is literally nothing more than a compact representation of a huge probability table. The arrows mean only that the probabilities of child nodes are related to the values of parent nodes by a certain formula (the conditional probability tables) and that this relation is sufficient. That is, knowing additional ancestors of the child will not change the formula. Likewise, a missing arrow between any two nodes means that they are independent, once we know the values of their parents. [...] If, however, the same diagram has been constructed as a causal diagram, then both the thinking that goes into the construction and the interpretation of the final diagram change." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"The transparency of Bayesian networks distinguishes them from most other approaches to machine learning, which tend to produce inscrutable 'black boxes'. In a Bayesian network you can follow every step and understand how and why each piece of evidence changed the network’s beliefs." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"With Bayesian networks, we had taught machines to think in shades of gray, and this was an important step toward humanlike thinking. But we still couldn’t teach machines to understand causes and effects. [...] By design, in a Bayesian network, information flows in both directions, causal and diagnostic: smoke increases the likelihood of fire, and fire increases the likelihood of smoke. In fact, a Bayesian network can’t even tell what the 'causal direction' is." (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

27 December 2017

Data Management: Data Quality (Just the Quotes)

"[...] it is a function of statistical method to emphasize that precise conclusions cannot be drawn from inadequate data." (Egon S Pearson & H Q Hartley, "Biometrika Tables for Statisticians" Vol. 1, 1914)

"Not even the most subtle and skilled analysis can overcome completely the unreliability of basic data." (Roy D G Allen, "Statistics for Economists", 1951)

"The enthusiastic use of statistics to prove one side of a case is not open to criticism providing the work is honestly and accurately done, and providing the conclusions are not broader than indicated by the data. This type of work must not be confused with the unfair and dishonest use of both accurate and inaccurate data, which too commonly occurs in business. Dishonest statistical work usually takes the form of: (1) deliberate misinterpretation of data; (2) intentional making of overestimates or underestimates; and (3) biasing results by using partial data, making biased surveys, or using wrong statistical methods." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1951)

"Data are of high quality if they are fit for their intended use in operations, decision-making, and planning." (Joseph M Juran, 1964)

"There is no substitute for honest, thorough, scientific effort to get correct data (no matter how much it clashes with preconceived ideas). There is no substitute for actually reaching a correct chain of reasoning. Poor data and good reasoning give poor results. Good data and poor reasoning give poor results. Poor data and poor reasoning give rotten results." (Edmund C Berkeley, "Computers and Automation", 1969)

"Detailed study of the quality of data sources is an essential part of applied work. [...] Data analysts need to understand more about the measurement processes through which their data come. To know the name by which a column of figures is headed is far from being enough." (John W Tukey, "An Overview of Techniques of Data Analysis, Emphasizing Its Exploratory Aspects", 1982)

"We have found that some of the hardest errors to detect by traditional methods are unsuspected gaps in the data collection (we usually discovered them serendipitously in the course of graphical checking)." (Peter Huber, "Huge data sets", Compstat '94: Proceedings, 1994)

"Data obtained without any external disturbance or corruption are called clean; noisy data mean that a small random ingredient is added to the clean data." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"Probability theory is a serious instrument for forecasting, but the devil, as they say, is in the details - in the quality of information that forms the basis of probability estimates." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Unfortunately, just collecting the data in one place and making it easily available isn’t enough. When operational data from transactions is loaded into the data warehouse, it often contains missing or inaccurate data. How good or bad the data is a function of the amount of input checking done in the application that generates the transaction. Unfortunately, many deployed applications are less than stellar when it comes to validating the inputs. To overcome this problem, the operational data must go through a 'cleansing' process, which takes care of missing or out-of-range values. If this cleansing step is not done before the data is loaded into the data warehouse, it will have to be performed repeatedly whenever that data is used in a data mining operation." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"If the data is usually bad, and you find that you have to gather some data, what can you do to do a better job? First, recognize what I have repeatedly said to you, the human animal was not designed to be reliable; it cannot count accurately, it can do little or nothing repetitive with great accuracy. [...] Second, you cannot gather a really large amount of data accurately. It is a known fact which is constantly ignored. It is always a matter of limited resources and limited time. [...] Third, much social data is obtained via questionnaires. But it a well documented fact the way the questions are phrased, the way they are ordered in sequence, the people who ask them or come along and wait for them to be filled out, all have serious effects on the answers."  (Richard Hamming, "The Art of Doing Science and Engineering: Learning to Learn", 1997)

"Blissful data consist of information that is accurate, meaningful, useful, and easily accessible to many people in an organization. These data are used by the organization’s employees to analyze information and support their decision-making processes to strategic action. It is easy to see that organizations that have reached their goal of maximum productivity with blissful data can triumph over their competition. Thus, blissful data provide a competitive advantage." (Margaret Y Chu, "Blissful Data", 2004)

"Let’s define dirty data as: ‘… data that are incomplete, invalid, or inaccurate’. In other words, dirty data are simply data that are wrong. […] Incomplete or inaccurate data can result in bad decisions being made. Thus, dirty data are the opposite of blissful data. Problems caused by dirty data are significant; be wary of their pitfalls."  (Margaret Y Chu, "Blissful Data", 2004)

"Processes must be implemented to prevent bad data from entering the system as well as propagating to other systems. That is, dirty data must be intercepted at its source. The operational systems are often the source of informational data; thus dirty data must be fixed at the operational data level. Implementing the right processes to cleanse data is, however, not easy." (Margaret Y Chu, "Blissful Data", 2004)

"Equally critical is to include data quality definition and acceptable quality benchmarks into the conversion specifications. No product design skips quality specifications. including quality metrics and benchmarks. Yet rare data conversion follows suit. As a result, nobody knows how successful the conversion project was until data errors get exposed in the subsequent months and years. The solution is to perform comprehensive data quality assessment of the target data upon conversion and compare the results with pre-defined benchmarks." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"Much data in databases has a long history. It might have come from old 'legacy' systems or have been changed several times in the past. The usage of data fields and value codes changes over time. The same value in the same field will mean totally different thing in different records. Knowledge or these facts allows experts to use the data properly. Without this knowledge, the data may bc used literally and with sad consequences. The same is about data quality. Data users in the trenches usually know good data from bad and can still use it efficiently. They know where to look and what to check. Without these experts, incorrect data quality assumptions are often made and poor data quality becomes exposed." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"The big part of the challenge is that data quality does not improve by itself or as a result of general IT advancements. Over the years, the onus of data quality improvement was placed on modern database technologies and better information systems. [...] In reality, most IT processes affect data quality negatively, Thus, if we do nothing, data quality will continuously deteriorate to the point where the data will become a huge liability." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"While we might attempt to identify and correct most data errors, as well as try to prevent others from entering the database, the data quality will never be perfect. Perfection is practically unattainable in data quality as with the quality of most other products. In truth, it is also unnecessary since at some point improving data quality becomes more expensive than leaving it alone. The more efficient our data quality program, the higher level of quality we will achieve- but never will it reach 100%. However, accepting imperfection is not the same as ignoring it. Knowledge of the data limitations and imperfections can help use the data wisely and thus save time and money, The challenge, of course, is making this knowledge organized and easily accessible to the target users. The solution is a comprehensive integrated data quality meta data warehouse." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"Achieving a high level of data quality is hard and is affected significantly by organizational and ownership issues. In the short term, bandaging problems rather than addressing the root causes is often the path of least resistance." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Communicate loudly and widely where there are data quality problems and the associated risks with deploying BI tools on top of bad data. Also advise the different stakeholders on what can be done to address data quality problems - systematically and organizationally. Complaining without providing recommendations fixes nothing." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Data quality is such an important issue, and yet one that is not well understood or that excites business users. It’s often perceived as being a problem for IT to handle when it’s not: it’s for the business to own and correct." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Depending on the extent of the data quality issues, be careful about where you deploy BI. Without a reasonable degree of confidence in the data quality, BI should be kept in the hands of knowledge workers and not extended to frontline workers and certainly not to customers and suppliers. Deploy BI in this limited fashion as data quality issues are gradually exposed, understood, and ultimately, addressed. Don’t wait for every last data quality issue to be resolved; if you do, you will never deliver any BI capabilities, business users will never see the problem, and quality will never improve." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Our culture, obsessed with numbers, has given us the idea that what we can measure is more important than what we can't measure. Think about that for a minute. It means that we make quantity more important than quality." (Donella Meadows, "Thinking in Systems: A Primer", 2008)

"The data architecture is the most important technical aspect of your business intelligence initiative. Fail to build an information architecture that is flexible, with consistent, timely, quality data, and your BI initiative will fail. Business users will not trust the information, no matter how powerful and pretty the BI tools. However, sometimes it takes displaying that messy data to get business users to understand the importance of data quality and to take ownership of a problem that extends beyond business intelligence, to the source systems and to the organizational structures that govern a company’s data." (Cindi Howson, "Successful Business Intelligence: Secrets to making BI a killer App", 2008)

"Many new data scientists tend to rush past it to get their data into a minimally acceptable state, only to discover that the data has major quality issues after they apply their (potentially computationally intensive) algorithm and get a nonsense answer as output. (Sandy Ryza, "Advanced Analytics with Spark: Patterns for Learning from Data at Scale", 2009)

"Access to more information isn’t enough - the information needs to be correct, timely, and presented in a manner that enables the reader to learn from it. The current network is full of inaccurate, misleading, and biased information that often crowds out the valid information. People have not learned that 'popular' or 'available' information is not necessarily valid." (Gene Spafford, 2010)

"Are data quality and data governance the same thing? They share the same goal, essentially striving for the same outcome of optimizing data and information results for business purposes. Data governance plays a very important role in achieving high data quality. It deals primarily with orchestrating the efforts of people, processes, objectives, technologies, and lines of business in order to optimize outcomes around enterprise data assets. This includes, among other things, the broader cross-functional oversight of standards, architecture, business processes, business integration, and risk and compliance. Data governance is an organizational structure that oversees the compliance and standards of enterprise data." (Neera Bhansali, "Data Governance: Creating Value from Information Assets", 2014)

"Data governance is about putting people in charge of fixing and preventing data issues and using technology to help aid the process. Any time data is synchronized, merged, and exchanged, there have to be ground rules guiding this. Data governance serves as the method to organize the people, processes, and technologies for data-driven programs like data quality; they are a necessary part of any data quality effort." (Neera Bhansali, "Data Governance: Creating Value from Information Assets", 2014)

"Having data quality as a focus is a business philosophy that aligns strategy, business culture, company information, and technology in order to manage data to the benefit of the enterprise. Data quality is an elusive subject that can defy measurement and yet be critical enough to derail a single IT project, strategic initiative, or even an entire company." (Neera Bhansali, "Data Governance: Creating Value from Information Assets", 2014)

"Accuracy and coherence are related concepts pertaining to data quality. Accuracy refers to the comprehensiveness or extent of missing data, performance of error edits, and other quality assurance strategies. Coherence is the degree to which data - item value and meaning are consistent over time and are comparable to similar variables from other routinely used data sources." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"How good the data quality is can be looked at both subjectively and objectively. The subjective component is based on the experience and needs of the stakeholders and can differ by who is being asked to judge it. For example, the data managers may see the data quality as excellent, but consumers may disagree. One way to assess it is to construct a survey for stakeholders and ask them about their perception of the data via a questionnaire. The other component of data quality is objective. Measuring the percentage of missing data elements, the degree of consistency between records, how quickly data can be retrieved on request, and the percentage of incorrect matches on identifiers (same identifier, different social security number, gender, date of birth) are some examples." (Aileen Rothbard, "Quality Issues in the Use of Administrative Data Records", 2015)

"When we find data quality issues due to valid data during data exploration, we should note these issues in a data quality plan for potential handling later in the project. The most common issues in this regard are missing values and outliers, which are both examples of noise in the data." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"A popular misconception holds that the era of Big Data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data, and minimize bias. Even in a Big Data project, predictive models are typically developed and piloted with samples." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Metadata is the key to effective data governance. Metadata in this context is the data that defines the structure and attributes of data. This could mean data types, data privacy attributes, scale, and precision. In general, quality of data is directly proportional to the amount and depth of metadata provided. Without metadata, consumers will have to depend on other sources and mechanisms." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"The quality of data that flows within a data pipeline is as important as the functionality of the pipeline. If the data that flows within the pipeline is not a valid representation of the source data set(s), the pipeline doesn’t serve any real purpose. It’s very important to incorporate data quality checks within different phases of the pipeline. These checks should verify the correctness of data at every phase of the pipeline. There should be clear isolation between checks at different parts of the pipeline. The checks include checks like row count, structure, and data type validation." (Saurabh Gupta et al, "Practical Enterprise Data Lake Insights", 2018)

"Are your insights based on data that is accurate and reliable? Trustworthy data is correct or valid, free from significant defects and gaps. The trustworthiness of your data begins with the proper collection, processing, and maintenance of the data at its source. However, the reliability of your numbers can also be influenced by how they are handled during the analysis process. Clean data can inadvertently lose its integrity and true meaning depending on how it is analyzed and interpreted." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"First, from an ethos perspective, the success of your data story will be shaped by your own credibility and the trustworthiness of your data. Second, because your data story is based on facts and figures, the logos appeal will be integral to your message. Third, as you weave the data into a convincing narrative, the pathos or emotional appeal makes your message more engaging. Fourth, having a visualized insight at the core of your message adds the telos appeal, as it sharpens the focus and purpose of your communication. Fifth, when you share a relevant data story with the right audience at the right time (kairos), your message can be a powerful catalyst for change." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"The one unique characteristic that separates a data story from other types of stories is its fundamental basis in data. [...] The building blocks of every data story are quantitative or qualitative data, which are frequently the results of an analysis or insightful observation. Because each data story is formed from a collection of facts, each one represents a work of nonfiction. While some creativity may be used in how the story is structured and delivered, a true data story won’t stray too far from its factual underpinnings. In addition, the quality and trustworthiness of the data will determine how credible and powerful the data story is." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"Data is dirty. Let's just get that out there. How is it dirty? In all sorts of ways. Misspelled text values, date format problems, mismatching units, missing values, null values, incompatible geospatial coordinate formats, the list goes on and on." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020) 

"[...] the governance function is accountable to define what constitutes data quality and how each data product communicates that in a standard way. It’s no longer accountable for the quality of each data product. The platform team is accountable to build capabilities to validate the quality of the data and communicate its quality metrics, and each domain (data product owner) is accountable to adhere to the quality standards and provide quality data products." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"[...] data mesh introduces a fundamental shift that the owners of the data products must communicate and guarantee an acceptable level of quality and trustworthiness - specific to their domain - as an intrinsic characteristic of their data product. This means cleansing and running automated data integrity tests at the point of the creation of a data product." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Ensure you build into your data literacy strategy learning on data quality. If the individuals who are using and working with data do not understand the purpose and need for data quality, we are not sitting in a strong position for great and powerful insight. What good will the insight be, if the data has no quality within the model?" (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Bad data is costly to fix, and it’s more costly the more widespread it is. Everyone who has accessed, used, copied, or processed the data may be affected and may require mitigating action on their part. The complexity is further increased by the fact that not every consumer will “fix” it in the same way. This can lead to divergent results that are divergent with others and can be a nightmare to detect, track down, and rectify." (Adam Bellemare, "Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023)

"Data has historically been treated as a second-class citizen, as a form of exhaust or by-product emitted by business applications. This application-first thinking remains the major source of problems in today’s computing environments, leading to ad hoc data pipelines, cobbled together data access mechanisms, and inconsistent sources of similar-yet-different truths. Data mesh addresses these shortcomings head-on, by fundamentally altering the relationships we have with our data. Instead of a secondary by-product, data, and the access to it, is promoted to a first-class citizen on par with any other business service." (Adam Bellemare, "Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023)

"The problem of bad data has existed for a very long time. Data copies diverge as their original source changes. Copies get stale. Errors detected in one data set are not fixed in duplicate ones. Domain knowledge related to interpreting and understanding data remains incomplete, as does support from the owners of the original data." (Adam Bellemare, "Building an Event-Driven Data Mesh: Patterns for Designing and Building Event-Driven Architectures", 2023) 

"Errors using inadequate data are much less than those using no data at all." (Charles Babbage)

12 February 2017

Data Management: Data Quality (Definitions)

"Data are of high quality if they are fit for their intended use in operations, decision-making, and planning." (Joseph M Juran, 1964)

"[…] data has quality if it satisfies the requirements of its intended use. It lacks quality to the extent that it does not satisfy the requirement. In other words, data quality depends as much on the intended use as it does on the data itself. To satisfy the intended use, the data must be accurate, timely, relevant, complete, understood, and trusted." (Jack E Olson, "Data Quality: The Accuracy Dimension", 2003)

"A set of measurable characteristics of data that define how well data represents the real-world construct to which it refers." (Alex Berson & Lawrence Dubov, "Master Data Management and Customer Data Integration for a Global Enterprise", 2007)

"The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use." (Keith Gordon, "Principles of Data Management", 2007)

"Deals with data validation and cleansing services (to ensure relevance, validity, accuracy, and consistency of the master data), reconciliation services (aimed at helping cleanse the master data of duplicates as part of consistency), and cross-reference services (to help with matching master data across multiple systems)." (Martin Oberhofer et al,"Enterprise Master Data Management", 2008)

"A set of data properties (features, parameters, etc.) describing their ability to satisfy user’s expectations or requirements concerning data using for information acquiring in a given area of interest, learning, decision making, etc." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)

"Assessment of the cleanliness, accuracy, and reliability of data." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"A set of measurable characteristics of data that define how well the data represents the real-world construct to which it refers." (Alex Berson & Lawrence Dubov, "Master Data Management and Data Governance", 2010)

"This term refers to whether an organization’s data is reliable, consistent, up to date, free of duplication, and can be used efficiently across the organization." (Tony Fisher, "The Data Asset", 2009)

"A set of measurable characteristics of data that define how well the data represents the real-world construct to which it refers." (Alex Berson & Lawrence Dubov, "Master Data Management and Data Governance", 2010)

"The degree of data accuracy, accessibility, relevance, time-liness, and completeness." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"The degree of fitness for use of data in particular application. Also the degree to which data conforms to data specifications as measured in data quality dimensions. Sometimes used interchangeably with information quality." (John R Talburt, "Entity Resolution and Information Quality", 2011) 

"The degree to which data is accurate, complete, timely, consistent with all requirements and business rules, and relevant for a given use." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Contextual data quality considers the extent to which data are applicable (pertinent) to the task of the data user, not to the context of representation itself. Contextually appropriate data must be relevant to the consumer, in terms of timeliness and completeness. Dimensions include: value-added, relevancy, timeliness, completeness, and appropriate amount of data (from the Wang & Strong framework.)" (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Intrinsic data quality denotes that data have quality in their own right; it is understood largely as the extent to which data values are in conformance with the actual or true values. Intrinsically good data is accurate, correct, and objective, and comes from a reputable source. Dimensions include: accuracy objectivity, believability, and reputation (from the Wang & Strong framework)." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Representational data quality indicates that the system must present data in such a way that it is easy to understand (represented concisely and consistently) so that the consumer is able to interpret the data; understood as the extent to which data is presented in an intelligible and clear manner. Dimensions include: interpretability, ease of understanding, representational consistency, and concise representation (rom the Wang & Strong framework)." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"The level of quality of data represents the degree to which data meets the expectations of data consumers, based on their intended use of the data." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement", 2013) 

"The relative value of data, which is based on the accuracy of the knowledge that can be generated using that data. High-quality data is consistent, accurate, and unambiguous, and it can be processed efficiently." (Jim Davis & Aiman Zeid, "Business Transformation: A Roadmap for Maximizing Organizational Insights", 2014)

"The properties of data embodied by the “Five C’s”: clean, consistent, conformed, current, and comprehensive." (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The degree to which data in an IT system is complete, up-to-date, consistent, and (syntactically and semantically) correct." (Tilo Linz et al, "Software Testing Foundations, 4th Ed", 2014)

"A measure for the suitability of data for certain requirements in the business processes, where it is used. Data quality is a multi-dimensional, context-dependent concept that cannot be described and measured by a single characteristic, but rather various data quality dimensions. The desired level of data quality is thereby oriented on the requirements in the business processes and functions, which use this data [...]" (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

"[...] characteristics of data such as consistency, accuracy, reliability, completeness, timeliness, reasonableness, and validity. Data-quality software ensures that data elements are represented in a consistent way across different data stores or systems, making the data more trustworthy across the enterprise." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Refers to the accuracy, completeness, timeliness, integrity, and acceptance of data as determined by its users." (Gregory Lampshire, "The Data and Analytics Playbook", 2016)

"A measure of the useableness of data. An ideal dataset is accurate, complete, timely in publication, consistent in its naming of items and its handling of e.g. missing data, and directly machine-readable (see data cleaning), conforms to standards of nomenclature in the field, and is published with sufficient metadata that users can easily understand, for example, who it is published by and the meaning of the variables in the dataset." (Open Data Handbook) 

"Refers to the level of 'quality' in data. If a particular data store is seen as holding highly relevant data for a project, that data is seen as quality to the users." (Solutions Review)

"the processes and techniques involved in ensuring the reliability and application efficiency of data. Data is of high quality if it reliably reflects underlying processes and fits the intended uses in operations, decision making and planning." (KDnuggets)

"The narrow definition of data quality is that it's about data that is missing or incorrect. A broader definition is that data quality is achieved when a business uses data that is comprehensive, consistent, relevant and timely." (Information Management)

"Data Quality refers to the accuracy of datasets, and the ability to analyse and create actionable insights for other users." (experian) [source]

"Data quality refers to the current condition of data and whether it is suitable for a specific business purpose." (Xplenty) [source]

19 January 2010

Business Intelligence: Levels of Accuracy in Reporting

Business Intelligence
Business Intelligence

Correlated with the level of detail (aka granularity), another aspect that needs to be considered in reports is results’ levels of accuracy – how much the data reflect the reality at each level of detail. Because the integration between modules of the same or different systems brought some of the attributes from one module into the other (e.g. Document Numbers, UIDs, Dates), the reports can consider the respective attributes in queries without involving the modules they come from. Thus for example in order to create an AP (Account Payables) Invoices report, could be sufficient to use the GL (General Ledger) Date stored in AP without using the GL data directly; on the other side, a more accurate report is obtain when using the AP and GL data, such a report could include also the manual postings made in GL and not available in AP. 

Typically the data from one module that are stored in other modules, should be in synch, though there could be exceptions and exceptions. In the end it is developer’s task to understand the level of accuracy/detail the users need; in general it doesn’t always make sense to provide the highest level of accuracy/detail when the user needs only some rough number, as higher level of accuracy/details equate with more effort and more tables added to a report.

When considering the different levels of accuracy/detail within reports, given two modules/entities A and B, we might end up created a set of reports with:
a.  all records from A
b.  all records from B;
c.  all records from A and matching records from B, and vice-versa;
d.  all records from A and aggregated data from B when the level of detail is changed.
e.  aggregated data from A vs. aggregated data from B in case of many-to-many cardinality, the aggregation being made at a level of detail in which the amounts are not duplicated.
f.  mismatches between A and B for the same scope (it can be shown at different levels of details).

These scenarios would allow Users to choose the report with the needed level of detail/accuracy, though covering all existing scenarios could be quite expensive, not to neglect the fact that there are reports spanning more than 2 modules. On the other side, two reports for A and B would be sufficient as long as the reference attributes between modules are provided, falling in Users’ task to match and aggregate the data, though also this alternative could prove to be problematic because of the volume of data, synchronization issues between data sets or lack of adequate skill set, and all related issues. The first solution is the ideal, the second is workable, though in the end it depends what makes the users happy.

Previous Post <<||>> Next Post

18 January 2010

Data Management: Accuracy (Definitions)

"(1) A qualitative assessment of correctness, or freedom from error. (2) A quantitative measure of the magnitude of error." (IEEE, "IEEE Standard Glossary of Software Engineering Terminology", 1990)

[accuracy (of measurement):] "Closeness of the agreement between the result of a measurement and a true value of the measurand." International Vocabulary of Basic and General Terms in Metrology, 1993)

"A qualitative assessment of freedom from error or a quantitative measure of the magnitude of error, expressed as a function of relative error." (William H Inmon, "Building the Data Warehouse", 2005)

"Accuracy is the closeness of a measured value to the true value." (Steve McKillup, "Statistics Explained: An Introductory Guide for Life Scientists", 2005)

"A data element’s degree of conformity to an established business measurement or definition. Data precision is the degree to which further measurements or definitions will show the same results." (Jill Dyché & Evan Levy, "Customer Data Integration: Reaching a Single Version of the Truth", 2006)

"Degree of conformity of a measure to a standard or a true value. Level of precision or detail." (Martin J Eppler, "Managing Information Quality" 2nd Ed., 2006)

"The accuracy reflects the number of times the model is correct." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"An aspect of numerical data quality connected with a standard statistical error between a real parameter value and the corresponding value given by the data. Data accuracy is inversely proportional to this error." (Juliusz L Kulikowski, "Data Quality Assessment", 2009)

"An inherent quality characteristic that is a measure of the degree to which data agrees with an original source of data (such as a form, document, or unaltered electronic data) received from an acknowledged source outside the control of the organization." (David C Hay, "Data Model Patterns: A Metadata Map", 2010) [accuracy in regard to a surrogate source]

"An inherent quality characteristic that is a measure of the degree to which data accurately reflects the real-world object or event being described. Accuracy is the highest degree of inherent information quality possible." (David C Hay, "Data Model Patterns: A Metadata Map", 2010) [accuracy in regard to reality]

"Freedom from mistakes or error, conformity to truth or to a standard, exactness, the degree of conformity of a measure to a standard or true value. (Michael Brackett, 2011)

"The degree to which a data attribute value closely and correctly describes its business entity instance (the 'real life' entities) as of a point in time." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Accuracy is the quality or state of being correct or precise; accurate information is correct in all details (NOAD)." (Laura Sebastian-Coleman, "Measuring Data Quality for Ongoing Improvement ", 2012)

"Within the quality management system, accuracy is an assessment of correctness." (For Dummies, "PMP Certification All-in-One For Dummies" 2nd Ed., 2013)

"How closely a measurement or assessment reflects the true value. Not to be confused with precision [...]" (Kenneth A Shaw, "Integrated Management of Processes and Information", 2013)

"Accuracy is defined as a measure of whether the value of a given data element is correct and reflects the real world as viewed by a valid real-world source (SME, customer, hard-copy record, etc.)." (Rajesh Jugulum, "Competing with High Quality Data", 2014)

"Within the quality management system, accuracy is an assessment of correctness." (Project Management Institute, "A Guide to the Project Management Body of Knowledge (PMBOK® Guide)" 6th Ed., 2017)

"The degree to which the data reflect the truth or reality. A spelling mistake is a good example of inaccurate data." (Piethein Strengholt, "Data Management at Scale", 2020)

"The degree to which the semantic assertions of a model are accepted to be true." (Panos Alexopoulos, "Semantic Modeling for Data", 2020)

"The degree of how closely the data represents the true value of the attribute in the real-world context." (Zhamak Dehghani, "Data Mesh: Delivering Data-Driven Value at Scale", 2021)

"Closeness of computations or estimates to the exact or true values that the statistics were intended to measure." (SDMX) 

"The capability of the software product to provide the right or agreed results or effects with the needed degree of precision." [ISO/IEC 25000]

 "The closeness of agreement between an observed value and an accepted reference value." (American Society for Quality)

"The term “accuracy” refers to the degree to which information accurately reflects an event or object described." (Precisely) [source]

17 January 2010

Data Management: Data Quality Dimensions (Part IV: Accuracy)

Data Management
Data Management Series

Accuracy refers to the extent data is correct, matching the reality with an acceptable level of approximation. Correctness, the value of being correct, same as reality are vague terms, in many cases they are a question of philosophy, perception, having a high degree of interpretability. However, in what concerns data they are typically the result of measurement, therefore a measurement of accuracy relates to the degree the data deviate from physical laws, logics or defined rules, though also this context is a swampy field because, utilizing a well-known syntagm, everything is relative. 

From a scientific point of view, we try to model the reality with mathematical models which offer various level of approximation, the more we learn about our world, the more flaws we discover in the existing models, it’s a continuous quest for finding better models that approximate the reality. Things don’t have to be so complicated, for basic measurements there are many tools out there that offer acceptable results for most of the requirements, on the other side, as requirements change, better approximations might be required with time.

Another concept related with the ones of accuracy and measurement systems is the one of precision, and it refers to degree repeated measurements under unchanged conditions lead to the same results, further concepts associated with it being the ones of repeatability and reproducibility. Even if the accuracy and precision concepts are often confounded a measurement system can be accurate but not precise or precise but not accurate (see the target analogy), a valid measurement system targeting thus both aspects. Accuracy and precision can be considered dimensions of correctedness.

Coming back to accuracy and its use in determining data quality, typically accuracy it’s strong related to the measurement tools used, for this being needed to do again the measurements for all or a sample of the dataset and identify whether the requested level of accuracy is met, approach that could involve quite an effort. The accuracy depends also on whether the systems used to store the data are designed to store the data at the requested level of accuracy, fact reflected by the characteristics of data types used (e.g. precision, length).

Given the fact that a system stores related data (e.g. weight, height, width, length) that could satisfy physical, business of common-sense rules could be used rules to check whether the data satisfy them with the desired level of approximation. For example, being known the height, width, length and the composition of a material (e.g. metal bar) could be determined the approximated weight and compared with entered weight, if the difference is not inside of a certain interval then most probably one of the values were incorrect entered. There are even simpler rules that might apply, for example the physical dimensions must be positive real values, or in a generalized formulation - involve maximal or minimal limits that lead to identification of outliers, etc. In fact, most of the time determining data accuracy resumes only at defining possible value intervals, though there will be also cases in which for this purpose are built complex models and specific techniques.

There is another important aspect related to accuracy, time dependency of data – whether the data changes or not with time. Data currency or actuality refers to the extent data are actual. Given the above definition for accuracy, currency could be considered as a special type of accuracy because when the data are not actual then they don’t reflect reality. If currency is considered as a standalone data quality dimension, then accuracy refers only to the data that are not time dependent.


Written: Jan-2010, Last Reviewed: Mar-2024
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.