Showing posts with label dimensions. Show all posts

18 May 2024

Graphical Representation: Graphics We Live By (Part IV: Area Charts in MS Excel)

Graphical Representation

An area chart or area graph (see A) is a graphical representation of quantitative data based on a line chart, in which the areas between the axis and the lines of the series are commonly emphasized with colors, textures, or hatchings (Wikipedia [1]). It resembles a combination of line and bar charts. Each data series forms a region (aka area), which makes it possible to identify overlaps and to compare the lines within the same visual display. This approach usually works well for two or three data series if the lines don't overlap, though the more data series are added to the chart, the higher the chances that lines overlap or that one area is covered by another (see B). This can easily become more than the chart can handle, even if the data series can be filtered dynamically.

Area Charts

Stacked area charts are a variation of area charts in which the areas are stacked, much like stacked bar charts (see C). Research papers abound with such charts, probably because they allow stacking multiple data series within a small space, thus reflecting the many variables involved. Such charts allow tracking individual as well as intermediary and total aggregated trends.

Stacked Area Charts

Unfortunately, besides the fact that some areas are barely distinguishable or that distant areas can't be compared (especially when an area in between fluctuates strongly), the lack of ticks and/or gridlines (see D) makes such charts difficult to interpret. Moreover, when the lines are smoothed, it becomes even more difficult to identify the actual points. To address this, it makes sense to use markers for the data points to show that one works with discrete and not continuous points (see further paragraphs).

In general, it's recommended to reduce the number of data series to 3-5. For example, one can split the data series into 2-3 groups or categories based on the series' characteristics (e.g. concentrate on the high values in one chart and the low values in another, or group the low values under an "others" category), which allows better comparisons.
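The grouping of low values under an "others" category can be prepared before the data reaches the chart. The following is only a minimal sketch in Python, under the assumption that each series is a list of values over the same time points; the function name `group_small_series`, the `keep` parameter and the sample values are hypothetical and not taken from the charts above.

```python
from statistics import mean

def group_small_series(series, keep=3, other_label="Others"):
    """Keep the `keep` series with the highest mean; sum the rest point by point."""
    # Rank the series names by their average value, highest first.
    ranked = sorted(series, key=lambda name: mean(series[name]), reverse=True)
    kept = {name: list(series[name]) for name in ranked[:keep]}
    rest = [series[name] for name in ranked[keep:]]
    if rest:
        # zip(*rest) walks the remaining series one time point at a time.
        kept[other_label] = [sum(values) for values in zip(*rest)]
    return kept

# Hypothetical monthly values for five data series.
data = {
    "A": [50, 55, 60],
    "B": [30, 28, 35],
    "C": [5, 4, 6],
    "D": [3, 2, 4],
    "E": [20, 22, 21],
}
grouped = group_small_series(data, keep=3)
print(grouped)  # A, B and E are kept; C and D are summed into "Others"
```

Ranking by the mean is just one possible criterion; the maximum or the last value could serve as well, depending on what the chart is supposed to emphasize.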

Being able to sort the time series by their average value or by other criteria (e.g. showing the areas with minimal variations first) can improve the readability of such charts.
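Such an ordering can be derived outside the chart and then applied to the series. The sketch below assumes the same list-per-series layout as a plain Python dictionary; `order_series` and the sample data are hypothetical, and the population standard deviation is used here merely as one possible measure of variation.

```python
from statistics import mean, pstdev

def order_series(series, by="mean", descending=False):
    """Return the series names sorted by mean value or by variation."""
    # pstdev (population standard deviation) stands in for "variation";
    # other measures (range, coefficient of variation) would work as well.
    measure = {"mean": mean, "variation": pstdev}[by]
    return sorted(series, key=lambda name: measure(series[name]), reverse=descending)

# Hypothetical values for three data series.
data = {
    "Steady": [10, 10, 11],
    "Volatile": [5, 20, 2],
    "Growing": [8, 12, 16],
}
by_variation = order_series(data, by="variation")        # least variation first
by_mean = order_series(data, by="mean", descending=True)  # highest average first
print(by_variation)  # ['Steady', 'Growing', 'Volatile']
print(by_mean)       # ['Growing', 'Steady', 'Volatile']
```

In a stacked area chart, placing the series with the least variation at the bottom keeps the baseline of the areas above it flatter, which makes them easier to read.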

Moreover, areas under curves can easily hide missing data (see F) and occasionally negative values (which is the case of the 8th example), or distort the rate of change when the charts are wider than needed (compare F with C). 
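Because a filled area can mask such issues, it may help to check a series before charting it. The following is a small sketch under the assumption that missing points are represented as `None`; the function `audit_series` and the sample values are hypothetical.

```python
def audit_series(values):
    """Report the positions of missing (None) and negative values in a series."""
    missing = [i for i, v in enumerate(values) if v is None]
    negative = [i for i, v in enumerate(values) if v is not None and v < 0]
    return {"missing": missing, "negative": negative}

# Hypothetical series with one gap and one negative value.
report = audit_series([4, None, 6, -2, 5])
print(report)  # {'missing': [1], 'negative': [3]}
```

If the report isn't empty, a line chart with markers is probably the safer choice, as gaps and negative values remain visible there.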

Line Chart, respectively Area Chart based on a subset
Area Charts Variations

Area charts seem to encode a dimension as area, though that's not necessarily the case. It seems natural to display time series of different granularities (day, month, quarter, year), though one needs to be careful about one important aspect! On a time scale, the more one moves away from the day to weeks and months as time units, the bigger the distance between points is. In the end, all the points in a series are discrete points (not continuous), though the bigger the distance, the more category-like these series become (compare F with C, the charts have the same width).

Using the area under the curve as a dimension makes sense when there's continuity or when the discrete points are close enough to each other to resemble continuity. Thus, area charts are useful when the number of points is high (and the distance between them becomes negligible), e.g. showing daily values within a year or monthly values over several years.

According to [2], [3] and several other sources, using area to encode quantitative information is a poor graphical method, and this applies to pie charts and area charts alike. By contrast, in a bar chart (see G) one can use either height or width for comparisons, while the points are always delimited as bars. Scatter plots (see H), even if they might miss the time dimension, better reflect the dispersion of the points along the lines delimited by the color encoding (compare H with E).

Column Chart and Scatter Plot
Alternatives for Area Charts

The more category-like the data series are and the fewer data points they have, the higher the chances that other graphical representation tools represent the data better. For example, year- or even quarter-based data can be better visualized with Sankey charts (unfortunately, not yet available as a standard Excel visual).

Conversely, there are situations in which an area chart isn't supposed to convey specific values but to give a feeling for the areas' shape, or in which its simplicity is more appropriate; in such situations area charts do a good job. In the end, a graphical representation's utility is linked to the chart's purpose (and audience, of course).

References:
[1] Wikipedia (2023) Area charts
[2] William S Cleveland (1993) Visualizing Data
[3] Robert L Harris (1996) Information Graphics: A Comprehensive Illustrated Reference

20 November 2018

Data Science: Dimensionality (Just the Quotes)

"[…] the intrinsic value of a small-scale model is that it compensates for the renunciation of sensible dimensions by the acquisition of intelligible dimensions." (Claude Levi- Strauss, "The Savage Mind", 1962)

"The idea of knowledge as an improbable structure is still a good place to start. Knowledge, however, has a dimension which goes beyond that of mere information or improbability. This is a dimension of significance which is very hard to reduce to quantitative form. Two knowledge structures might be equally improbable but one might be much more significant than the other." (Kenneth E Boulding, "Beyond Economics: Essays on Society", 1968)

"A time series is a sequence of observations, usually ordered in time, although in some cases the ordering may be according to another dimension. The feature of time series analysis which distinguishes it from other statistical analysis is the explicit recognition of the importance of the order in which the observations are made. While in many problems the observations are statistically independent, in time series successive observations may be dependent, and the dependence may depend on the positions in the sequence. The nature of a series and the structure of its generating process also may involve in other ways the sequence in which the observations are taken." (Theodore W Anderson, "The Statistical Analysis of Time Series", 1971)

"The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data.(Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"In addition to dimensionality requirements, chaos can occur only in nonlinear situations. In multidimensional settings, this means that at least one term in one equation must be nonlinear while also involving several of the variables. With all linear models, solutions can be expressed as combinations of regular and linear periodic processes, but nonlinearities in a model allow for instabilities in such periodic solutions within certain value ranges for some of the parameters." (Courtney Brown, "Chaos and Catastrophe Theories", 1995)

"The dimensionality and nonlinearity requirements of chaos do not guarantee its appearance. At best, these conditions allow it to occur, and even then under limited conditions relating to particular parameter values. But this does not imply that chaos is rare in the real world. Indeed, discoveries are being made constantly of either the clearly identifiable or arguably persuasive appearance of chaos. Most of these discoveries are being made with regard to physical systems, but the lack of similar discoveries involving human behavior is almost certainly due to the still developing nature of nonlinear analyses in the social sciences rather than the absence of chaos in the human setting."  (Courtney Brown, "Chaos and Catastrophe Theories", 1995)

"A system may be called complex here if its dimension (order) is too high and its model (if available) is nonlinear, interconnected, and information on the system is uncertain such that classical techniques can not easily handle the problem." (M Jamshidi, "Autonomous Control on Complex Systems: Robotic Applications", Current Advances in Mechanical Design and Production VII, 2000)

"The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and responses [...] different models, all of them equally good, may give different pictures of the relation between the predictor and response variables [...] One reason for this multiplicity is that goodness-of-fit tests and other methods for checking fit give a yes–no answer. With the lack of power of these tests with data having more than a small number of dimensions, there will be a large number of models whose fit is acceptable. There is no way, among the yes–no methods for gauging fit, of determining which is the better model." (Leo Breiman, "Statistical Modeling: The two cultures" Statistical Science 16(3), 2001)

"Three key aspects of presenting high dimensional data are: rendering, manipulation, and linking. Rendering determines what is to be plotted, manipulation determines the structure of the relationships, and linking determines what information will be shared between plots or sections of the graph." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"With the ever increasing amount of empirical information that scientists from all disciplines are dealing with, there exists a great need for robust, scalable and easy to use clustering techniques for data abstraction, dimensionality reduction or visualization to cope with and manage this avalanche of data."  (Jörg Reichardt, "Structure in Complex Networks", 2009)

"The more dimensions used in quantitative comparisons, the larger are the disparities that can be accommodated. As irony would have it, however, the ease of comparison generally diminishes in direct proportion to the number of dimensions involved." (Joel Katz, "Designing Information: Human factors and common sense in information design", 2012)

"Dimensionality reduction and regression modeling are particularly hard to interpret in terms of original attributes, when the underlying data dimensionality is high. This is because the subspace embedding is defined as a linear combination of attributes with positive or negative coefficients. This cannot easily be intuitively interpreted in terms specific properties of the data attributes." (Charu C Aggarwal, "Outlier Analysis", 2013)

"Dimensionality reduction is essential for coping with big data - like the data coming in through your senses every second. A picture may be worth a thousand words, but it’s also a million times more costly to process and remember. [...] A common complaint about big data is that the more data you have, the easier it is to find spurious patterns in it. This may be true if the data is just a huge set of disconnected entities, but if they’re interrelated, the picture changes." (Pedro Domingos, "The Master Algorithm", 2015)

"The correlational technique known as multiple regression is used frequently in medical and social science research. This technique essentially correlates many independent (or predictor) variables simultaneously with a given dependent variable (outcome or output). It asks, 'Net of the effects of all the other variables, what is the effect of variable A on the dependent variable?' Despite its popularity, the technique is inherently weak and often yields misleading results. The problem is due to self-selection. If we don’t assign cases to a particular treatment, the cases may differ in any number of ways that could be causing them to differ along some dimension related to the dependent variable. We can know that the answer given by a multiple regression analysis is wrong because randomized control experiments, frequently referred to as the gold standard of research techniques, may give answers that are quite different from those obtained by multiple regression analysis." (Richard E Nisbett, "Mindware: Tools for Smart Thinking", 2015)

"Understanding reduces the complexity of data by collapsing the dimensionality of information to a lower set of known variables. s revolutions, be they tiny or vast, technological or social." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"Dimensionality reduction is a way of reducing a large number of different measures into a smaller set of metrics. The intent is that the reduced metrics are a simpler description of the complex space that retains most of the meaning." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"The higher the dimension, in other words, the higher the number of possible interactions, and the more disproportionally difficult it is to understand the macro from the micro, the general from the simple units. This disproportionate increase of computational demands is called the curse of dimensionality." (Nassim N Taleb, "Skin in the Game: Hidden Asymmetries in Daily Life", 2018)

"This problem with adding additional variables is referred to as the curse of dimensionality. If you add enough variables into your black box, you will eventually find a combination of variables that performs well - but it may do so by chance. As you increase the number of variables you use to make your predictions, you need exponentially more data to distinguish true predictive capacity from luck." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"We all know that the numerical values on each side of an equation have to be the same. The key to dimensional analysis is that the units have to be the same as well. This provides a convenient way to keep careful track of units when making calculations in engineering and other quantitative disciplines, to make sure one is computing what one thinks one is computing. When an equation exists only for the sake of mathiness, dimensional analysis often makes no sense." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

19 February 2015

Business Intelligence: Measures (Definitions)

"A quantitative, numerical column in a fact table. Measures typically represent the values that are analyzed. See also dimension." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A metric is a measurable or quantitative value." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"A measure is a dimensional modeling term that refers to values, usually numeric, that measure some aspect of the business. Measures reside in fact tables. The dimensional terms measure and attribute, taken together, are equivalent to the relational modeling use of the term attribute." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"(1) A mapping from empirical properties to quantities in a formal mathematical model called a measurement scale. (2) To obtain a measurement." (Richard D Stutzke, "Estimating Software-Intensive Systems: Projects, Products, and Processes", 2005)

"In Dimensional modeling, a specific data item that describes a fact or aggregation of facts. Measures are implemented as metric facts." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A summarizable numerical value used to monitor business activity; it is also known as a fact. " (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A column of quantifiable data mapped to a dimension within a cube. Measures are often used to provide access to aggregations of data (such as annual sales of a product or a store), while also giving the ability to drill down into the details (such as quarterly or monthly sales)." (Robert D. Schneider and Darril Gibson, "Microsoft SQL Server 2008 All-In-One Desk Reference For Dummies", 2008)

[business measure:] "Business performance metric captured by an operational system and represented as a physical or computed fact in a dimensional model." (Ralph Kimball, "The Data Warehouse Lifecycle Toolkit", 2008)

"A set of usually numeric values from a fact table that is aggregated in a cube across all dimensions." (Jim Joseph et al, Microsoft® SQL Server 2008 Reporting Services Unleashed, 2009)

[business measures:] "The complete set of facts, base and derived, that are defined and made available for reporting and analysis." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"A quantitative performance indicator or success factor that can be traced on an ongoing basis to determine successful operation and progress toward objectives and goals." (David Lyle & John G. Schmidt, "Lean Integration", 2010)

"1.Loosely used, a metric. 2.In data modeling, a quantified characteristic; the unit used to quantify the dimensions, capacity, or amount of something." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Value assigned (noun) or the process of assigning a value (verb) to an object through calculation, appraisal, estimation, or some other method." (Leslie G Eldenburg & Susan K. Wolcott, "Cost Management" 2nd Ed., 2011)

"In a cube, a set of values that are usually numeric and are based on a column in the fact table of the cube. Measures are the central values that are aggregated and analyzed." (Microsoft, "SQL Server 2012 Glossary", 2012)

"The act of identifying what to measure as well as actually collecting the measures that would help an organization understand if the process is operating within acceptable limits." (Project Management Institute, "Organizational Project Management Maturity Model (OPM3®)" 3rd Ed., 2013)

"Metrics such as count, maximum, minimum, sum, or average that are used in a fact table. Measures can be calculated with an SQL expression or mapped directly to a numeric value in a column." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"The number or category assigned to an attribute of an entity by making a measurement. (ISO 14598)

19 December 2014

Systems Engineering: Feedback (Just the Quotes)

"Feedback is a method of controlling a system by reinserting into it the results of its past performance. If these results are merely used as numerical data for the criticism of the system and its regulation, we have the simple feedback of the control engineers. If, however, the information which proceeds backward from the performance is able to change the general method and pattern of performance, we have a process which may be called learning." (Norbert Wiener, 1954)

"[...] the concept of 'feedback', so simple and natural in certain elementary cases, becomes artificial and of little use when the interconnexions between the parts become more complex. When there are only two parts joined so that each affects the other, the properties of the feedback give important and useful information about the properties of the whole. But when the parts rise to even as few as four, if every one affects the other three, then twenty circuits can be traced through them; and knowing the properties of all the twenty circuits does not give complete information about the system. Such complex systems cannot be treated as an interlaced set of more or less independent feedback circuits, but only as a whole. For understanding the general principles of dynamic systems, therefore, the concept of feedback is inadequate in itself. What is important is that complex systems, richly cross-connected internally, have complex behaviours, and that these behaviours can be goal-seeking in complex patterns." (W Ross Ashby, "An Introduction to Cybernetics", 1956)

"Traditional organizational theories have tended to view the human organization as a closed system. This tendency has led to a disregard of differing organizational environments and the nature of organizational dependency on environment. It has led also to an over-concentration on principles of internal organizational functioning, with consequent failure to develop and understand the processes of feedback which are essential to survival." (Daniel Katz, "The Social Psychology of Organizations", 1966)

"The structure of a complex system is not a simple feedback loop where one system state dominates the behavior. The complex system has a multiplicity of interacting feedback loops. Its internal rates of flow are controlled by non‐linear relationships. The complex system is of high order, meaning that there are many system states (or levels). It usually contains positive‐feedback loops describing growth processes as well as negative, goal‐seeking loops." (Jay W Forrester, "Urban Dynamics", 1969)

"To model the dynamic behavior of a system, four hierarchies of structure should be recognized: closed boundary around the system; feedback loops as the basic structural elements within the boundary; level variables representing accumulations within the feedback loops; rate variables representing activity within the feedback loops." (Jay W Forrester, "Urban Dynamics", 1969)

"Effect spreads its 'tentacles' not only forwards (as a new cause giving rise to a new effect) but also backwards, to the cause which gave rise to it, thus modifying, exhausting or intensifying its force. This interaction of cause and effect is known as the principle of feedback. It operates everywhere, particularly in all self-organising systems where perception, storing, processing and use of information take place, as for example, in the organism, in a cybernetic device, and in society. The stability, control and progress of a system are inconceivable without feedback." (Alexander Spirkin, "Dialectical Materialism", 1983)

"Ultimately, uncontrolled escalation destroys a system. However, change in the direction of learning, adaptation, and evolution arises from the control of control, rather than unchecked change per se. In general, for the survival and co-evolution of any ecology of systems, feedback processes must be embodied by a recursive hierarchy of control circuits." (Bradford P Keeney, "Aesthetics of Change", 1983)

"Every system of whatever size must maintain its own structure and must deal with a dynamic environment, i.e., the system must strike a proper balance between stability and change. The cybernetic mechanisms for stability (i.e., homeostasis, negative feedback, autopoiesis, equifinality) and change (i.e., positive feedback, algedonodes, self-organization) are found in all viable systems." (Barry Clemson, "Cybernetics: A New Management Tool", 1984) 

"The term closed loop-learning process refers to the idea that one learns by determining what s desired and comparing what is actually taking place as measured at the process and feedback for comparison. The difference between what is desired and what is taking place provides an error indication which is used to develop a signal to the process being controlled." (Harold Chestnut, 1984) 

"The term chaos is used in a specific sense where it is an inherently random pattern of behaviour generated by fixed inputs into deterministic (that is fixed) rules (relationships). The rules take the form of non-linear feedback loops. Although the specific path followed by the behaviour so generated is random and hence unpredictable in the long-term, it always has an underlying pattern to it, a 'hidden' pattern, a global pattern or rhythm. That pattern is self-similarity, that is a constant degree of variation, consistent variability, regular irregularity, or more precisely, a constant fractal dimension. Chaos is therefore order (a pattern) within disorder (random behaviour)." (Ralph D Stacey, "The Chaos Frontier: Creative Strategic Control for Business", 1991)

"In many parts of the economy, stabilizing forces appear not to operate. Instead, positive feedback magnifies the effects of small economic shifts; the economic models that describe such effects differ vastly from the conventional ones. Diminishing returns imply a single equilibrium point for the economy, but positive feedback – increasing returns – makes for many possible equilibrium points. There is no guarantee that the particular economic outcome selected from among the many alternatives will be the ‘best’ one."  (W Brian Arthur, "Returns and Path Dependence in the Economy", 1994)

"[…] self-organization is the spontaneous emergence of new structures and new forms of behavior in open systems far from equilibrium, characterized by internal feedback loops and described mathematically by nonlinear equations." (Fritjof Capra, "The Web of Life: A New Scientific Understanding of Living Systems", 1996)

"Something of the previous state, however, survives every change. This is called in the language of cybernetics (which took it form the language of machines) feedback, the advantages of learning from experience and of having developed reflexes." (Guy Davenport, "The Geography of the Imagination: Forty Essays", 1997)

"Cybernetics is the science of effective organization, of control and communication in animals and machines. It is the art of steersmanship, of regulation and stability. The concern here is with function, not construction, in providing regular and reproducible behaviour in the presence of disturbances. Here the emphasis is on families of solutions, ways of arranging matters that can apply to all forms of systems, whatever the material or design employed. [...] This science concerns the effects of inputs on outputs, but in the sense that the output state is desired to be constant or predictable – we wish the system to maintain an equilibrium state. It is applicable mostly to complex systems and to coupled systems, and uses the concepts of feedback and transformations (mappings from input to output) to effect the desired invariance or stability in the result." (Chris Lucas, "Cybernetics and Stochastic Systems", 1999)

"All dynamics arise from the interaction of just two types of feedback loops, positive (or self-reinforcing) and negative (or self-correcting) loops. Positive loops tend to reinforce or amplify whatever is happening in the system […] Negative loops counteract and oppose change." (John D Sterman, "Business Dynamics: Systems thinking and modeling for a complex world", 2000)

"Much of the art of system dynamics modeling is discovering and representing the feedback processes, which, along with stock and flow structures, time delays, and nonlinearities, determine the dynamics of a system. […] the most complex behaviors usually arise from the interactions (feedbacks) among the components of the system, not from the complexity of the components themselves." (John D Sterman, "Business Dynamics: Systems thinking and modeling for a complex world", 2000)

"The phenomenon of emergence takes place at critical points of instability that arise from fluctuations in the environment, amplified by feedback loops." (Fritjof Capra, "The Hidden Connections: A Science for Sustainable Living", 2002)

"Thus, nonlinearity can be understood as the effect of a causal loop, where effects or outputs are fed back into the causes or inputs of the process. Complex systems are characterized by networks of such causal loops. In a complex, the interdependencies are such that a component A will affect a component B, but B will in general also affect A, directly or indirectly.  A single feedback loop can be positive or negative. A positive feedback will amplify any variation in A, making it grow exponentially. The result is that the tiniest, microscopic difference between initial states can grow into macroscopically observable distinctions." (Carlos Gershenson, "Design and Control of Self-organizing Systems", 2007)

"The work around the complex systems map supported a concentration on causal mechanisms. This enabled poor system responses to be diagnosed as the unanticipated effects of previous policies as well as identification of the drivers of the sector. Understanding the feedback mechanisms in play then allowed experimentation with possible future policies and the creation of a coherent and mutually supporting package of recommendations for change."  (David C Lane et al, "Blending systems thinking approaches for organisational analysis: reviewing child protection", 2015)

More quotes on "Feedback" at the-web-of-knowledge.blogspot.com.

23 May 2014

Data Science: Fractal (Definitions)

"A fractal is a mathematical set or concrete object that is irregular or fragmented at all scales [...]" (Benoît Mandelbrot, "The Fractal Geometry of Nature", 1982)

"Objects (in particular, figures) that have the same appearance when they are seen on fine and coarse scales." (David Rincón & Sebastià Sallent, Scaling Properties of Network Traffic, 2008) "A collection of objects that have a power-law dependence of number on size." (Donald L Turcotte, "Fractals in Geology and Geophysics", 2009) 

"A fractal is a geometric object which is self-similar and characterized by an effective dimension which is not an integer." (Leonard M Sander, "Fractal Growth Processes", 2009) 

"A fractal is a structure which can be subdivided into parts, where the shape of each part is similar to that of the original structure." (Yakov M Strelniker, "Fractals and Percolation", 2009) 

"A fractal is an image that comprises two distinct attributes: infinite detail and self-similarity." (Daniel C. Doolan et al, "Unlocking the Hidden Power of the Mobile", 2009)

"A geometrical object that is invariant at any scale of magnification or reduction." (Sidney Redner, "Fractal and Multifractal Scaling of Electrical Conduction in Random Resistor Networks", 2009) 

[Fractal structure:] "A pattern or arrangement of system elements that are self-similar at different spatial scales." (Michael Batty, "Cities as Complex Systems: Scaling, Interaction, Networks, Dynamics and Urban Morphologies", 2009) 

"A set whose (suitably defined) geometrical dimensionis non-integral. Typically, the set appears selfsimilar on all scales. A number of geometrical objects associated with chaos (e. g. strange attractors) are fractals." (Oded Regev, "Chaos and Complexity in Astrophysics", 2009) 

[Fractal system:] "A system characterized by a scaling law with a fractal, i. e., non-integer exponent. Fractal systems are self-similar, i. e., a magnification of a small part is statistically equivalent to the whole." (Jan W Kantelhardt, "Fractal and Multifractal Time Series", 2009) 

"An adjective or a noun representing complex configurations having scale-free characteristics or self-similar properties. Mathematically, any fractal can be characterized by a power law distribution." (Misako Takayasu & Hideki Takayasu, "Fractals and Economics", 2009) 

"Fractals are complex mathematical objects that are invariant with respect to dilations (self-similarity) and therefore do not possess a characteristic length scale. Fractal objects display scale-invariance properties that can either fluctuate from point to point (multifractal) or be homogeneous (monofractal). Mathematically, these properties should hold over all scales. However, in the real world, there are necessarily lower and upper bounds over which self-similarity applies." (Alain Arneodo et al, "Fractals and Wavelets: What Can We Learn on Transcription and Replication from Wavelet-Based Multifractal Analysis of DNA Sequences?", 2009) 

"Mathematical object usually having a geometrical representation and whose spatial dimension is not an integer. The relation between the size of the object and its “mass” does not obey that of usual geometrical objects." (Bastien Chopard, "Cellular Automata: Modeling of Physical Systems", 2009) 

"A fragmented geometric shape that can be split up into secondary pieces, each of which is approximately a smaller replica of the whole, the phenomenon commonly known as self similarity." (Khondekar et al, "Soft Computing Based Statistical Time Series Analysis, Characterization of Chaos Theory, and Theory of Fractals", 2013)

"A natural phenomenon or a mathematical set that exhibits a repeating pattern which can be replicated at every scale." (Rohnn B Sanderson, "Understanding Chaos as an Indicator of Economic Stability", 2016)

"Geometric pattern repeated at progressively smaller scales, where each iteration is about a reproduction of the image to produce completely irregular shapes and surfaces that can not be represented by classical geometry. Fractals are generally self-similar (each section looks at all) and are not subordinated to a specific scale. They are used especially in the digital modeling of irregular patterns and structures in nature." (Mauro Chiarella, "Folds and Refolds: Space Generation, Shapes, and Complex Components", 2016)
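Several of the definitions above mention a non-integer dimension. As a minimal sketch, assuming an exactly self-similar shape built from N copies of itself, each scaled down by a factor s, the similarity dimension is D = log(N) / log(s). The function name and the example shapes below are illustrative choices, not taken from the quoted sources.

```python
import math


def similarity_dimension(copies: int, scale_factor: float) -> float:
    """Similarity dimension D = log(N) / log(s) of an exactly self-similar set.

    copies       -- N, the number of smaller replicas that make up the whole
    scale_factor -- s, the factor by which each replica is shrunk (s > 1)
    """
    return math.log(copies) / math.log(scale_factor)


# Koch curve: each segment is replaced by 4 segments, each 1/3 as long.
koch = similarity_dimension(4, 3)

# Sierpinski triangle: 3 copies, each scaled down by a factor of 2.
sierpinski = similarity_dimension(3, 2)

# An ordinary line segment: 2 copies, each half as long, gives an integer dimension.
line = similarity_dimension(2, 2)

print(f"Koch curve:          {koch:.4f}")
print(f"Sierpinski triangle: {sierpinski:.4f}")
print(f"Line segment:        {line:.4f}")
```

The Koch curve and the Sierpinski triangle land strictly between 1 and 2, while the plain line segment gives exactly 1, which is the contrast with "usual geometrical objects" that the definitions point at.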

23 November 2011

Graphical Representation: Dimensions (Just the Quotes)

"Graphic comparisons, wherever possible, should be made in one dimension only." (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919)

"In general, the comparison of two circles of different size should be strictly avoided. Many excellent works on statistics approve the comparison of circles of different size, and state that the circles should always be drawn to represent the facts on an area basis rather than on a diameter basis. The rule, however, is not always followed and the reader has no way of telling whether the circles compared have been drawn on a diameter basis or on an area basis, unless the actual figures for the data are given so that the dimensions may be verified." (Willard C Brinton, "Graphic Methods for Presenting Facts", 1919)

"Readers of statistical diagrams should not be required to compare magnitudes in more than one dimension. Visual comparisons of areas are particularly inaccurate and should not be necessary in reading any statistical graphical diagram." (William C Marshall, "Graphical methods for schools, colleges, statisticians, engineers and executives", 1921)

"The bar chart is one of the most useful, simple, adaptable, and popular techniques in graphic presentation. The simple bar chart, with its many variations, is particularly appropriate for comparing the magnitude, or size, of coordinate items or of parts of a total. The basis of comparison in the bar chart is linear or one-dimensional. The length of each bar or of its components is proportional to the quantity or amount of each category represented." (Calvin F Schmid, "Handbook of Graphic Presentation", 1954)

"The common bar chart is particularly appropriate for comparing magnitude or size of coordinate items or parts of a total. It is one of the most useful, simple, and adaptable techniques in graphic presentation. The basis of comparison in the bar chart is linear or one-dimensional. The length of each bar or of its components is proportional to the quantity or amount of each category represented." (Anna C Rogers, "Graphic Charts Handbook", 1961)

"An especially effective device for enhancing the explanatory power of time-series displays is to add spatial dimensions to the design of the graphic, so that the data are moving over space (in two or three dimensions) as well as over time. […] Occasionally graphics are belligerently multivariate, advertising the technique rather than the data." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"Graphical integrity is more likely to result if these six principles are followed:
The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities represented.
Clear, detailed, and thorough labeling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graphic itself. Label important events in the data.
Show data variations, not design variations. 
In time-series displays of money, deflated and standardized units of monetary measurements are nearly always better than nominal units.
The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data.
Graphics must not quote data out of context." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"The time-series plot is the most frequently used form of graphic design. With one dimension marching along to the regular rhythm of seconds, minutes, hours, days, weeks, months, years, centuries, or millennia, the natural ordering of the time scale gives this design a strength and efficiency of interpretation found in no other graphic arrangement." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"The ducks of information design are false escapes from flatland, adding pretend dimensions to impoverished data sets, merely fooling around with information." (Edward R Tufte, "Envisioning Information", 1990)

"We envision information in order to reason about, communicate, document, and preserve that knowledge - activities nearly always carried out on two-dimensional paper and computer screen. Escaping this flatland and enriching the density of data displays are the essential tasks of information design." (Edward R Tufte, "Envisioning Information", 1990)

"Binning has two basic limitations. First, binning sacrifices resolution. Sometimes plots of the raw data will reveal interesting fine structure that is hidden by binning. However, advantages from binning often outweigh the disadvantage from lost resolution. [...] Second, binning does not extend well to high dimensions. With reasonable univariate resolution, say 50 regions each covering 2% of the range of the variable, the number of cells for a mere 10 variables is exceedingly large. For uniformly distributed data, it would take a huge sample size to fill a respectable fraction of the cells. The message is not so much that binning is bad but that high dimensional space is big. The complement to the curse of dimensionality is the blessing of large samples. Even in two and three dimensions having lots of data can be very helpful when the observations are noisy and the structure non-trivial." (Daniel B Carr, "Looking at Large Data Sets Using Binned Data Plots", [in "Computing and Graphics in Statistics"] 1991)

"Fitting is essential to visualizing hypervariate data. The structure of data in many dimensions can be exceedingly complex. The visualization of a fit to hypervariate data, by reducing the amount of noise, can often lead to more insight. The fit is a hypervariate surface, a function of three or more variables. As with bivariate and trivariate data, our fitting tools are loess and parametric fitting by least-squares. And each tool can employ bisquare iterations to produce robust estimates when outliers or other forms of leptokurtosis are present." (William S Cleveland, "Visualizing Data", 1993)

"The visual representation of a scale - an axis with ticks - looks like a ladder. Scales are the types of functions we use to map varsets to dimensions. At first glance, it would seem that constructing a scale is simply a matter of selecting a range for our numbers and intervals to mark ticks. There is more involved, however. Scales measure the contents of a frame. They determine how we perceive the size, shape, and location of graphics. Choosing a scale (even a default decimal interval scale) requires us to think about what we are measuring and the meaning of our measurements. Ultimately, that choice determines how we interpret a graphic." (Leland Wilkinson, "The Grammar of Graphics" 2nd Ed., 2005)

"It is tempting to make charts more engaging by introducing fancy graphics or three dimensions so they leap off the page, but doing so obscures the real data and misleads people, intentionally or not." (Brian Suda, "A Practical Guide to Designing with Data", 2010)

"One way a chart can lie is through overemphasis of the size and scale of items, particularly when the dimension of depth isn’t considered." (Brian Suda, "A Practical Guide to Designing with Data", 2010)

"Using colour, it’s possible to increase the density of information even further. A single colour can be used to represent two variables simultaneously. The difficulty, however, is that there is a limited amount of information that can be packed into colour without confusion." (Brian Suda, "A Practical Guide to Designing with Data", 2010)

"Bear in mind that the use of color doesn’t always help. Use it sparingly and with a specific purpose in mind. Remember that the reader’s brain is looking for patterns, and will expect both recurrence itself and the absence of expected recurrence to carry meaning. If you’re using color to differentiate categorical data, then you need to let the reader know what the categories are. If the dimension of data you’re encoding isn’t significant enough to your message to be labeled or explained in some way - or if there is no dimension to the data underlying your use of different colors - then you should limit your use so as not to confuse the reader." (Noah Iliinsky & Julie Steele, "Designing Data Visualizations", 2011)

"[...] the human brain is not good at calculating surface sizes. It is much better at comparing a single dimension such as length or height. [...] the brain is also a hopelessly lazy machine." (Alberto Cairo, "The Functional Art", 2011)

"Explanatory data visualization is about conveying information to a reader in a way that is based around a specific and focused narrative. It requires a designer-driven, editorial approach to synthesize the requirements of your target audience with the key insights and most important analytical dimensions you are wishing to convey." (Andy Kirk, "Data Visualization: A successful design process", 2012)

"A signal is a useful message that resides in data. Data that isn’t useful is noise. […] When data is expressed visually, noise can exist not only as data that doesn’t inform but also as meaningless non-data elements of the display (e.g. irrelevant attributes, such as a third dimension of depth in bars, color variation that has no significance, and artificial light and shadow effects)." (Stephen Few, "Signal: Understanding What Matters in a World of Noise", 2015)

"A time series is a sequence of values, usually taken in equally spaced intervals. […] Essentially, anything with a time dimension, measured in regular intervals, can be used for time series analysis." (Andy Kriebel & Eva Murray, "#MakeoverMonday: Improving How We Visualize and Analyze Data, One Chart at a Time", 2018)

"Color is difficult to use effectively. A small number of well-chosen colors can be highly distinguishable, particularly for categorical data, but it can be difficult for users to distinguish between more than a handful of colors in a visualization. Nonetheless, color is an invaluable tool in the visualization toolbox because it is a channel that can carry a great deal of meaning and be overlaid on other dimensions. […] There are a variety of perceptual effects, such as simultaneous contrast and color deficiencies, that make precise numerical judgments about a color scale difficult, if not impossible." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Maps also have the disadvantage that they consume the most powerful encoding channels in the visualization toolbox - position and size - on an aspect that is held constant. This leaves less effective encoding channels like color for showing the dimension of interest." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)
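Carr's remark above that "high dimensional space is big" can be checked with a little arithmetic. The sketch below takes the 50 regions per variable and the 10 variables from the quote; the function names and the sample size of one billion observations are assumptions made only for illustration.

```python
def cell_count(bins_per_variable: int, variables: int) -> int:
    """Number of cells when every variable is cut into the same number of bins."""
    return bins_per_variable ** variables


def max_fill_fraction(observations: int, cells: int) -> float:
    """Upper bound on the share of cells that can be non-empty.

    Each observation falls into exactly one cell, so n observations
    can occupy at most n cells.
    """
    return min(1.0, observations / cells)


# 50 regions per variable, as in the quote, for a growing number of variables.
for variables in (1, 2, 3, 10):
    print(f"{variables:>2} variables: {cell_count(50, variables):,} cells")

cells_10d = cell_count(50, 10)
# Hypothetical sample of one billion observations.
bound = max_fill_fraction(1_000_000_000, cells_10d)
print(f"At most {bound:.2e} of the 10-variable cells can be filled")
```

Even under the best case, where every observation lands in its own cell, a billion observations cannot touch more than a tiny fraction of the cells for 10 variables, which is the point of the quote.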

27 February 2010

Data Warehousing: Conformed Dimension (Definitions)

"A shared dimension that applies to two subject areas or data marts. By utilizing conformed dimensions, comparisons across data marts are meaningful." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"Dimensions are conformed when they are either exactly the same (including the keys) or one is a perfect subset of the other." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"A conformed dimension is one that is built for use by multiple data marts. Conformed dimensions promote consistency by enabling multiple data marts to share the same reference and hierarchy information." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A dimension whose data is reused by more than one dimensional design. When modeling multiple data marts, standards across marts with respect to dimensions are useful. Warehouse users may be confused when a dimension has the similar meaning but different names, structures, levels, or characteristics among multiple marts. Using standard dimensions throughout the warehousing environment can be referred to as 'conformed' dimensions." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"Dimension tables that are the same, or where one dimension table contains a perfect subset of the attributes of another. Conformance requires that data values be identical, and that the same combination of attribute values is present in each table." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A dimension that is shared between two or more fact tables. It enables the integration of data from different fact tables at query time. This is a foundational principle that enables the longevity of a data warehousing environment. By using conformed dimensions, facts can be used together, aligned along these common dimensions. The beauty of using conformed dimensions is that facts that were designed independently of each other, perhaps over a number of years, can be integrated. The use of conformed dimensions is the central technique for building an enterprise data warehouse from a set of data marts." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"Two sets of business dimensions represented in dimension tables are said to be conformed if both sets are identical in their attributes or if one set is an exact subset of the other. Conformed dimensions are fundamental in the bus architecture for a family of STARS." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A dimension that means and represents the same thing when linked to different fact tables." (DAMA International, "The DAMA Dictionary of Data Management", 2011)
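The definitions above describe a conformed dimension as one shared by several fact tables so that independently designed facts can be lined up along it. A minimal sketch of that idea follows, using plain Python structures instead of a database; the product dimension, the two fact tables, and all values are hypothetical.

```python
from collections import defaultdict

# Hypothetical conformed product dimension: surrogate key -> attributes.
product_dim = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Hardware"},
    3: {"name": "Manual", "category": "Print"},
}

# Two independently designed fact tables that reference the same dimension keys.
sales_fact = [(1, 100.0), (2, 250.0), (3, 40.0), (1, 60.0)]  # (product_key, amount)
inventory_fact = [(1, 12), (2, 5), (3, 30)]                  # (product_key, units_on_hand)


def total_by_category(fact_rows):
    """Aggregate one fact table by the category attribute of the shared dimension."""
    totals = defaultdict(float)
    for product_key, value in fact_rows:
        totals[product_dim[product_key]["category"]] += value
    return dict(totals)


sales = total_by_category(sales_fact)
stock = total_by_category(inventory_fact)

# Because both results are keyed by the same dimension attribute,
# they can be combined row by row at query time.
drill_across = {
    category: (sales.get(category, 0.0), stock.get(category, 0.0))
    for category in sorted(set(sales) | set(stock))
}

for category, (amount, units) in drill_across.items():
    print(f"{category}: sales={amount}, units on hand={units}")
```

If the two fact tables used differently keyed or differently named product dimensions, the final merge step would have no common attribute to align on, which is the situation conformance is meant to prevent.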

01 February 2010

Data Warehousing: Cube (Definitions)

"A subset of data, usually constructed from a data warehouse, that is organized and summarized into a multidimensional structure defined by a set of dimensions and measures. A cube's data is stored in one or more partitions." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"Name for a dimensional structure on a multidimensional or online analytical processing (OLAP) database platform, originally referring to the simple three-dimension case of product, market, and time." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed, 2002)

"Proprietary data structure used to store data for an online analytical processing (OLAP) end user data access and analysis tool." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A multidimensional data structure that represents the intersections of each unique combination of dimensions. At each intersection there is a cell that contains a data value." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"Used with online analytical processing (OLAP), data cubes are multidimensional structures built from one or more tables in a relational database(s)." (Sara Morgan & Tobias Thernstrom, "MCITP Self-Paced Training Kit: Designing and Optimizing Data Access by Using Microsoft SQL Server 2005 - Exam 70-442", 2007)

"A multidimensional structure that contains dimensions and measures. Cubes are a denormalized version of either the entire database or part of the database and are used within SQL Server Analysis Services (SSAS)." (Robert D Schneider & Darril Gibson, "Microsoft SQL Server 2008 All-in-One Desk Reference For Dummies", 2008)

"A set of data that is organized and summarized into a multidimensional structure defined by a set of dimensions and measures." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

"A database object that organizes data for accessibility in an OLAP database." (Ken Withee, "Microsoft® Business Intelligence For Dummies®", 2010)

"A multi-dimensional data structure that contains an aggregate value at each point, i.e., the result of applying an aggregate function to an underlying relation. Data cubes are used to implement OLAP." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Refers to the multidimensional data structure used to store and manipulate data in a multidimensional DBMS. The location of each data value in the data cube is based on the x-, y-, and z-axes of the cube. Data cubes are static (must be created before they are used), so they cannot be created by an ad hoc query." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed, 2011)

"A set of data that is organized and summarized into a multidimensional structure that is defined by a set of dimensions and measures." (Microsoft, "SQL Server 2012 Glossary", 2012)

"A multidimensional representation of data needed for online analytical processing, multidimensional reporting, or multidimensional planning applications." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"Cubes, also known as OLAP cubes, are preprocessed and presummarized collections of data that drastically improve query time. [...] OLAP cubes are logical structures as defined by the metadata." (Piethein Strengholt, "Data Management at Scale", 2020)
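Several definitions above describe a cube as a structure holding a presummarized value at each intersection of dimension members. The sketch below is one simple way to picture that, using the product, market, and time dimensions named in the Kimball & Ross definition; the rows, the `ALL` marker, and the function name are hypothetical, and real OLAP engines store and process cubes very differently.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact rows: (product, market, year, sales amount).
rows = [
    ("Widget", "North", 2009, 100),
    ("Widget", "South", 2009, 80),
    ("Gadget", "North", 2009, 50),
    ("Widget", "North", 2010, 120),
]

ALL = "*"  # marker for "summed over every member of this dimension"


def build_cube(fact_rows):
    """Precompute the sum of the measure for every combination of members.

    Each row contributes to 2**3 = 8 cells: for each dimension, either the
    specific member or the ALL level.
    """
    cube = defaultdict(int)
    for item, market, year, amount in fact_rows:
        for cell in product((item, ALL), (market, ALL), (year, ALL)):
            cube[cell] += amount
    return dict(cube)


cube = build_cube(rows)

# Queries become dictionary lookups instead of scans over the fact rows.
print(cube[("Widget", "North", 2009)])  # one specific intersection
print(cube[("Widget", ALL, ALL)])       # all Widget sales
print(cube[(ALL, "North", ALL)])        # all sales in the North market
print(cube[(ALL, ALL, ALL)])            # grand total
```

The trade-off is the one the definitions hint at: the cube must be built before it is queried, and the number of cells grows quickly with the number of dimensions and members.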

10 January 2010

Data Warehousing: Dimension Table (Definitions)

"A table in a data warehouse whose entries describe data in a fact table. Dimension tables present business entities." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A table in a data warehouse whose entries describe data in a fact table. Dimension tables present business entities. A database object stored in a data warehouse containing information used to reference the data stored in a fact table." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"A table in a star schema design that contains dimensional attributes and a surrogate key." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"The relational database table that contains information about each member of a dimension, such as its name, as well as other specific characteristics of each member." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A table that contains the data from which dimensions are created." (Jim Joseph et al, "Microsoft® SQL Server™ 2008 Reporting Services Unleashed", 2009)

"In the dimensional data model, each dimension table contains the attributes of a single business dimension. Product, store, salesperson, and promotional campaign are examples of business dimensions along which business measurements or facts are analyzed." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"The smaller tables used in a data warehouse to denote the attributes of a particular dimension, such as time, location, customer characteristics, product characteristics, etc." (Toby J Teorey, "Database Modeling and Design" 4th Ed., 2010)

"A table in a data warehouse whose entries describe data in a fact table." (Microsoft, "SQL Server 2012 Glossary", 2012)

"The representation of a dimension in a star schema. Each row in a dimension table represents all of the attributes for a particular member of the dimension. See also star join, star schema." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

Data Warehousing: Dimension (Definitions)

"A structural attribute of a cube, which is an organized hierarchy of categories (levels) that describe data in the fact table. These categories typically describe a similar set of members upon which the user wants to base an analysis. For example, a geography dimension might include levels for Country, Region, State or Province, and City. See also level; measure." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A list of labels that can be used to cross-tabulate values from other dimensions." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"A shorthand term that is used to refer to a dimension table or a dimension attribute." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"A group of related objects within a cube that's used to provide information about related data. For example, a product dimension could include a product name, a product category, a product size, product cost, and product price." (Robert D Schneider & Darril Gibson, "Microsoft SQL Server 2008 All-in-One Desk Reference For Dummies", 2008)

"A structural attribute of a cube upon which the user wants to base an analysis (for example, geography dimension). Dimension describes data in a fact table." (Jim Joseph, "Microsoft SQL Server 2008 Reporting Services Unleashed", 2009)

"Major business categories of information or groupings to describe business data. Dimensions contain information used for constraining queries, report headings, and defining drill paths. Within a dimension, specific attributes are the data elements that are used as row and column headers on reports. Dimensional attributes are also considered to be reference data. When describing the need to report information by region, by week, and by month, the attributes following 'by' are dimensions. Each of these would be included in a dimension." (Laura Reeves, "A Manager's Guide to Data Warehousing", 2009)

"An aspect of data that provides a way to divide it in an OLAP database (for example, a carmaker's OLAP database may organize product data by the dimensions of model, body style, engine type, and price point)." (Ken Withee, "Microsoft® Business Intelligence For Dummies®", 2010)

"A slice of data used in analysis and reporting. For example, in a report that shows sales by customer and product for the year ending December 2009, "customer," "product," and "time" would be the dimensions used." (Janice M Roehl-Anderson, "IT Best Practices for Financial Managers", 2010)

"In a data warehouse, a data element that categorizes each item in a data set into nonoverlapping regions." (Craig S Mullins, "Database Administration", 2012)

"In multidimensional data, a structural attribute of a cube that organizes data to enable in-depth business analysis." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

10 April 2009

DBMS: Surrogate Key (Definitions)

"A unique identifier for a row within a database table. A surrogate, or candidate, key can be made up of one or more columns. By definition, every table must have at least one surrogate key (in which case it becomes the primary key for a table automatically). However, it is possible for a table to have more than one surrogate key (in which case one of them must be designated as the primary key). Any surrogate key that is not the primary key is called the alternate key." (Microsoft Corporation, "SQL Server 7.0 System Administration Training Kit", 1999)

"A primary key that is typically invisible to the end user. Normally, surrogate keys are used where end users have their own pre-existing identification schemes (such as an ISBN in a database of books), so the users can modify their existing identifiers." (Bill Pribyl & Steven Feuerstein, "Learning Oracle PL/SQL", 2001)

"Integer keys that are sequentially assigned as needed in the staging area to populate a dimension table and join to the fact table. In the dimension table, the surrogate key is the primary key. In the fact table, the surrogate key is a foreign key to a specific dimension and may be part of the fact table’s primary key, although this is not required. A surrogate key usually cannot be interpreted by itself. That is, it is not a smart key in any way. Surrogate keys are required in many data warehouse situations to handle slowly changing dimensions, as well as missing or inapplicable data. Also known as artificial keys, integer keys, meaningless keys, non-natural keys, and synthetic keys." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"A surrogate key is a substitute key that is usually an arbitrary numeric value assigned by the load process or the database system. The advantage of the surrogate key is that it can be structured so that it is always unique throughout the span of integration for the data warehouse." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"A single-part, artificially established identifier for an entity. Surrogate key assignment is a special case of derived data - one where the primary key is derived. A common way of deriving surrogate key values is to assign integer values sequentially." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

[artificial key:] "A system-generated, nonsignificant, surrogate identifier or globally unique identifier (GUID) used to uniquely identify a row in a table. This is also known as a surrogate key." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"A redundant, unique key generated for a record in a data warehouse table to allow integration of data from multiple source systems and to support changing data over time." (Reed Jacobsen & Stacia Misner, "Microsoft SQL Server 2005 Analysis Services Step by Step", 2006)

"The primary key column of a dimension table. The surrogate key is unique to the data warehouse. Key values have no intrinsic meaning, and are assigned as part of the ETL process. By avoiding the use of a natural key, the data warehouse is able to handle changes to operational data in a different manner from transaction systems. The use of a surrogate key also eliminates the need to join fact and dimension tables via multi-part keys." (Christopher Adamson, "Mastering Data Warehouse Aggregates", 2006)

"Used as a replacement or substitute for a descriptive primary key, allowing for better control, better structure, less storage space, more efficient indexing, and absolute surety of uniqueness. Surrogate keys are usually integers, and usually automatically generated using auto counters or sequences." (Gavin Powell, "Beginning Database Design", 2006)

"An artificial key field, usually with system-assigned sequential numbers, used in the dimensional model to link a dimension table to the fact table. In a dimension table, the surrogate key is the primary key which becomes a foreign key in the fact table." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010)

"A single-part, artificially established, physical identifier for a data set, usually not visible to business users, and used for database management and performance. Surrogate key assignment is a special case of derived data - one where the primary key is derived. A common way of deriving surrogate key values is to assign integer values sequentially. Sometimes referred to as a dummy key, sequential key, or auto-number field." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"A system-assigned primary key, generally numeric and auto-incremented." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)
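Most of the definitions above agree that surrogate keys are meaningless integers assigned sequentially during loading, and that they let a warehouse keep several versions of the same source record. A minimal sketch of that behaviour follows; the class, the customer identifiers, and the rule of adding a new row whenever an attribute changes are assumptions chosen for illustration, not a description of any specific ETL tool.

```python
from itertools import count


class CustomerDimension:
    """Toy dimension table that assigns sequential integer surrogate keys."""

    def __init__(self):
        self._next_key = count(1)  # 1, 2, 3, ... with no business meaning
        self.rows = {}             # surrogate_key -> (natural_key, attributes)
        self._current = {}         # natural_key -> surrogate key of latest version

    def load(self, natural_key, attributes):
        """Return the surrogate key for this record, adding a row if needed.

        An unchanged record reuses its key; a changed record gets a new row
        and a new key, so older fact rows still point at the old version.
        """
        current_key = self._current.get(natural_key)
        if current_key is not None and self.rows[current_key][1] == attributes:
            return current_key
        new_key = next(self._next_key)
        self.rows[new_key] = (natural_key, dict(attributes))
        self._current[natural_key] = new_key
        return new_key


dim = CustomerDimension()
k1 = dim.load("CUST-001", {"city": "Berlin"})  # first sighting of this customer
k2 = dim.load("CUST-002", {"city": "Paris"})   # a different customer
k3 = dim.load("CUST-001", {"city": "Berlin"})  # unchanged, so the key is reused
k4 = dim.load("CUST-001", {"city": "Munich"})  # changed, so a new version is added

print(k1, k2, k3, k4)
print(dim.rows)
```

The natural key `CUST-001` appears in two rows with different surrogate keys, which is why the natural key alone could not serve as the primary key of such a dimension table.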

About Me

IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.