31 December 2020

Graphical Representation: Pie Charts (An Introduction)

Graphical Representation

From business dashboards to newspapers and other forms of content that capture the attention of average readers, pie charts seem to be one of the most used forms of graphical representation. Unfortunately, their characteristics make them inappropriate for displaying certain types of data, and of being misused. Therefore, there are many voices who advice against using them for any form of display.

It’s hard to agree with radical statements like ‘avoid (using) pie charts’ or ’pie charts are bad’. Each form of graphical representation (aka graphical tool, graphic) has advantages and disadvantages, which makes it appropriate or inappropriate for displaying data having certain characteristics. In addition, each tool can be easily misused, especially when basic representational practices are ignored. Avoiding one representational tool doesn’t mean that the use of another tool will be correct. Therefore, it’s important to make people aware of these aspects and let them decide which tools they should use. 

From a graphical tool is expected to represent and describe a dataset in a small area without distorting the reality, while encouraging the reader to compare the different pieces of information, when possible at different levels of details [1] or how they change over time. As form of communication, they encode information and meaning; the reader needs to be able to read, understand and think critically about graphics and data – what is known as graphical/data literacy.

A pie chart consists of a circle split into wedge-shaped slices (aka edges, segments), each slice representing a group or category (aka component). It resembles with the spokes of a wheel, however with a few exceptions they are seldom equidistant. The size of each slice is proportional to the percentage of the component when compared to the whole. Therefore, pie charts are ideal when displaying percentages or values that can be converted into percentages. Thus, the percentages must sum up to 100% (at least that’s readers’ expectation).

Within or besides the slices are displayed components’ name and sometimes the percentages or other numeric or textual information associated with them (Fig. 1-4).  The percentages become important when the slices seem to be of equal sizes. As long the slices have the same radius, comparison of the different components resumes in comparing arcs of circles or the chords defined by them, thing not always straightforward. 3-dimensional displays can upon case make the comparison more difficult.

Pie Chart Examples

The comparison increases in difficulty with the number of slices increases beyond a certain number. Therefore, it’s not recommended displaying more than 5-10 components within the same chart. If the components exceed this limit, the exceeding components can be summed up within an “other” component. 

Within a graphic one needs a reference point that can be used as starting point for exploration. Typically for categorical data this reference point is the biggest or the smallest value, the other values being sorted in ascending, respectively descending order, fact that facilitates comparing the values. For pie charts, this would mean sorting the slices based on their sizes, except the slice for “others” which is typically considered last.

The slices can be filled optionally with meaningful colors or (hashing) patterns. When the same color pallet is used, the size can be reflected in colors’ hue, however this can generate confusion when not applied adequately. It’s recommended to provide further (textual) information when the graphical elements can lead to misinterpretations. 

Pie charts can be used occasionally for comparing the changes of the same components between different points in time, geographies (Fig. 5-6) or other types of segmentation. Having the charts displayed besides each other and marking each component with a characteristic color or pattern facilitate the comparison. 

Pie Charts - Geographies

30 December 2020

Data Warehousing: ETL (The Transform Subprocess)

Data Warehousing

As part of the ETL process, the Transform subprocess is responsible for bridging the gap between source and destination by leveraging SQL or the rich set of (data) transformations available in ETL tools, either to enable the implicit or explicit conversion between source and destination data types, or to transform the data as needed. 

Transformations act on data as operators, the challenge being to transform the data in the smallest number of steps in the most efficient way. Some of the transformations available in the ETL tools (e.g. conversions, sorting, sampling, joins, lookups, aggregation, pivoting, unpivoting) can be replaced by SQL-based logic. One can easily prepare the data directly in the extraction query, taking thus advantage of the power provided by the database engines. Moreover, the logic can be encapsulated in views or other objects and called as required by the extraction logic when the source database allows it. This approach allows maintaining the logic independently of the ETL packages.

Unfortunately, SQL can replace the transformations that address sequential logic and not workflow-related logic (e.g. conditional splitd, merges, multicasts, slowly changing dimensions) or logic that includes certain computational complexity (e.g. fuzzy groupings or lookups). Such gaps need to be filled by the ETL tools via the built-in transformations, by allowing developers to build custom logic or simple use COTS solutions, when they prove capable of filling the gap. 

Copying the data 1:1 at table or entity-level from the source system(s) involves in theory the simplest transformations, transformations revolving mainly around conversions between data types. The casual troublemakers are the numeric and date values, which can be found in different formats or precisions in the various environments. As this can apply to the ETL environment itself, it’s important to consider environment-agnostic data types when possible (e.g. strings). 

Other sources for concerns are the user-defined data types which don’t have equivalents between the systems, needing thus additional transformations for further handing, respectively the invalid values which need to be handled accordingly. Besides the data from the source system(s) and the derived values, upon case one needs to consider the parameter-based or hardcoded metadata created in the process. 

Independently of the purpose of the ETL packages it is usually required to document the data flow associated with them and the rules applied in transformations in what is known as a mapping document. Such a document needs to be understandable by the business, as it can serve for Data Management, projects, or other purposes.  Even if it’s almost impossible to document everything, at minimum needs to be provided the source and destination tables, the attributes considered in the mappings, respectively the most important rules the business should be aware of. Otherwise, the technical people can always turn back to the SQL queries, when needed. 

Some sources consider each non-trivial transformation as a business rule. Even if the rules used in transformations constrain the (business) data, not each rule is relevant for the business to the degree that it constrains some part of the business.

Data Migrations involve transformations between (database) schemas. Therefore, the logic requested to move the data could be handled in theory with a few well-designed packages, though there are considerations like logic complexity, transparency, flexibility, performance or auditability which could be better handled by using other techniques (e.g. saving the data in intermediary tables, breaking down the logic in several steps). Such considerations can apply also to simple ETL packages. Therefore, it’s important to recognize such scenarios, weight the choices and choose what fits best. However, unless one knows what one’s doing, it’s recommended to use the methods one knows best. 

29 December 2020

Project Management: Project Planning (Just the Quotes)

"Planning starts usually with something like a general idea. For one reason or another it seems desirable to reach a certain objective, and how to reach it is frequently not too clear. The first step then is to examine the idea carefully in the light of the means available. Frequently more fact-finding about the situation is required. If this first period of planning is successful, two items emerge: namely, an 'over-all plan' of how to reach the objective and secondly, a decision in regard to the first step of action. Usually this planning has also somewhat modified the original idea. The next period is devoted to executing the first step of the original plan." (Kurt Lewin, "Action research and minority problems", 1946)

"Every company has beloved projects on which if prices had held up, if the contractors had finished on time (or finished at all), if the plans hadn't been altered, if the thing had actually worked, the planned return would have been earned. But since some or all of these calamities [things that don't go as expected] usually happen, any manager who neglects to allow for them is not planning - merely thinking wishfully. Desire for the project has, as usual, overtaken desire for profit." (Ernest Dale, "Planning and developing the company organization structure", 1952)

"Since software construction is inherently a systems effort - an exercise in complex interrelationships - communication effort is great, and it quickly dominates the decrease in individual task time brought about by partitioning [increasing the workers]. Adding more people then lengthens, not shortens, the schedule." (Frederick Brook, "The Mythical Man-Month", 1975)

"Because one has to be an optimist to begin an ambitious project, it is not surprising that underestimation of completion time is the norm." (Fernando J Corbató, "On Building Systems That Will Fail", 1991)

"If we decide to plan not to lose, we take a defensive posture in which we expend huge amounts of effort trying to prevent and track errors. This will lead us to a very heavyweight planning process in which we try to plan everything up front in a much detail as possible. Such a process will have many review steps, sign-offs, authorizations, and phase gates. Such a planning process is highly adept at making sure that blame can be assigned when something fails; but takes no direct steps towards making sure that the right system is delivered at a reasonable cost." (Kent Beck & Martin Fowler, "Planning Extreme Programming", 2000)

"One of the purposes of planning is we always want to work on the most valuable thing possible at any given time. We can’t pick features at random and expect them to be most valuable. We have to begin development by taking a quick look at everything that might be valuable, putting all our cards on the table. At the beginning of each iteration the business (remember the balance of power) will pick the most valuable features for the next iteration." (Kent Beck & Martin Fowler, "Planning Extreme Programming", 2000)

"Planning is not about predicting the future. When you make a plan for developing a piece of software, development is not going to go like that. Not ever. Your customers wouldn’t even be happy if it did, because by the time software gets there, the customers don’t want what was planned, they want something different." (Kent Beck & Martin Fowler, "Planning Extreme Programming", 2000)

"Projects sometimes fail long before they deliver anything. At some point they may be determined to be too expensive to continue. Or perhaps they took too long to develop and the business need evaporated. Or perhaps the requirements change so often that the developers can never finish one thing without having to stop and start all over on something new. Certainly these are planning failures." (Kent Beck & Martin Fowler, "Planning Extreme Programming", 2000)

"There are two ways to approach prevention of these planning failures. We can plan not to lose, or we can plan to win. The two are not identical. Planning not to lose is defensive; while planning to win is aggressive. [...] the problem that planning is supposed to solve is simply, to build the right system at the right cost. If we take a defensive posture by planning not to lose, we will be able to hold people accountable for any failures; but at an enormous cost. If we take an aggressive posture and plan to win, we will be unafraid to make errors, and will continuously correct them to meet our goals.(Kent Beck & Martin Fowler, "Planning Extreme Programming", 2000)

"We plan because: We need to ensure that we are always working on the most important thing we need to do. We need to coordinate with other people. When unexpected events occur we need to understand the consequences for the first two." (Kent Beck & Martin Fowler, "Planning Extreme Programming", 2000)

"When we plan to win we take direct steps to ensure that we are building the right system at the best possible cost. Every action we take goes towards that end. Instead of trying to plan everything up front, we plan just the next few steps; and then allow customer feedback to correct our trajectory. In this way, we get off the mark quickly, and then continuously correct our direction. Errors are unimportant because they will be corrected quickly." (Kent Beck & Martin Fowler, "Planning Extreme Programming", 2000)

"Project planning is the key to effective project management. Detailed and accurate planning of a project produces the managerial information that is the basis of project justification (costs, benefits, strategic impact, etc.) and the defining of the business drivers (scope, objectives) that form the context for the technical solution. In addition, project planning also produces the project schedules and resource allocations that are the framework for the other project management processes: tracking, reporting, and review." (Rob Thomsett, "Radical Project Management", 2002)

"And even if we make good plans based on the best information available at the time and people do exactly what we plan, the effects of our actions may not be the ones we wanted because the environment is nonlinear and hence is fundamentally unpredictable. As time passes the situation will change, chance events will occur, other agents such as customers or competitors will take actions of their own, and we will find that what we do is only one factor among several which create a new situation." (Stephen Bungay, "The Art of Action: How Leaders Close the Gaps between Plans, Actions, and Results", 2010)

28 December 2020

Data Warehousing: ETL (The Load Subprocess)

Data Warehousing

As part of the ETLprocess, the Load subprocess is responsible for loading the data into the destination table(s). It covers in theory the final steps from the data pipeline and in most of the cases it matches the definition of the query used for data extraction, though this depends also on the transformations used in the solution.

A commonly used approach is dumping the data into an intermediary table from the staging area, table with no constraints that matches only the data types from the source. Once the data loaded, they are further copied into the production table. This approach allows minimizing the unavailability of the production table as the load from an external data source normally takes longer than copying the data within the same database or instance. That might not be the case when the data are available in the same data center, however loading the data first in a staging table facilitates troubleshooting and testing. This approach allows also dropping the indexes on the production table before loading the data and recreating them afterwards. In practice, this proves to be an efficient method for improving data loads’ efficiency.

In general, it’s recommended to import the data 1:1 compared with the source query, though the transformations used can increase or decrease the number of attributes considered. The recommendation applies as well to the cases in which data come from different sources, primarily to separate the pipelines, as systems can have different refreshing requirements and other constraints.

One can consider adding a timestamp reflecting the refresh date and upon case also additional metadata (e.g. identifier for source system, unique identifier for the record). The timestamp is especially important when the data are imported incrementally - only the data created since the last load are loaded. Except the unique identifier, these metadata can however be saved also in a separate table, with the same granularity as the table (1:1) or one record for each load per table and system, storing a reference to the respective record into the load table. There are seldom logical argumentations for using the former approach, while the latter works well when the metadata are used only for auditing purposes. If the metadata are needed in further data processing and performance is important, then the metadata can be considered directly in the load table(s).

A special approach is considered by the Data Vault methodology for Data Warehousing which seems to gain increasing acceptance, especially to address the various compliance requirements for tracking the change in records at most granular level. To achieve this the fact and dimension tables are split into several tables – the hub tables store the business keys together with load metadata, the link tables store the relationships between business keys, while satellite tables store the descriptions of the business keys (the other attributes except the business key) and reference tables store the dropdown values. Besides table’s denormalization there are several other constraints that apply. The denormalization of the data over multiple tables can increase the overall complexity and come with performance penalties, as more tables need to be joined, however it might be the price to pay if traceability and auditability are a must.

There are scenarios in which the requirements for the ETL packages are driven by the target (load) tables – the format is already given - one needing thus to accommodate the data into the existing tables or extended the respective tables to accommodate more attributes. It’s the case for load tables storing data from multiple systems with similar purpose (e.g. financial data from different ERP systems needed for consolidations).

27 December 2020

Data Warehousing: ETL (The Extract Subprocess)


Data Warehousing

As part of the ETL process with applicability to Data Warehousing, Data Migrations, Data Integrations or similar scenarios the extraction subprocess is responsible for preparing and implementing the logic required to extract the data from the various source systems at the required level of detail. The extraction is done typically based on SQL queries as long one deals with relational databases or any OLEDB or ODBC-based data repositories including flat or MS Office files.

One can consider the preparation of the extraction logic as separate design subprocess of the targeted solution. Even if high-level design decisions are considered at the respective level, the low-level design needs to be considered at ETL package level. As part of the process are identified the source of the data in terms of system, tables and attributes to be imported, as well the joins, business and transformation rules that need to be applied on the data. This can involve reengineering the logic from the source system(s) as well data profiling, discovery or exploration activities.

A common practice is to copy the source tables 1:1 into the solution, eventually by considering only the needed attributes to minimize the necessary space, loading time and content’s complexity, even if this would add more effort into the design phase to identify only the needed attributes. If further attributes are identified at a later stage, the packages need to be modified accordingly. If the data volume or the number of unnecessary attributes is neglectable, copying the table 1:1 could prove to be the best strategy.

A second approach is to model within the extraction the (business) entity as designed within the source system. For example, the entity could be split over multiple tables from design or other purposes. Thus, the extraction query will attempt modeling the entity. This approach reduces to some degree the number of tables from the targeted solution, as well the number of ETL packages involved, while providing a clear depiction of the entities involved.

A third approach is to extract the data as needed by the target system, eventually as a mix between master and transaction data, fact which could easily lead to data redundancy with different timeliness and all the consequences resulting from this. This approach is usually met in solutions which require fast data availability in the detriment of design.

Unfortunately, there can be design constraints or choice considerations that could lead to a mix between these approaches. If the impact caused by the mix between the first two approaches is minimal, the third approach can cause more challenges, though it might be a small price to pay as long the considered data are disconnected from other data.

To reduce the redundancy of data, it’s recommended to consider as goal creating a unique source of facts, which can be obtained by minimizing as much as possible the overlaps between tables, respectively entities. Ideally there should be no overlaps. On the other sides the overlaps can be acceptable when the same data are available in more systems and the solution requires all the data to be available.

If the above approaches consider the vertical partitioning of the data, there can be also horizontal partitioning needs especially when a subset of the data is needed or when is needed to partition the data based on a set of values. In addition, one might be forced to include also transformation rules directly into the extraction logic, for example to handle conversion issues or minimize certain design overhead early in the process. In practice it makes sense to link such choices to business rules and document them accordingly.

Data Warehousing: Data Vault 2.0 (The Good, the Bad and the Ugly)

Data Warehousing

One of the interesting concepts that seems to gain adepts in Data Warehousing is the Data Vault – a methodology, architecture and implementation for Data Warehouses (DWH) developed by Dan Linstedt between 1990 and 2000, and evolved into an open standard with the 2.0 version.

According to its creator, the Data Vault is a detail-oriented, historical tracking and uniquely linked set of normalized tables that support one or more business functional areas [2]. To hold data at the lowest grain of detail from the source system(s) and track the changes occurred in the data, it splits the fact and dimension tables into hubs (business keys), links (the relationships between business keys), satellites (descriptions of the business keys), and reference (dropdown values) tables [3], while adopting a hybrid approach between 3rd normal form and star schemas. In addition, it provides a two- or three-layered data integration architecture, a series of standards, methods and best practices supposed to facilitate its use.

It integrates several other methodologies that allow bridging the gap between the technical, logistic and execution parts of the DWH life-cycle – the PMI methodology is used for the various levels of planning and execution, while the Scrum methodology is used for coordinating the day-to-day project tasks. Six Sigma is used together with Total Quality Management for the design and continuous improvement of DWH and data-related processes. In addition, it follows the CMMI maturity model for providing a clear baseline for benchmarking an organization’s DWH capabilities in development, acquisition and service areas.

The Good: The decomposition of the source data models into hub, link and satellite tables provides traceability and auditability at raw data level, allowing thus to address the compliance requirements of Sarabanes-Oxley, HIPPA and Basel II by design.

The considered standards, methods, principles and best practices are leveraged from Software Engineering [1], establishing common ground and a standardized approach to DWH design, implementation and testing. It also narrows down the learning and implementation paths, while allowing an incremental approach to the various phases.

Data Vault 2.0 offers support for real-time, near-real-time and unstructured data, while new technologies like MapReduce, NoSQL can be integrated within its architecture, though the same can be said about other approaches as long there’s compatibility between the considered technologies. In fact, except business entities’ decomposition, many of the notions used are common to DWH design.

The Bad: Further decomposing the fact and dimension tables can impact the performance of the queries run against the tables as more joins are required to gather the data from the various tables. The further denormalization of tables can lead to higher data storage needs, though this can be neglectable compared with the volume of additional objects that need to be created in DWH. For an ERP system with a few hundred of meaningful tables the complexity can become overwhelming.

Unless one uses a COTS tool which automates some part of the design and creation process, building everything from scratch can be time-consuming, increasing thus the time-to-market for solutions. However, the COTS tools can introduce restrictions of their own, which can negatively impact the overall experience with the methodology.

The incorporation of non-technical methodologies can have positive impact, though unless one has experience with the respective methodologies, the disadvantages can easily overshadow the (theoretical) advantages.

The Ugly: The dangers of using Data Vault can be corroborated as usual with the poor understanding of the methodology, poor level of skillset or the attempt of implementing the methodology without allowing some flexibility when required. Unless one knows what he is doing, bringing more complexity in a field which is already complex, can easily impact negatively projects’ outcomes.

[1] Dan Linstedt & Michael Olschimke (2015) Building a Scalable Data Warehouse with Data Vault 2.0
[2] Dan Linstedt (?) Data Vault Basics [source]
[3] Dan Linstedt (2018) Data Vault: Data Modeling Specification v 2.0.2 [source]

20 December 2020

IT: Computers (Just the Quotes)

"Let it be remarked [...] that an important difference between the way in which we use the brain and the machine is that the machine is intended for many successive runs, either with no reference to each other, or with a minimal, limited reference, and that it can be cleared between such runs; while the brain, in the course of nature, never even approximately clears out its past records. Thus the brain, under normal circumstances, is not the complete analogue of the computing machine but rather the analogue of a single run on such a machine." (Norbert Wiener, "Cybernetics: Or Control and Communication in the Animal and the Machine", 1948)

"A computer would deserve to be called intelligent if it could deceive a human into believing that it was human." (Alan Turing, "Computing Machinery and Intelligence" , Mind Vol. 59, 1950)

"A computer is a person or machine that is able to take in information (problems and data), perform reasonable operations on the iformation, and put out answers. A computer is identified by the fact that it (or he) handles information reasonably." (Edmund C Berkeley & Lawrence Wainwright, Computers: Their Operation and Applications", 1956)

"An information retrieval system is therefore defined here as any device which aids access to documents specified by subject, and the operations associated with it. The documents can be books, journals, reports, atlases, or other records of thought, or any parts of such records - articles, chapters, sections, tables, diagrams, or even particular words. The retrieval devices can range from a bare list of contents to a large digital computer and its accessories. The operations can range from simple visual scanning to the most detailed programming." (Brian C Vickery, "The Structure of Information Retrieval Systems", 1959)

"Computers do not decrease the need for mathematical analysis, but rather greatly increase this need. They actually extend the use of analysis into the fields of computers and computation, the former area being almost unknown until recently, the latter never having been as intensively investigated as its importance warrants. Finally, it is up to the user of computational equipment to define his needs in terms of his problems, In any case, computers can never eliminate the need for problem-solving through human ingenuity and intelligence." (Richard E Bellman & Paul Brock, "On the Concepts of a Problem and Problem-Solving", American Mathematical Monthly 67, 1960)

"There is the very real danger that a number of problems which could profitably be subjected to analysis, and so treated by simpler and more revealing techniques. will instead be routinely shunted to the computing machines [...] The role of computing machines as a mathematical tool is not that of a panacea for all computational ills." (Richard E Bellman & Paul Brock, "On the Concepts of a Problem and Problem-Solving", American Mathematical Monthly 67, 1960)

"These machines have no common sense; they have not yet learned to 'think', and they do exactly as they are told, no more and no less. This fact is the hardest concept to grasp when one first tries to use a computer." (Donald Knuth, "The Art of Computer Programming, Volume 1: Fundamental Algorithms", 1968)

 "Because the subject matter of cybernetics is the propositional or informational aspect of the events and objects in the natural world, this science is forced to procedures rather different from those of the other sciences. The differentiation, for example, between map and territory, which the semanticists insist that scientists shall respect in their writings must, in cybernetics, be watched for in the very phenomena about which the scientist writes. Expectably, communicating organisms and badly programmed computers will mistake map for territory; and the language of the scientist must be able to cope with such anomalies." (Gregory Bateson, "Steps to an Ecology of Mind", 1972)

"Computers can do better than ever what needn't be done at all. Making sense is still a human monopoly." (Marshall McLuhan, "Take Today: The Executive as Dropout", 1972)

"Everything we think we know about the world is a model. Every word and every language is a model. All maps and statistics, books and databases, equations and computer programs are models. So are the ways I picture the world in my head - my mental models. None of these is or ever will be the real world. […] Our models usually have a strong congruence with the world. That is why we are such a successful species in the biosphere. Especially complex and sophisticated are the mental models we develop from direct, intimate experience of nature, people, and organizations immediately around us." (Donella Meadows, "Limits to Growth", 1972)

"It follows from this that man's most urgent and pre-emptive need is maximally to utilize cybernetic science and computer technology within a general systems framework, to build a meta-systemic reality which is now only dimly envisaged. Intelligent and purposeful application of rapidly developing telecommunications and teleprocessing technology should make possible a degree of worldwide value consensus heretofore unrealizable." (Richard F Ericson, "Visions of Cybernetic Organizations", 1972)

"The mind is defined as the sum total of all the programs and the metaprograms of a given human computer, whether or not they are immediately elicitable, detectable, and visibly operational to the self or to others." (John C Lilly "Programming and Metaprogramming in the Human Biocomputer" 2nd Ed., 1972)

"The pseudo approach to uncertainty modeling refers to the use of an uncertainty model instead of using a deterministic model which is actually (or at least theoretically) available. The uncertainty model may be desired because it results in a simpler analysis, because it is too difficult (expensive) to gather all the data necessary for an exact model, or because the exact model is too complex to be included in the computer." (Fred C Scweppe, "Uncertain dynamic systems", 1973)

"Computers make possible an entirely new relationship between theories and models. I have already said that theories are texts. Texts are written in a language. Computer languages are languages too, and theories may be written in them. Indeed, for the present purpose we need not restrict our attention to machine languages or even to the kinds of 'higher-level' languages we have discussed. We may include all languages, specifically also natural languages, that computers may be able to interpret. The point is precisely that computers do interpret texts given to them, in other words, that texts determine computers' behavior. Theories written in the form of computer programs are ordinary theories as seen from one point of view." (Joseph Weizenbaum, "Computer power and human reason: From judgment to calculation" , 1976)

"Man is not a machine, [...] although man most certainly processes information, he does not necessarily process it in the way computers do. Computers and men are not species of the same genus. [...] No other organism, and certainly no computer, can be made to confront genuine human problems in human terms. [...] However much intelligence computers may attain, now or in the future, theirs must always be an intelligence alien to genuine human problems and concerns." (Joesph Weizenbaum, Computer Power and Human Reason: From Judgment to Calculation, 1976)

"The connection between a model and a theory is that a model satisfies a theory; that is, a model obeys those laws of behavior that a corresponding theory explicitly states or which may be derived from it. [...] Computers make possible an entirely new relationship between theories and models. [...] A theory written in the form of a computer program is [...] both a theory and, when placed on a computer and run, a model to which the theory applies." (Joseph Weizenbaum, "Computer power and human reason: From judgment to calculation" , 1976)

"It is essential to realize that a computer is not a mere 'number cruncher', or supercalculating arithmetic machine, although this is how computers are commonly regarded by people having no familiarity with artificial intelligence. Computers do not crunch numbers; they manipulate symbols. [...] Digital computers originally developed with mathematical problems in mind, are in fact general purpose symbol manipulating machines." (Margaret A Boden, "Minds and mechanisms", 1981)

"The basic idea of cognitive science is that intelligent beings are semantic engines - in other words, automatic formal systems with interpretations under which they consistently make sense. We can now see why this includes psychology and artificial intelligence on a more or less equal footing: people and intelligent computers (if and when there are any) turn out to be merely different manifestations of the same underlying phenomenon. Moreover, with universal hardware, any semantic engine can in principle be formally imitated by a computer if only the right program can be found." (John Haugeland, "Semantic Engines: An introduction to mind design", 1981)

"Computers and robots replace humans in the exercise of mental functions in the same way as mechanical power replaced them in the performance of physical tasks. As time goes on, more and more complex mental functions will be performed by machines. Any worker who now performs his task by following specific instructions can, in principle, be replaced by a machine. This means that the role of humans as the most important factor of production is bound to diminish - in the same way that the role of horses in agricultural production was first diminished and then eliminated by the introduction of tractors."  (Wassily Leontief, National perspective: The definition of problem and opportunity, 1983)

"If arithmetical skill is the measure of intelligence, then computers have been more intelligent than all human beings all along. If the ability to play chess is the measure, then there are computers now in existence that are more intelligent than any but a very few human beings. However, if insight, intuition, creativity, the ability to view a problem as a whole and guess the answer by the “feel” of the situation, is a measure of intelligence, computers are very unintelligent indeed. Nor can we see right now how this deficiency in computers can be easily remedied, since human beings cannot program a computer to be intuitive or creative for the very good reason that we do not know what we ourselves do when we exercise these qualities." (Isaac Asimov, "Machines That Think", 1983)

"The digital-computer field defined computers as machines that manipulated numbers. The great thing was, adherents said, that everything could be encoded into numbers, even instructions. In contrast, scientists in AI [artificial intelligence] saw computers as machines that manipulated symbols. The great thing was, they said, that everything could be encoded into symbols, even numbers." (Allen Newell, "Intellectual Issues in the History of Artificial Intelligence", 1983)

"Computation offers a new means of describing and investigating scientific and mathematical systems. Simulation by computer may be the only way to predict how certain complicated systems evolve." (Stephen Wolfram, "Computer Software in Science and Mathematics", 1984)

"Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do." (Donald E Knuth, "Literate Programming", 1984)

"Scientific laws give algorithms, or procedures, for determining how systems behave. The computer program is a medium in which the algorithms can be expressed and applied. Physical objects and mathematical structures can be represented as numbers and symbols in a computer, and a program can be written to manipulate them according to the algorithms. When the computer program is executed, it causes the numbers and symbols to be modified in the way specified by the scientific laws. It thereby allows the consequences of the laws to be deduced." (Stephen Wolfram, "Computer Software in Science and Mathematics", 1984)

"The trouble with an analog computer is that one begins to construct mathematical models which can be treated using an analog computer. In many cases this is not realistic." (Richard E Bellman, "Eye of the Hurricane: An Autobiography", 1984)

"Under pressure from the computer, the question of mind in relation to machine is becoming a central cultural preoccupation." (Sherry Turkle, "The Second Self: Computers and the Human Spirit", 1984)

"A computer is an interpreted automatic formal system - that is to say, a symbol-manipulating machine." (John Haugeland, "Artificial intelligence: The very idea", 1985)

"Artificial intelligence is based on the assumption that the mind can be described as some kind of formal system manipulating symbols that stand for things in the world. Thus it doesn't matter what the brain is made of, or what it uses for tokens in the great game of thinking. Using an equivalent set of tokens and rules, we can do thinking with a digital computer, just as we can play chess using cups, salt and pepper shakers, knives, forks, and spoons. Using the right software, one system (the mind) can be mapped onto the other (the computer)." (George Johnson, Machinery of the Mind: Inside the New Science of Artificial Intelligence, 1986)

"Just like a computer, we must remember things in the order in which entropy increases. This makes the second law of thermodynamics almost trivial. Disorder increases with time because we measure time in the direction in which disorder increases."  (Stephen Hawking, "A Brief History of Time", 1988)

"Cybernetics is simultaneously the most important science of the age and the least recognized and understood. It is neither robotics nor freezing dead people. It is not limited to computer applications and it has as much to say about human interactions as it does about machine intelligence. Today’s cybernetics is at the root of major revolutions in biology, artificial intelligence, neural modeling, psychology, education, and mathematics. At last there is a unifying framework that suspends long-held differences between science and art, and between external reality and internal belief." (Paul Pangaro, "New Order From Old: The Rise of Second-Order Cybernetics and Its Implications for Machine Intelligence", 1988)

"A popular myth says that the invention of the computer diminishes our sense of ourselves, because it shows that rational thought is not special to human beings, but can be carried on by a mere machine. It is a short stop from there to the conclusion that intelligence is mechanical, which many people find to be an affront to all that is most precious and singular about their humanness." (Jeremy Campbell, "The improbable machine", 1989)

"Fuzziness, then, is a concomitant of complexity. This implies that as the complexity of a task, or of a system for performing that task, exceeds a certain threshold, the system must necessarily become fuzzy in nature. Thus, with the rapid increase in the complexity of the information processing tasks which the computers are called upon to perform, we are reaching a point where computers will have to be designed for processing of information in fuzzy form. In fact, it is the capability to manipulate fuzzy concepts that distinguishes human intelligence from the machine intelligence of current generation computers. Without such capability we cannot build machines that can summarize written text, translate well from one natural language to another, or perform many other tasks that humans can do with ease because of their ability to manipulate fuzzy concepts." (Lotfi A Zadeh, "The Birth and Evolution of Fuzzy Logic", 1989)

 "Looking at ourselves from the computer viewpoint, we cannot avoid seeing that natural language is our most important 'programming language'. This means that a vast portion of our knowledge and activity is, for us, best communicated and understood in our natural language. [...] One could say that natural language was our first great original artifact and, since, as we increasingly realize, languages are machines, so natural language, with our brains to run it, was our primal invention of the universal computer. One could say this except for the sneaking suspicion that language isn’t something we invented but something we became, not something we constructed but something in which we created, and recreated, ourselves. (Justin Leiber, "Invitation to cognitive science", 1991)

"The cybernetics phase of cognitive science produced an amazing array of concrete results, in addition to its long-term (often underground) influence: the use of mathematical logic to understand the operation of the nervous system; the invention of information processing machines (as digital computers), thus laying the basis for artificial intelligence; the establishment of the metadiscipline of system theory, which has had an imprint in many branches of science, such as engineering (systems analysis, control theory), biology (regulatory physiology, ecology), social sciences (family therapy, structural anthropology, management, urban studies), and economics (game theory); information theory as a statistical theory of signal and communication channels; the first examples of self-organizing systems. This list is impressive: we tend to consider many of these notions and tools an integrative part of our life […]" (Francisco Varela, "The Embodied Mind", 1991)

"A computer terminal is not some clunky old television with a typewriter in front of it. It is an interface where the mind and body can connect with the universe and move bits of it about." (Douglas N Adams, "Mostly Harmless", 1992)

"Finite Nature is a hypothesis that ultimately every quantity of physics, including space and time, will turn out to be discrete and finite; that the amount of information in any small volume of space-time will be finite and equal to one of a small number of possibilities. [...] We take the position that Finite Nature implies that the basic substrate of physics operates in a manner similar to the workings of certain specialized computers called cellular automata." (Edward Fredkin, "A New Cosmogony", PhysComp ’92: Proceedings of the Workshop on Physics and Computation, 1993)

"The insight at the root of artificial intelligence was that these 'bits' (manipulated by computers) could just as well stand as symbols for concepts that the machine would combine by the strict rules of logic or the looser associations of psychology." (Daniel Crevier, "AI: The tumultuous history of the search for artificial intelligence", 1993)

"At first glance the theory of numbers is deprived of any geometricity. But this is actually not the case. At the contemporary stage of development of computers it has become possible to explain to a wide range of readers that visual geometry helps not only to illustrate some abstract situations from the number theory, but sometimes also to solve new problems." (Anatolij Fomenko, "Visual Geometry and Topology", 1994)

"On the other hand, those who design and build computers know exactly how the machines are working down in the hidden depths of their semiconductors. Computers can be taken apart, scrutinized, and put back together. Their activities can be tracked, analyzed, measured, and thus clearly understood - which is far from possible with the brain. This gives rise to the tempting assumption on the part of the builders and designers that computers can tell us something about brains, indeed, that the computer can serve as a model of the mind, which then comes to be seen as some manner of information processing machine, and possibly not as good at the job as the machine. (Theodore Roszak, "The Cult of Information", 1994)

"Self-organization refers to the spontaneous formation of patterns and pattern change in open, nonequilibrium systems. […] Self-organization provides a paradigm for behavior and cognition, as well as the structure and function of the nervous system. In contrast to a computer, which requires particular programs to produce particular results, the tendency for self-organization is intrinsic to natural systems under certain conditions." (J A Scott Kelso, "Dynamic Patterns : The Self-organization of Brain and Behavior", 1995)

"Representation is the process of transforming existing problem knowledge to some of the known knowledge-engineering schemes in order to process it by applying knowledge-engineering methods. The result of the representation process is the problem knowledge base in a computer format." (Nikola K Kasabov, "Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering", 1996)

"Shearing away detail is the very essence of model building. Whatever else we require, a model must be simpler than the thing modeled. In certain kinds of fiction, a model that is identical with the thing modeled provides an interesting device, but it never happens in reality. Even with virtual reality, which may come close to this literary identity one day, the underlying model obeys laws which have a compact description in the computer - a description that generates the details of the artificial world." (John H Holland, "Emergence" , Philosophica 59, 1997)

"Modelling techniques on powerful computers allow us to simulate the behaviour of complex systems without having to understand them.  We can do with technology what we cannot do with science.  […] The rise of powerful technology is not an unconditional blessing.  We have  to deal with what we do not understand, and that demands new  ways of thinking." (Paul Cilliers,"Complexity and Postmodernism: Understanding Complex Systems", 1998)

"For most problems found in mathematics textbooks, mathematical reasoning is quite useful. But how often do people find textbook problems in real life? At work or in daily life, factors other than strict reasoning are often more important. Sometimes intuition and instinct provide better guides; sometimes computer simulations are more convenient or more reliable; sometimes rules of thumb or back-of-the-envelope estimates are all that is needed." (Lynn A Steen,"Twenty Questions about Mathematical Reasoning", 1999)

"Once a computer achieves human intelligence it will necessarily roar past it." (Ray Kurzweil, "The Age of Spiritual Machines: When Computers Exceed Human Intelligence", 1999)

"The classic example of chaos at work is in the weather. If you could measure the positions and motions of all the atoms in the air at once, you could predict the weather perfectly. But computer simulations show that tiny differences in starting conditions build up over about a week to give wildly different forecasts. So weather predicting will never be any good for forecasts more than a few days ahead, no matter how big (in terms of memory) and fast computers get to be in the future. The only computer that can simulate the weather is the weather; and the only computer that can simulate the Universe is the Universe." (John Gribbin, "The Little Book of Science", 1999)

"Conventional wisdom, fooled by our misleading 'physical intuition', is that the real world is continuous, and that discrete models are necessary evils for approximating the 'real' world, due to the innate discreteness of the digital computer." (Doron Zeilberger, "'Real' Analysis is a Degenerate Case of Discrete Analysis", 2001)

"The randomness of the card-shuffle is of course caused by our lack of knowledge of the precise procedure used to shuffle the cards. But that is outside the chosen system, so in our practical sense it is not admissible. If we were to change the system to include information about the shuffling rule – for example, that it is given by some particular computer code for pseudo-random numbers, starting with a given ‘seed value’ – then the system would look deterministic. Two computers of the same make running the same ‘random shuffle’ program would actually produce the identical sequence of top cards."(Ian Stewart, "Does God Play Dice: The New Mathematics of Chaos", 2002)

"There are endless examples of elaborate structures and apparently complex processes being generated through simple repetitive rules, all of which can be easily simulated on a computer. It is therefore tempting to believe that, because many complex patterns can be generated out of a simple algorithmic rule, all complexity is created in this way." (F David Peat, "From Certainty to Uncertainty", 2002)

"We build models to increase productivity, under the justified assumption that it's cheaper to manipulate the model than the real thing. Models then enable cheaper exploration and reasoning about some universe of discourse. One important application of models is to understand a real, abstract, or hypothetical problem domain that a computer system will reflect. This is done by abstraction, classification, and generalization of subject-matter entities into an appropriate set of classes and their behavior." (Stephen J Mellor, "Executable UML: A Foundation for Model-Driven Architecture", 2002) 

"Things are changing. Statisticians now recognize that computer scientists are making novel contributions while computer scientists now recognize the generality of statistical theory and methodology. Clever data mining algorithms are more scalable than statisticians ever thought possible. Formal statistical theory is more pervasive than computer scientists had realized." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"Computers bootstrap their own offspring, grow so wise and incomprehensible that their communiqués assume the hallmarks of dementia: unfocused and irrelevant to the barely-intelligent creatures left behind. And when your surpassing creations find the answers you asked for, you can't understand their analysis and you can't verify their answers. You have to take their word on faith." (Peter Watts, "Blindsight", 2006)

"In specific cases, we think by applying mental rules, which are similar to rules in computer programs. In most of the cases, however, we reason by constructing, inspecting, and manipulating mental models. These models and the processes that manipulate them are the basis of our competence to reason. In general, it is believed that humans have the competence to perform such inferences error-free. Errors do occur, however, because reasoning performance is limited by capacities of the cognitive system, misunderstanding of the premises, ambiguity of problems, and motivational factors. Moreover, background knowledge can significantly influence our reasoning performance. This influence can either be facilitation or an impedance of the reasoning process." (Carsten Held et al, "Mental Models and the Mind", 2006)

"A neural network is a particular kind of computer program, originally developed to try to mimic the way the human brain works. It is essentially a computer simulation of a complex circuit through which electric current flows." (Keith J Devlin & Gary Lorden, "The Numbers behind NUMB3RS: Solving crime with mathematics", 2007)

"The burgeoning field of computer science has shifted our view of the physical world from that of a collection of interacting material particles to one of a seething network of information. In this way of looking at nature, the laws of physics are a form of software, or algorithm, while the material world - the hardware - plays the role of a gigantic computer." (Paul C W Davies, "Laying Down the Laws", New Scientist, 2007)

"We tend to form mental models that are simpler than reality; so if we create represented models that are simpler than the actual implementation model, we help the user achieve a better understanding. […] Understanding how software actually works always helps someone to use it, but this understanding usually comes at a significant cost. One of the most significant ways in which computers can assist human beings is by putting a simple face on complex processes and situations. As a result, user interfaces that are consistent with users’ mental models are vastly superior to those that are merely reflections of the implementation model." (Alan Cooper et al,  "About Face 3: The Essentials of Interaction Design", 2007)

"An algorithm refers to a successive and finite procedure by which it is possible to solve a certain problem. Algorithms are the operational base for most computer programs. They consist of a series of instructions that, thanks to programmers’ prior knowledge about the essential characteristics of a problem that must be solved, allow a step-by-step path to the solution." (Diego Rasskin-Gutman, "Chess Metaphors: Artificial Intelligence and the Human Mind", 2009)

"We evolved to be good at learning and using rules of thumb, not at searching for ultimate causes and making fine distinctions. Still less did we evolve to spin out long chains of calculation that connect fundamental laws to observable consequences. Computers are much better at it!" (Frank Wilczek,"The Lightness of Being – Mass, Ether and the Unification of Forces", 2008) 

"Chess, as a game of zero sum and total information is, theoretically, a game that can be solved. The problem is the immensity of the search tree: the total number of positions surpasses the number of atoms in our galaxy. When there are few pieces on the board, the search space is greatly reduced, and the problem becomes trivial for computers’ calculation capacity." (Diego Rasskin-Gutman, "Chess Metaphors: Artificial Intelligence and the Human Mind", 2009)

"Generally, these programs fall within the techniques of reinforcement learning and the majority use an algorithm of temporal difference learning. In essence, this computer learning paradigm approximates the future state of the system as a function of the present state. To reach that future state, it uses a neural network that changes the weight of its parameters as it learns." (Diego Rasskin-Gutman, "Chess Metaphors: Artificial Intelligence and the Human Mind", 2009)

"From a historical viewpoint, computationalism is a sophisticated version of behaviorism, for it only interpolates the computer program between stimulus and response, and does not regard novel programs as brain creations. [...] The root of computationalism is of course the actual similarity between brains and computers, and correspondingly between natural and artificial intelligence. The two are indeed similar because the artifacts in question have been designed to perform analogs of certain brain functions. And the computationalist program is an example of the strategy of treating similars as identicals." (Mario Bunge, "Matter and Mind: A Philosophical Inquiry", 2010)

"[...] we also distinguish knowledge from information, because some pieces of information, such as questions, orders, and absurdities do not constitute knowledge. And also because computers process information but, since they lack minds, they cannot be said to know anything." (Mario Bunge, "Matter and Mind: A Philosophical Inquiry", 2010)

"System dynamics models have little impact unless they change the way people perceive a situation. A model must help to organize information in a more understandable way. A model should link the past to the present by showing how present conditions arose, and extend the present into persuasive alternative futures under a variety of scenarios determined by policy alternatives. In other words, a system dynamics model, if it is to be effective, must communicate with and modify the prior mental models. Only people's beliefs - that is, their mental models - will determine action. Computer models must relate to and improve mental models if the computer models are to fill an effective role." (Jay W Forrester, "Modeling for What Purpose?", The Systems Thinker Vol. 24 (2), 2013)

"A computer makes calculations quickly and correctly, but doesn’t ask if the calculations are meaningful or sensible. A computer just does what it is told." (Gary Smith, "Standard Deviations", 2014)

"Now think about the prospect of competition from computers instead of competition from human workers. On the supply side, computers are far more different from people than any two people are different from each other: men and machines are good at fundamentally different things. People have intentionality - we form plans and make decisions in complicated situations. We’re less good at making sense of enormous amounts of data. Computers are exactly the opposite: they excel at efficient data processing, but they struggle to make basic judgments that would be simple for any human." (Peter Thiel & Blake Masters, "Zero to One: Notes on Startups, or How to Build the Future", 2014)

"With fast computers and plentiful data, finding statistical significance is trivial. If you look hard enough, it can even be found in tables of random numbers." (Gary Smith, "Standard Deviations", 2014)

"Working an integral or performing a linear regression is something a computer can do quite effectively. Understanding whether the result makes sense - or deciding whether the method is the right one to use in the first place - requires a guiding human hand. When we teach mathematics we are supposed to be explaining how to be that guide. A math course that fails to do so is essentially training the student to be a very slow, buggy version of Microsoft Excel." (Jordan Ellenberg, "How Not to Be Wrong: The Power of Mathematical Thinking", 2014)

"The term data, unlike the related terms facts and evidence, does not connote truth. Data is descriptive, but data can be erroneous. We tend to distinguish data from information. Data is a primitive or atomic state (as in ‘raw data’). It becomes information only when it is presented in context, in a way that informs. This progression from data to information is not the only direction in which the relationship flows, however; information can also be broken down into pieces, stripped of context, and stored as data. This is the case with most of the data that’s stored in computer systems. Data that’s collected and stored directly by machines, such as sensors, becomes information only when it’s reconnected to its context."  (Stephen Few, "Signal: Understanding What Matters in a World of Noise", 2015) 

"[…] the usefulness of mathematics is by no means limited to finite objects or to those that can be represented with a computer. Mathematical concepts depending on the idea of infinity, like real numbers and differential calculus, are useful models for certain aspects of physical reality." (Alfred S Posamentier & Bernd Thaller, "Numbers: Their tales, types, and treasures", 2015) 

"The human mind isn’t a computer; it cannot progress in an orderly fashion down a list of candidate moves and rank them by a score down to the hundredth of a pawn the way a chess machine does. Even the most disciplined human mind wanders in the heat of competition. This is both a weakness and a strength of human cognition. Sometimes these undisciplined wanderings only weaken your analysis. Other times they lead to inspiration, to beautiful or paradoxical moves that were not on your initial list of candidates." (Garry Kasparov, "Deep Thinking", 2017)

"There are other problems with Big Data. In any large data set, there are bound to be inconsistencies, misclassifications, missing data - in other words, errors, blunders, and possibly lies. These problems with individual items occur in any data set, but they are often hidden in a large mass of numbers even when these numbers are generated out of computer interactions." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

27 November 2020

Data Warehousing: ETL - An Introduction


ETL (Extract, Transform, Load) processes, technologies or tools are about extracting data from one or more data sources via a set of queries, performing changes on the data via conversions, aggregations, mappings or other types of transformations, respectively loading the data into target tables or other type of repositories. Thus, an ETL process allows moving and transforming data between predefined data structures on an ad-hoc basis or as part of stable repetitive processes, which makes ETL ideal for data warehousing, data integrations, data migrations or similar scenarios. 

ETL Data Flow

Extract: The extraction of data is done typically based on SQL queries from relational databases or any OLEDB or ODBC-based data repositories including flat or MS Office files, though modern ETL tools can support other type of queries (CAML, XQuery, DAX) or even NoSQL architectures (Handoop). This allows addressing a wide range of requirements, the complexity of the logic depending on the functionality provided by the query languages, respectively the extraction functionality available.  

Transform: The transformation logic can be implemented based on the functionality provided by the ETL tool, and can involve after case any combination of aggregates, conditional splits, merges, lookups, multicasts, pivoting/unpivoting, cleansing, data conversions, sampling, mapping or any other transformations that can be performed on an in-transit dataset. On the other side, quite often the same can be achieved with the help of SQL-based manipulations directly in the extraction logic or later in the process. SQL can prove to be occasionally faster and more flexible than the transformations provided by the ETL tool, however despite the overlaps, the two approaches can complement each other when used adequately. 

Load: The load is usually just a dump of the data into one or more final or intermediary tables with predefined structures. Unless the data don’t match the data type, format or further defined constraints, the load seldom involve further challenges as long the solution was designed adequately. 

Within the logical model, extract, transform and load can be considered as process by themselves. Within the object model provided by the ETL tool, they are considered in the mentioned sequence within a data flow, which within a set of workflow constraints defines how the data move through the pipeline – the sequence of processing steps considered. The basic unit of work is the data flow and the workflow it belongs to, unit that can be encapsulated in one container for easier management or simply convenience. Several containers can be linked within a workflow to create more complex behavior. 

The data flows and workflow constraints, together with the supporting connections and containers form an ETL package, the main unit of work for encapsulating and running ETL logic. ETL packages are scheduled and run as fit for the purpose.

With the right design, these building blocks allow enough flexibility in handling ad-hoc requests or of building complex solutions. This involves decisions on how to partition the ETL packages, respectively the data flows, in which order they should be run, where and in which sequence the data should be transformed, how to handle exceptions, how to build eventually intermediary data repositories, how to handles audit requirements, and so on. Each of these choices can prove to be important. 

The knowledge of the ETL architecture and functionality is quintessential in providing the right solution for the problem considered, however once the basics were understood the challenges typically reside in understanding the source and/or target structures, the logical and physical entities available, identify the way the data can be partitioned horizontally or vertically, respectively what type of transformations are required for moving the data, as required by the solution. 

07 November 2020

DBMS: Event Streaming Databases (More of a Kafka’s Story)

Database Management

Event streaming architectures are architectures in which data are generated by different sources, and then processed, stored, analyzed, and acted upon in real-time by the different applications tapped into the data streams. An event streaming database is then a database that assures that its data are continuously up-to-date, providing specific functionality like management of connectors, materialized views and running queries on data-in-motion (rather than on static data). 

Reading about this type of technologies one can easily start fantasizing about the Web as a database in which intelligent agents can process streams of data in real-time, in which knowledge is derived and propagated over the networks in an infinitely and ever-growing flow in which the limits are hardly perceptible, in which the agents act as a mind disconnected in the end from the human intent. One is stroke by the fusing elements of realism and the fantastic aspects, more like in a Kafka’s story in which the metamorphosis of the technologies and social aspects can easily lead to absurd implications.

The link to Kafka was somehow suggested by Apache Kafka, an open-source distributed event streaming platform, which seems to lead the trends within this new-developing market. Kafka provides database functionality and guarantees the ACID (atomicity, concurrency, isolation, durability) properties of transactions while tapping into data streams. 

Data streaming is an appealing concept though it has some important challenges like data overload or over-flooding, the complexity derived from building specific (business) and integrity rules for processing the data, respectively for keeping data consistency and truth within the ever-growing and ever-changing flows. 

Data overload or over-flooding occurs when applications are not able to keep the pace with the volume of data or events fired with each change. Imagine the raindrops falling on a wide surface in which each millimeter or micrometer has its own rules for handling the raindrops and this at large scale. The raindrops must infiltrate into the surface to be processed and find their way to the beneath water flows, aggregating up to streams that could nurture seas or even oceans. Same metaphor can be applied to the data events, in which the data pervade applications accumulating in streams processed by engines to derive value. However heavy rains can easily lead to floods in which water aggregates at the surface. 

Business applications rely on predefined paths in which the flow of data is tidily linked to specific actions found themselves in processual sequences that reflect the material or cash flow. Any variation in the data flow from expectations will lead to inefficiencies and ultimately to chaos. Some benefit might be derived from data integrations between the business applications, however applications must be designed for this purpose and handle extreme behaviors like over-flooding. 

Data streams are maybe ideal for social media networks in which one broadcasts data through the networks and any consumer that can tap to the network can further use the respective data. We can see however the problems of nowadays social media – data, better said information, flow through the networks being changed as fit for purposes that can easily diverge from the initial intent. Moreover, information gets denatured, misused, overused to the degree that it overloads the networks, being more and more difficult to distinguish between reliable and non-reliable information. If common sense helps in the process of handling such information, not the same can be said about machines or applications. 

It will be challenging to deal with the vastness, vagueness, uncertainty, inconsistency, and deceit of the networks of the future, however data streaming more likely will have a future as long it can address such issues by design. 

06 November 2020

Graphical Representation: Reports vs. Data Visualizations

Graphical Representation

Considering visualizations, John Tukey remarked that ‘the greatest value of a picture is when it forces us to notice what we never expected to see’, which is not always the case for many of the graphics and visualizations available in organizations, typically in the form of simple charts and dashboards, quite often with no esthetics or meaning behind.

In general reports are needed as source for operational activities, in which the details in form of raw or aggregate data are important. As one moves further to the tactical or strategic aspects of a business, visualizations gain in importance especially when they allow encoding data and information, respectively variations, trends or relations in smaller places with minimal loss of information.

There are also different aspects of visualizations that need to be considered. Modern tools allow rapid visualization and interactive navigation of data across different variables which is great as long one knows what is searching for, which is not always the case.

There are junk charts in which the data drowns in graphical elements that bring no value to the reader, in extremis even distorting the message/meaning.

There are graphics/visualizations that attempt bringing together and encoding multiple variables in respect to a theme, and for which a ‘project’ is typically needed as data is not ad-hoc available, don’t have the desired quality or need further transformations to be ready for consumption. Good quality graphics/visualizations require time and a good understanding of the business, which are not necessarily available into the BI/Analytics teams, and unfortunately few organizations do something in that direction, ignoring typically such needs. In this type of environments is stressed the rapid availability of data for decision-making or action-relevant insight, which depends typically on the consumer.

The story-telling capabilities of graphics/visualizations are often exaggerated. Yes, they can tell a story though stories need to be framed into a context/problem, some background and further references need to be provided, while without detailed data the graphics/visualizations are just nice representations in which each consumer understands what he can.

In an ideal world the consumer and the ‘designer’ would work together to identify the important data for the theme considered, to find the appropriate level of detail, respectively the best form of encoding. Such attempts can stop at table-based representations (aka reports), respectively basic or richer forms of graphical representations. One can consider reports as an early stage of the visualization process, with the potential to derive move value when the data allow meaningful graphical representations. Unfortunately, the time, data and knowledge available seldom make this achievable.

In addition, a well-designed report can be used as basis for multiple purposes, while a graphic/visualization can enforce more limitations. Ideal would be when multiple forms of representation (including reports) are combined to harness the value of data. Navigations from visualizations to detailed data can be useful to understand what happens; learning and understanding the various aspects being an iterative process.

It’s also difficult to demonstrate the value of insight derived from visualizations, especially when graphical literacy goes behind the numeracy and statistical literacy - many consumers lacking the skills needed to evaluate numbers and statistics adequately. If for a good artistic movie you need an assistance to enjoy the show and understand the message(s) behind it, the same can be said also about good graphics/visualizations. Moreover, this requires creativity, abstraction-based thinking, and other capabilities to harness the value of representations.

Given the considerable volume of requirements related to the need of basis data, reports will continue to be on high demand in organizations. In exchange visualizations can complement them by providing insights otherwise not available.

Initially published on Medium as answer to a post on Reporting and Visualizations. 

31 October 2020

Data Warehousing: Data Lakes & other Puddles

Data Warehousing

One can consider a data lake as a repository of all of an organization’s data found in raw form, however this constraint might be too harsh as the data found at different levels of processing can be imported as well, for example the results of data mining or other Data Science techniques/methods can be considered as raw data for further processing.

In the initial definition provided by James Dixon, the difference between a data lake and a data mart/warehouse was expressed metaphorically as the transition from bottled water to lakes streamed (artificially) from various sources. It’s contrasted thus the objective-oriented, limited and single-purposed role of the data mart/warehouse in respect to the flow of data in nature that could be tapped and harnessed as desired. These are though metaphors intended to sensitize the buyer. Personally, I like to think of the data lake as an extension of the data infrastructure, in which the data mart or warehouse is integrant part. Imposing further constrains seem to have no benefit.  

Probably the most important characteristic of a data lake is that it makes the data of an organization discoverable and consumable, though from there to insight and other benefits is a long road and requires specific knowledge about the techniques used, as well about organization’s processes and data. Without this data lake-based solutions can lead to erroneous results, same as mixing several ingredients without having knowledge about their usage can lead to cooking experiments aloof from the art of cooking.

A characteristic of data is that they go through continuous change and have different timeliness, respectively degrees of quality in respect to the data quality dimensions implied and sources considered. Data need to reflect the reality at the appropriate level of detail and quality required by the processing application(s), this applying to data warehouses/marts as well data lake-based solutions.

Data found in raw form don’t necessarily represent the true/truth and don’t necessarily acquire a good quality no matter how much they are processed. Solutions need to be resilient in respect to the data they handle through their layers, independently of the data quality and transmission problems. Whether one talks about ETL, data migration or other types of data processing, keeping the data integrity at various levels and layers can be maybe the most important demand upon solutions.

Snapshots as moment-in-time recordings of tables, entities, sets of entities, datasets or whole databases, prove to be often the best mechanisms in keeping data integrity when this aspect is essential to their processing (e.g. data migrations, high-accuracy measurements). Unfortunately, the more systems are involved in the process and the broader span of the solutions over the sources, the more difficult it become to take such snapshots.

A SQL query’s output represents a snapshot of the data, therefore SQL-based solutions are usually appropriate for most of the business scenarios in which the characteristics of data (typically volume, velocity and/or variety) make their processing manageable. However, when the data are extracted by other means integrity is harder to obtain, especially when there’s no timestamp to allow data partitioning on a time scale, the handling of data integrity becoming thus in extremis a programmer’s task. In addition, getting snapshots of the data as they are changed can be a costly and futile task.

Further on, maintaining data integrity can prove to be a matter of design in respect not only to the processing of data, but also in respect to the source applications and the business processes they implement. The mastery of the underlying principles, techniques, patterns and methodologies, helps in the process of designing the right solutions.

Written as answer to a Medium post on data lakes and batch processing in data warehouses. 

30 October 2020

Data Science: Generalists vs Specialists in the Field of Data Science

Data Science

Division of labor favorizes the tasks done repeatedly, where knowledge of the broader processes is not needed, where aspects as creativity are needed only at a small scale. Division invaded the IT domains as tools, methodologies and demands increased in complexity, and therefore Data Science and BI/Analytics make no exception from this.

The scale of this development gains sometimes humorous expectations or misbelieves when one hears headhunters asking potential candidates whether they are upfront or backend experts when a good understanding of both aspects is needed for providing adequate results. The development gains tragicomical implications when one is limited in action only to a given area despite the extended expertise, or when a generalist seems to step on the feet of specialists, sometimes from the right entitled reasons. 

Headhunters’ behavior is rooted maybe in the poor understanding of the domain of expertise and implications of the job descriptions. It’s hard to understand how people sustain of having knowledge about a domain just because they heard the words flying around and got some glimpse of the connotations associated with the words. Unfortunately, this is extended to management and further in the business environment, with all the implications deriving from it. 

As Data Science finds itself at the intersection between Artificial Intelligence, Data Mining, Machine Learning, Neurocomputing, Pattern Recognition, Statistics and Data Processing, the center of gravity is hard to determine. One way of dealing with the unknown is requiring candidates to have a few years of trackable experience in the respective fields or in the use of a few tools considered as important in the respective domains. Of course, the usage of tools and techniques is important, though it’s a big difference between using a tool and understanding the how, when, why, where, in which ways and by what means a tool can be used effectively to create value. This can be gained only when one’s exposed to different business scenarios across industries and is a tough thing to demand from a profession found in its baby steps. 

Moreover, being a good data scientist involves having a deep insight into the businesses, being able to understand data and the demands associated with data – the various qualitative and quantitative aspects. Seeing the big picture is important in defining, approaching and solving problems. The more one is exposed to different techniques and business scenarios, with right understanding and some problem-solving skillset one can transpose and solve problems across domains. However, the generalist will find his limitations as soon a certain depth is reached, and the collaboration with a specialist is then required. A good collaboration between generalists and specialists is important in complex projects which overreach the boundaries of one person’s knowledge and skillset. 

Complexity is addressed when one can focus on the important characteristic of the problem, respectively when the models built can reflect the demands. The most important skillset besides the use of technical tools is the ability to model problems and root the respective problems into data, to elaborate theories and check them against reality. 

Complex problems can require specialization in certain fields, though seldom one problem is dependent only on one aspect of the business, as problems occur in overreaching contexts that span sometimes the borders of an organization. In addition, the ability to solve problems seem to be impacted by the diversity of the people involved into the task, sometimes even with backgrounds not directly related to organization’s activity. As in evolution, a team’s diversity is an important factor in achievement and learning, most gain being obtained when knowledge gets shared and harnessed beyond the borders of teams.

Written as answer to a Medium post on Data Science generalists vs specialists.

Data Science: Big Data vs. Business Strategies

Data Science

A strategy, independently on whether applied to organizations, chess, and other situations, allows identifying the moves having the most promising results from a range of possible moves that can change as one progresses into the game. Typically, the moves compete for same or similar resources, each move having at the respective time a potential value expressed in quantitative and/or qualitative terms, while the values are dependent on the information available about one’s and partners’ positions into the game. Therefore, a strategy is dependent on the decision-making processes in place, the information available about own business, respective the concurrence, as well about the game.

Big data is not about a technology but an umbrella term for multiple technologies that support in handling data with high volume, veracity, velocity or variety. The technologies attempt helping organizations in harnessing what is known as Big data (data having the before mentioned characteristics), for example by allowing answering to business questions, gaining insight into the business or market, improving decision-making. Through this Big data helps delivering value to businesses, at least in theory.

Big-data technologies can harness all data of an organization though this doesn’t imply that all data can provide value, especially when considered in respect to the investments made. Data bring value when they have the potential of uncovering hidden trends or (special) patterns of behavior, when they can be associated in new meaningful ways. Data that don’t reflect such characteristics are less susceptible of bringing value for an organization no matter how much one tries to process the respective data. However, looking at the data through multiple techniques can help organization get a better understanding of the data, though here is more about the processes of attempting understanding the data than the potential associated directly with the data.

Through active effort in understanding the data one becomes aware of the impact the quality of data have on business decisions, on how the business and processes are reflected in its data, how data can be used to control processes and focus on what matters. These are aspects that can be corroborated with the use of simple BI capabilities and don’t necessarily require more complex capabilities or tools. Therefore allowing employees the time to analyze and play with the data, can in theory have a considerable impact on how data are harnessed within an organization.

If an organization’s decision-making processes is dependent on actual data and insight (e.g. stock market) then the organization is more likely to profit from it. In opposition, organizations whose decision-making processes hand handle hours, days or months of latency in their data, then more likely the technologies will bring little value. Probably can be found similar examples for veracity, variety or similar characteristics consider under Big data.

The Big data technologies can make a difference especially when the extreme aspects of their characteristics can be harnessed. One talks about potential use which is different than the actual use. The use of technologies doesn’t equate with results, as knowledge about the tools and the business is mandatory to harness the respective tools. For example, insight doesn’t necessarily imply improved decision-making because it relies on people’s understanding about the business, about the numbers and models used.

That’s maybe one of the reasons why organization fail in deriving value from Big data. It’s great that companies invest in their Big data, Analytics/BI infrastructures, though without working actively in integrating the new insights/knowledge and upgrading people’s skillset, the effects will be under expectations. Investing in employees’ skillset is maybe one of the important decisions an organization can make as part of its strategy.

Written as answer to a Medium post on Big data and business strategies. 

Related Posts Plugin for WordPress, Blogger...