Showing posts with label Data science. Show all posts
Showing posts with label Data science. Show all posts

17 September 2024

#️⃣Software Engineering: Mea Culpa (Part V: All-Knowing Developers are Back in Demand?)

Software Engineering Series

I’ve been reading many job descriptions lately related to my experience and curiously or not I observed that many organizations look for developers with Microsoft Dynamics experience in the CRM, respectively Finance and Operations (F&O) and Business Central (BC) areas. It’s a good sign that the adoption of Microsoft solutions for CRM and ERP increases, especially when one considers the progress made in the BI and AI areas with the introduction of Microsoft Fabric, which gives Microsoft a considerable boost. Conversely, it seems that the "developers are good for everything" syntagma is back, at least from what one reads in job descriptions. 

Of course, it’s useful to have an inhouse developer who can address all the aspects of an implementation, though that’s a lot to ask considering the different non-programming areas that need to be addressed. It’s true that a developer with experience can handle Requirements, Data and Process Management, respectively Data Migrations and Business Intelligence topics, though if one considers that each of the topics can easily become a full-time job before, during and post-project implementations. I’ve been there and I (hopefully) know that the jobs imply. Even if an experienced programmer can easily handle the different aspects, there will be also times when all the topics combined will be too much for a person!

It's not a novelty that job descriptions are treated like Christmas lists, but it’s difficult to differentiate between essential and nonessential skillset. I read many jobs descriptions lately in which among a huge list of demands, one of the requirements is to program in the F&O framework, sign that D365 programmers are in high demand. I worked for many years as programmer and Software Engineer, respectively in the BI area, where SQL and non-SQL code is needed. Even if I can understand the code in F&O, does it make sense to learn now to program in X++ and the whole framework? 

It's never too late to learn new tricks, respectively another programming language and/or framework. It even helps to provide better solutions in other areas, though frankly I would invest my time in other areas, and AI-related topics like AI prompting or Data Science seem to be more interesting in the long term, especially when they are already in demand!

There seems to be a tendency for Data Science professionals to do everything, building their own solutions, ignoring the experience accumulated respectively the data models built in BI and Data Analytics areas, as if the topics and data models are unrelated! It’s also true that AI-modeling comes with its own requirements in what concerns data modeling (e.g. translating non-numeric to numeric values), though I believe that common ground can be found!

Similarly, the notebook-based programming seems to replicate logic in each solution, which occasionally makes sense, though personally I wouldn’t recommend it as practice! The other day, I was looking at code developed in Python to mimic the joining of tables, when a view with the same could be easier (re)used, maintained, read and probably more efficient, even if different engines will be used. It will be interesting to see how the mix of spaghetti solutions will evolve over time. There are developers already complaining of the number of objects used in the process by building logic for each layer from the medallion architecture! Even if it makes sense from architectural considerations, it will become a nightmare in time.

One can wonder also about nomenclature used – Data Engineer or Prompt Engineering for the simple manipulation of data between structures in data transformations, respectively for structuring the prompts for AI. I believe that engineering involves more than this, no matter the context! 

Previous Post <<||>> Next Post

17 February 2024

🧭Business Intelligence: A Software Engineer's Perspective I (Houston, we have a Problem!)

Business Intelligence Series
Business Intelligence Series

One of the critics addressed to the BI/Data Analytics, Data Engineering and even Data Science fields is their resistance to applying Software Engineering (SE) methods in practice. SE can be regarded as the application of sound methods, methodologies, techniques, principles, and practices to obtain high quality economic software in a reproducible manner. At minimum, should be applied SE techniques and practices proven to work, for example the use of best practices, reference technologies, standardized processes for requirements gathering and management, etc. This doesn't mean that one should apply the full extent of SE but consider a minimum that makes sense to adopt.

Unfortunately, the creation of data artifacts (queries, reports, data models, data pipelines, data visualizations, etc.) as process seem to be done after the principle of least action, though least action means here the minimum interaction to push pieces on a board rather than getting the things done. At high level, the process is as follows: get the requirements, build something, present results, get more requirements, do changes, present the results, and the process is repeated ad infinitum.

Given that data artifact's creation finds itself at the intersection of two or more knowledge areas in which knowledge is exchanged in several iterations between the parties involved until a common ground is achieved, this process is totally inefficient from multiple perspectives. First of all, it takes considerably more time than planned to reach a solution, resources being wasted in the process, multiple forms of waste being involved. Secondly, the exchange and retention of knowledge resulting from the process is minimal, mainly on a need by basis. This might look as an efficient approach on the short term, but is inefficient overall.

BI reflects the general issues from SE - most of the issues can be traced back to requirements - if the requirements are incorrect and there's no magic involved in between, then one can't expect for the solution to be correct. The bigger the difference between the initial and final requirements elicited in the process, the more resources are wasted. The more time passes between the start of the development phase and the time a solution is presented to the customer, the longer it takes to build the final solution. Same impact have the time it takes to establish a common ground and other critical factors for success involved in the process.

One can address these issues through better requirements elicitation, rapid prototyping, the use of agile methodologies and similar approaches, though the general feeling is that even if they bring improvements, they don't address the root causes - lack of data literacy skills, lack of knowledge about the business, lack of maturity in planning and executing tasks, the inexistence of well-designed processes and procedures, respectively the lack of an engineering mindset.

These inefficiencies have low impact when building a report occasionally, though they accumulate and tend to create systemic issues in what concerns the overall BI effort. They are addressed locally by experts and in general through a strategic approach like the elaboration of a BI strategy, though organizations seldom pay attention to them. Some organizations consider that they are automatically addressed as part of the data culture, though data culture focuses in general on data literacy and not on the whole set of assumptions mentioned above.

An experienced data professional sees more likely the inefficiencies, tries to address them locally in his interactions with the various stakeholders, he/she can build a business case for addressing them, though it depends on organizations to recognize that they have a problem, respective address the inefficiencies in a strategic and systemic manner!

Previous Post <<||>> Next Post

13 February 2024

🧭Business Intelligence: A One-Man Show (Part V: Focus on the Foundation)

Business Intelligence Suite
Business Intelligence Suite

I tend to agree that one person can't do anymore "everything in the data space", as Christopher Laubenthal put it his article on the topic [1]. He seems to catch the essence of some of the core data roles found in organizations. Summarizing these roles, data architecture is about designing and building a data infrastructure, data engineering is about moving data, database administration is mainly about managing databases, data analysis is about assisting the business with data and reports, information design is about telling stories, while data science can be about studying the impact of various components on the data. 

However, I find his analogy between a college's functional structure and the core data roles as poorly chosen from multiple perspectives, even if both are about building an infrastructure of some type. 

Firstly, the two constructions have different foundations. Data exists in a an organization also without data architects, data engineers or data administrators (DBAs)! It's enough to buy one or more information systems functioning as islands and reporting needs will arise. The need for a data architect might come when the systems need to be integrated or maybe when a data warehouse needs to be build, though many organizations are still in business without such constructs. While for the others, the more complex the integrations, the bigger the need for a Data Architect. Conversely, some systems can be integrated by design and such capabilities might drive their selection.

Data engineering is needed mainly in the context of the cloud, respectively of data lake-based architectures, where data needs to be moved, processed and prepared for consumption. Conversely, architectures like Microsoft Fabric minimize data movement, the focus being on data processing, the successive transformations it needs to suffer in moving from bronze to the gold layer, respectively in creating an organizational semantical data model. The complexity of the data processing is dependent on data' structuredness, quality and other data characteristics. 

As I mentioned before, modern databases, including the ones in the cloud, reduce the need for DBAs to a considerable degree. Unless the volume of work is big enough to consider a DBA role as an in-house resource, organizations will more likely consider involving a service provider and a contingent to cover the needs. 

Having in-house one or more people acting under the Data Analyst role, people who know and understand the business, respectively the data tools used in the process, can go a long way. Moreover, it's helpful to have an evangelist-like resource in house, a person who is able to raise awareness and knowhow, help diffuse knowledge about tools, techniques, data, results, best practices, respectively act as a mentor for the Data Analyst citizens. From my point of view, these are the people who form the data-related backbone (foundation) of an organization and this is the minimum of what an organization should have!

Once this established, one can build data warehouses, data integrations and other support architectures, respectively think about BI and Data strategy, Data Governance, etc. Of course, having a Chief Data Officer and a Data Strategy in place can bring more structure in handling the topics at the various levels - strategical, tactical, respectively operational. In constructions one starts with a blueprint and a data strategy can have the same effect, if one knows how to write it and implement it accordingly. However, the strategy is just a tool, while the data-knowledgeable workers are the foundation on which organizations should build upon!

"Build it and they will come" philosophy can work as well, though without knowledgeable and inquisitive people the philosophy has high chances to fail.

Previous Post <<||>> Next Post

Resources:
[1] Christopher Laubenthal (2024) "Why One Person Can’t Do Everything In Data" (link)

🧭Business Intelligence: A One-Man Show (Part IV: Data Roles between Past and Future)

Business Intelligence Series
Business Intelligence Series

Databases nowadays are highly secure, reliable and available to a degree that reduces the involvement of DBAs to a minimum. The more databases and servers are available in an organization, and the older they are, the bigger the need for dedicated resources to manage them. The number of DBAs involved tends to be proportional with the volume of work required by the database infrastructure. However, if the infrastructure is in the cloud, managed by the cloud providers, it's enough to have a person in the middle who manages the communication between cloud provider(s) and the organization. The person doesn't even need to be a DBA, even if some knowledge in the field is usually recommended.

The requirement for a Data Architect comes when there are several systems in place and there're multiple projects to integrate or build around the respective systems. It'a also the question of what drives the respective requirement - is it the knowledge of data architectures, the supervision of changes, and/or the review of technical documents? The requirement is thus driven by the projects in progress and those waiting in the pipeline. Conversely, if all the systems are in the cloud, their integration is standardized or doesn't involve much architectural knowledge, the role becomes obsolete or at least not mandatory. 

The Data Engineer role is a bit more challenging to define because it appeared in the context of cloud-based data architectures. It seems to be related to the data movement via ETL/ELT pipelines and of data processing and preparation for the various needs. Data modeling or data presentation knowledge isn't mandatory even if ideal. The role seems to overlap with the one of a Data Warehouse professional, be it a simple architect or developer. Role's knowhow depends also on the tools involved, because one thing is to build a solution based on a standard SQL Server, and another thing to use dedicated layers and architectures for the various purposes. Engineers' number should be proportional with the number of data entities involved.

Conversely, the existence of solutions that move and process the data as needed, can reduce the volume of work. Moreover, the use of AI-driven tools like Copilot might shift the focus from data to prompt engineering. 

The Data Analyst role is kind of a Cinderella - it can involve upon case everything from requirements elicitation to reports writing and results' interpretation, respectively from data collection and data modeling to data visualization. If you have a special wish related to your data, just add it to the role! Analysts' number should be related to the number of issues existing in organization where the collection and processing of data could make a difference. Conversely, the Data Citizen, even if it's not a role but a desirable state of art, could absorb in theory the Data Analyst role.

The Data Scientist is supposed to reveal the gems of knowledge hidden in the data by using Machine Learning, Statistics and other magical tools. The more data available, the higher the chances of finding something, even if probably statistically insignificant or incorrect. The role makes sense mainly in the context of big data, even if some opportunities might be available at smaller scales. Scientists' number depends on the number of projects focused on the big questions. Again, one talks about the Data Scientist citizen. 

The Information Designer role seems to be more about data visualization and presentation. It makes sense in the organizations that rely heavily on visual content. All the other organizations can rely on the default settings of data visualization tools, independently on whether AI is involved or not. 

Previous Post <<||>> Next Post

27 January 2024

Data Science: Back to the Future I (About Beginnings)

Data Science
Data Science Series

I've attended again, after several years, a webcast on performance improvement in SQL Server with Claudio Silva, “Writing T-SQL code for the engine, not for you”. The session was great and I really enjoyed it! I recommend it to any data(base) professional, even if some of the scenarios presented should be known already.

It's strange to see the same topics from 20-25 years ago reappearing over and over again despite the advancements made in the area of database engines. Each version of SQL Server brought something new in what concerns the performance, though without some good experience and understanding of the basic optimization and troubleshooting techniques there's little overall improvement for the average data professional in terms of writing and tuning queries!

Especially with the boom of Data Science topics, the volume of material on SQL increased considerably and many discover how easy is to write queries, even if the start might be challenging for some. Writing a query is easy indeed, though writing a performant query requires besides the language itself also some knowledge about the database engine and the various techniques used for troubleshooting and optimization. It's not about knowing in advance what the engine will do - the engine will often surprise you - but about knowing what techniques work, in what cases, which are their advantages and disadvantages, respectively on how they might impact the processing.

Making a parable with writing literature, it's not enough to speak a language; one needs more for becoming a writer, and there are so many levels of mastery! However, in database world even if creativity is welcomed, its role is considerable diminished by the constraints existing in the database engine, the problems to be solved, the time and the resources available. More important, one needs to understand some of the rules and know how to use the building blocks to solve problems and build reliable solutions.

The learning process for newbies focuses mainly on the language itself, while the exposure to complexity is kept to a minimum. For some learners the problems start when writing queries based on multiple tables -  what joins to use, in what order, how to structure the queries, what database objects to use for encapsulating the code, etc. Even if there are some guidelines and best practices, the learner must walk the path and experiment alone or in an organized setup.

In university courses the focus is on operators algebras, algorithms, on general database technologies and architectures without much hand on experience. All is too theoretical and abstract, which is acceptable for research purposes,  but not for the contact with the real world out there! Probably some labs offer exposure to real life scenarios, though what to cover first in the few hours scheduled for them?

This was the state of art when I started to learn SQL a quarter century ago, and besides the current tendency of cutting corners, the increased confidence from doing some tests, and the eagerness of shouting one’s shaking knowledge and more or less orthodox ideas on the various social networks, nothing seems to have changed! Something did change – the increased complexity of the problems to solve, and, considering the recent technological advances, one can afford now an AI learn buddy to write some code for us based on the information provided in the prompt.

This opens opportunities for learning and growth. AI can be used in the learning process by providing additional curricula for learners to dive deeper in some topics. Moreover, it can help us in time to address the challenges of the ever-increase complexity of the problems.

29 March 2021

Notes: Team Data Science Process (TDSP)

Team Data Science Process (TDSP)
Acronyms:
Artificial Intelligence (AI)
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Data Mining (DM)
Knowledge Discovery in Databases (KDD)
Team Data Science Process (TDSP) 
Version Control System (VCS)
Visual Studio Team Services (VSTS)

Resources:
[1] Microsoft Azure (2020) What is the Team Data Science Process? [source]
[2] Microsoft Azure (2020) The business understanding stage of the Team Data Science Process lifecycle [source]
[3] Microsoft Azure (2020) Data acquisition and understanding stage of the Team Data Science Process [source]
[4] Microsoft Azure (2020) Modeling stage of the Team Data Science Process lifecycle [source
[5] Microsoft Azure (2020) Deployment stage of the Team Data Science Process lifecycle [source]
[6] Microsoft Azure (2020) Customer acceptance stage of the Team Data Science Process lifecycle [source]

31 October 2020

🧊Data Warehousing: Architecture (Part III: Data Lakes & other Puddles)

Data Warehousing

One can consider a data lake as a repository of all of an organization’s data found in raw form, however this constraint might be too harsh as the data found at different levels of processing can be imported as well, for example the results of data mining or other Data Science techniques/methods can be considered as raw data for further processing.

In the initial definition provided by James Dixon, the difference between a data lake and a data mart/warehouse was expressed metaphorically as the transition from bottled water to lakes streamed (artificially) from various sources. It’s contrasted thus the objective-oriented, limited and single-purposed role of the data mart/warehouse in respect to the flow of data in nature that could be tapped and harnessed as desired. These are though metaphors intended to sensitize the buyer. Personally, I like to think of the data lake as an extension of the data infrastructure, in which the data mart or warehouse is integrant part. Imposing further constrains seem to have no benefit.  

Probably the most important characteristic of a data lake is that it makes the data of an organization discoverable and consumable, though from there to insight and other benefits is a long road and requires specific knowledge about the techniques used, as well about organization’s processes and data. Without this data lake-based solutions can lead to erroneous results, same as mixing several ingredients without having knowledge about their usage can lead to cooking experiments aloof from the art of cooking.

A characteristic of data is that they go through continuous change and have different timeliness, respectively degrees of quality in respect to the data quality dimensions implied and sources considered. Data need to reflect the reality at the appropriate level of detail and quality required by the processing application(s), this applying to data warehouses/marts as well data lake-based solutions.

Data found in raw form don’t necessarily represent the true/truth and don’t necessarily acquire a good quality no matter how much they are processed. Solutions need to be resilient in respect to the data they handle through their layers, independently of the data quality and transmission problems. Whether one talks about ETL, data migration or other types of data processing, keeping the data integrity at various levels and layers can be maybe the most important demand upon solutions.

Snapshots as moment-in-time recordings of tables, entities, sets of entities, datasets or whole databases, prove to be often the best mechanisms in keeping data integrity when this aspect is essential to their processing (e.g. data migrations, high-accuracy measurements). Unfortunately, the more systems are involved in the process and the broader span of the solutions over the sources, the more difficult it become to take such snapshots.

A SQL query’s output represents a snapshot of the data, therefore SQL-based solutions are usually appropriate for most of the business scenarios in which the characteristics of data (typically volume, velocity and/or variety) make their processing manageable. However, when the data are extracted by other means integrity is harder to obtain, especially when there’s no timestamp to allow data partitioning on a time scale, the handling of data integrity becoming thus in extremis a programmer’s task. In addition, getting snapshots of the data as they are changed can be a costly and futile task.

Further on, maintaining data integrity can prove to be a matter of design in respect not only to the processing of data, but also in respect to the source applications and the business processes they implement. The mastery of the underlying principles, techniques, patterns and methodologies, helps in the process of designing the right solutions.

Note:
Written as answer to a Medium post on data lakes and batch processing in data warehouses. 

30 October 2020

Data Science: Data Strategy (Part II: Generalists vs Specialists in the Field)

Data Science

Division of labor favorizes the tasks done repeatedly, where knowledge of the broader processes is not needed, where aspects as creativity are needed only at a small scale. Division invaded the IT domains as tools, methodologies and demands increased in complexity, and therefore Data Science and BI/Analytics make no exception from this.

The scale of this development gains sometimes humorous expectations or misbelieves when one hears headhunters asking potential candidates whether they are upfront or backend experts when a good understanding of both aspects is needed for providing adequate results. The development gains tragicomical implications when one is limited in action only to a given area despite the extended expertise, or when a generalist seems to step on the feet of specialists, sometimes from the right entitled reasons. 

Headhunters’ behavior is rooted maybe in the poor understanding of the domain of expertise and implications of the job descriptions. It’s hard to understand how people sustain of having knowledge about a domain just because they heard the words flying around and got some glimpse of the connotations associated with the words. Unfortunately, this is extended to management and further in the business environment, with all the implications deriving from it. 

As Data Science finds itself at the intersection between Artificial Intelligence, Data Mining, Machine Learning, Neurocomputing, Pattern Recognition, Statistics and Data Processing, the center of gravity is hard to determine. One way of dealing with the unknown is requiring candidates to have a few years of trackable experience in the respective fields or in the use of a few tools considered as important in the respective domains. Of course, the usage of tools and techniques is important, though it’s a big difference between using a tool and understanding the how, when, why, where, in which ways and by what means a tool can be used effectively to create value. This can be gained only when one’s exposed to different business scenarios across industries and is a tough thing to demand from a profession found in its baby steps. 

Moreover, being a good data scientist involves having a deep insight into the businesses, being able to understand data and the demands associated with data – the various qualitative and quantitative aspects. Seeing the big picture is important in defining, approaching and solving problems. The more one is exposed to different techniques and business scenarios, with right understanding and some problem-solving skillset one can transpose and solve problems across domains. However, the generalist will find his limitations as soon a certain depth is reached, and the collaboration with a specialist is then required. A good collaboration between generalists and specialists is important in complex projects which overreach the boundaries of one person’s knowledge and skillset. 

Complexity is addressed when one can focus on the important characteristic of the problem, respectively when the models built can reflect the demands. The most important skillset besides the use of technical tools is the ability to model problems and root the respective problems into data, to elaborate theories and check them against reality. 

Complex problems can require specialization in certain fields, though seldom one problem is dependent only on one aspect of the business, as problems occur in overreaching contexts that span sometimes the borders of an organization. In addition, the ability to solve problems seem to be impacted by the diversity of the people involved into the task, sometimes even with backgrounds not directly related to organization’s activity. As in evolution, a team’s diversity is an important factor in achievement and learning, most gain being obtained when knowledge gets shared and harnessed beyond the borders of teams.

Note:
Written as answer to a Medium post on Data Science generalists vs specialists.

14 January 2019

🔬Data Science: Evolutionary Algorithm (Definitions)

"An Evolutionary Algorithm (EA) is a general class of fitting or maximization techniques. They all maintain a pool of structures or models that can be mutated and evolve. At every stage in the algorithm, each model is graded and the better models are allowed to reproduce or mutate for the next round. Some techniques allow the successful models to crossbreed. They are all motivated by the biologic process of evolution. Some techniques are asexual (so, there is no crossbreeding between techniques) while others are bisexual, allowing successful models to swap ''genetic' information. The asexual models allow a wide variety of different models to compete, while sexual methods require that the models share a common 'genetic' code." (William J Raynor Jr., "The International Dictionary of Artificial Intelligence", 1999)

"Meta-heuristic optimization approach inspired by natural evolution, which begins with potential solution models, then iteratively applies algorithms to find the fittest models from the set to serve as inputs to the next iteration, ultimately leading to a sub-optimal solution which is close to the optimal one." (Gilles Lebrun et al, "EA Multi-Model Selection for SVM", 2009)

"Evolutionary algorithms are search methods that can be used for solving optimization problems. They mimic working principles from natural evolution by employing a population–based approach, labeling each individual of the population with a fitness and including elements of random, albeit the random is directed through a selection process." (Ivan Zelinka & Hendrik Richter, "Evolutionary Algorithms for Chaos Researchers", Studies in Computational Intelligence Vol. 267, 2010)

"Population-based optimization algorithms in which each member of the population represents a candidate solution. In an iterative process the population members evolve and are then evaluated by a fitness function. Genetic Algorithms and Particle Swarm Optimization are examples of evolutionary algorithms." (Efstathios Kirkos, "Composite Classifiers for Bankruptcy Prediction", 2014)

"A collective term for all variants of (probabilistic) optimization and approximation algorithms that are inspired by Darwinian evolution. Optimal states are approximated by successive improvements based on the variation-selection paradigm. Thereby, the variation operators produce genetic diversity and the selection directs the evolutionary search." (Harish Garg, "A Hybrid GA-GSA Algorithm for Optimizing the Performance of an Industrial System by Utilizing Uncertain Data", 2015)

30 December 2018

🔭Data Science: Information (Just the Quotes)

"Probability, however, is not something absolute, [it is] drawn from certain information which, although it does not suffice to resolve the problem, nevertheless ensures that we judge correctly which of the two opposites is the easiest given the conditions known to us." (Gottfried W Leibniz, "Forethoughts for an encyclopaedia or universal science", cca. 1679)

"Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it." (Samuel Johnson, 1775)

"What is called science today consists of a haphazard heap of information, united by nothing, often utterly unnecessary, and not only failing to present one unquestionable truth, but as often as not containing the grossest errors, today put forward as truths, and tomorrow overthrown." (Leo Tolstoy, "What Is Art?", 1897)

"There can be no unique probability attached to any event or behaviour: we can only speak of ‘probability in the light of certain given information’, and the probability alters according to the extent of the information." (Sir Arthur S Eddington, "The Nature of the Physical World" , 1928)

"As words are not the things we speak about, and structure is the only link between them, structure becomes the only content of knowledge. If we gamble on verbal structures that have no observable empirical structures, such gambling can never give us any structural information about the world. Therefore such verbal structures are structurally obsolete, and if we believe in them, they induce delusions or other semantic disturbances." (Alfred Korzybski, "Science and Sanity", 1933)

"Much of the waste in business is due to lack of information. And when the information is available, waste often occurs because of lack of application or because of misapplication." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1938)

"Upon this gifted age, in its dark hour, rains from the sky a meteoric shower of facts […] they lie, unquestioned, uncombined. Wisdom enough to leach us of our ill is daily spun; but there exists no loom to weave it into a fabric." (Edna St. Vincent Millay, "Huntsman, What Quarry?", 1939)

"Just as entropy is a measure of disorganization, the information carried by a set of messages is a measure of organization. In fact, it is possible to interpret the information carried by a message as essentially the negative of its entropy, and the negative logarithm of its probability. That is, the more probable the message, the less information it gives. Clichés, for example, are less illuminating than great poems." (Norbert Wiener, "The Human Use of Human Beings", 1950)

"Knowledge is not something which exists and grows in the abstract. It is a function of human organisms and of social organization. Knowledge, that is to say, is always what somebody knows: the most perfect transcript of knowledge in writing is not knowledge if nobody knows it. Knowledge however grows by the receipt of meaningful information - that is, by the intake of messages by a knower which are capable of reorganising his knowledge." (Kenneth E Boulding, "General Systems Theory - The Skeleton of Science", Management Science Vol. 2 (3), 1956)

"We have overwhelming evidence that available information plus analysis does not lead to knowledge. The management science team can properly analyse a situation and present recommendations to the manager, but no change occurs. The situation is so familiar to those of us who try to practice management science that I hardly need to describe the cases." (C West Churchman, "Managerial acceptance of scientific recommendations", California Management Review Vol 7, 1964)

"This is the key of modern science and it was the beginning of the true understanding of Nature - this idea to look at the thing, to record the details, and to hope that in the information thus obtained might lie a clue to one or another theoretical interpretation." (Richard P Feynman, "The Character of Physical Law", 1965)

"[...] 'information' is not a substance or concrete entity but rather a relationship between sets or ensembles of structured variety." (Walter F Buckley, "Sociology and modern systems theory", 1967)

"There are as many types of questions as components in the information." (Jacques Bertin, Semiology of graphics [Semiologie Graphique], 1967)

"The idea of knowledge as an improbable structure is still a good place to start. Knowledge, however, has a dimension which goes beyond that of mere information or improbability. This is a dimension of significance which is very hard to reduce to quantitative form. Two knowledge structures might be equally improbable but one might be much more significant than the other." (Kenneth E Boulding, "Beyond Economics: Essays on Society", 1968)

"When action grows unprofitable, gather information; when information grows unprofitable, sleep. (Ursula K Le Guin, "The Left Hand of Darkness", 1969)

"What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it." (Herbert Simon, "Computers, Communications and the Public Interest", 1971)

"What we mean by information - the elementary unit of information - is a difference which makes a difference, and it is able to make a difference because the neural pathways along which it travels and is continually transformed are themselves provided with energy. The pathways are ready to be triggered. We may even say that the question is already implicit in them." (Gregory Bateson, "Steps to an Ecology of Mind", 1972)

"Science gets most of its information by the process of reductionism, exploring the details, then the details of the details, until all the smallest bits of the structure, or the smallest parts of the mechanism, are laid out for counting and scrutiny. Only when this is done can the investigation be extended to encompass the whole organism or the entire system. So we say. Sometimes it seems that we take a loss, working this way." (Lewis Thomas, "The Medusa and the Snail: More Notes of a Biology Watcher", 1974)

"Science is not a heartless pursuit of objective information. It is a creative human activity, its geniuses acting more as artists than information processors. Changes in theory are not simply the derivative results of the new discoveries but the work of creative imagination influenced by contemporary social and political forces." (Stephen J Gould, "Ever Since Darwin: Reflections in Natural History", 1977)

"Data, seeming facts, apparent asso­ciations-these are not certain knowledge of something. They may be puzzles that can one day be explained; they may be trivia that need not be explained at all. (Kenneth Waltz, "Theory of International Politics", 1979)

"To a considerable degree science consists in originating the maximum amount of information with the minimum expenditure of energy. Beauty is the cleanness of line in such formulations along with symmetry, surprise, and congruence with other prevailing beliefs." (Edward O Wilson, "Biophilia", 1984)

"Knowledge is the appropriate collection of information, such that it's intent is to be useful. Knowledge is a deterministic process. When someone 'memorizes' information (as less-aspiring test-bound students often do), then they have amassed knowledge. This knowledge has useful meaning to them, but it does not provide for, in and of itself, an integration such as would infer further knowledge." (Russell L Ackoff, "Towards a Systems Theory of Organization", 1985)

"Information is data that has been given meaning by way of relational connection. This 'meaning' can be useful, but does not have to be. In computer parlance, a relational database makes information from the data stored within it." (Russell L Ackoff, "Towards a Systems Theory of Organization", 1985)

"Probability plays a central role in many fields, from quantum mechanics to information theory, and even older fields use probability now that the presence of 'noise' is officially admitted. The newer aspects of many fields start with the admission of uncertainty." (Richard W Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985)

"Probabilities are summaries of knowledge that is left behind when information is transferred to a higher level of abstraction." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Network of Plausible, Inference", 1988)

"Information exists. It does not need to be perceived to exist. It does not need to be understood to exist. It requires no intelligence to interpret it. It does not have to have meaning to exist. It exists." (Tom Stonier, "Information and the Internal Structure of the Universe: An Exploration into Information Physics", 1990)

"What about confusing clutter? Information overload? Doesn't data have to be ‘boiled down’ and  ‘simplified’? These common questions miss the point, for the quantity of detail is an issue completely separate from the difficulty of reading. Clutter and confusion are failures of design, not attributes of information." (Edward R Tufte, "Envisioning Information", 1990)

"Knowledge is theory. We should be thankful if action of management is based on theory. Knowledge has temporal spread. Information is not knowledge. The world is drowning in information but is slow in acquisition of knowledge. There is no substitute for knowledge." (William E Deming, "The New Economics for Industry, Government, Education", 1993)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"[Schemata are] knowledge structures that represent objects or events and provide default assumptions about their characteristics, relationships, and entailments under conditions of incomplete information." (Paul J DiMaggio, "Culture and Cognition", Annual Review of Sociology No. 23, 1997)

"When it comes to information, it turns out that one can have too much of a good thing. At a certain level of input, the law of diminishing returns takes effect; the glut of information no longer adds to our quality of life, but instead begins to cultivate stress, confusion, and even ignorance." (David Shenk, "Data Smog", 1997)

"Each element in the system is ignorant of the behavior of the system as a whole, it responds only to information that is available to it locally. This point is vitally important. If each element ‘knew’ what was happening to the system as a whole, all of the complexity would have to be present in that element." (Paul Cilliers, "Complexity and Postmodernism: Understanding Complex Systems" , 1998)

"Complexity is that property of a model which makes it difficult to formulate its overall behaviour in a given language, even when given reasonably complete information about its atomic components and their inter-relations." (Bruce Edmonds, "Syntactic Measures of Complexity", 1999)

"A model isolates one or a few causal connections, mechanisms, or processes, to the exclusion of other contributing or interfering factors - while in the actual world, those other factors make their effects felt in what actually happens. Models may seem true in the abstract, and are false in the concrete. The key issue is about whether there is a bridge between the two, the abstract and the concrete, such that a simple model can be relied on as a source of relevantly truthful information about the complex reality." (Uskali Mäki, "Fact and Fiction in Economics: Models, Realism and Social Construction", 2002)

"Entropy is not about speeds or positions of particles, the way temperature and pressure and volume are, but about our lack of information." (Hans C von Baeyer," Information, The New Language of Science", 2003)

"The use of computers shouldn't ignore the objectives of graphics, that are: 
 1) Treating data to get information. 
 2) Communicating, when necessary, the information obtained." (Jacques Bertin, [interview] 2003)

"There is no end to the information we can use. A 'good' map provides the information we need for a particular purpose - or the information the mapmaker wants us to have. To guide us, a map’s designers must consider more than content and projection; any single map involves hundreds of decisions about presentation." (Peter Turchi, "Maps of the Imagination: The writer as cartographer", 2004)

"While in theory randomness is an intrinsic property, in practice, randomness is incomplete information." (Nassim N Taleb, "The Black Swan", 2007)

"Put simply, statistics is a range of procedures for gathering, organizing, analyzing and presenting quantitative data. […] Essentially […], statistics is a scientific approach to analyzing numerical data in order to enable us to maximize our interpretation, understanding and use. This means that statistics helps us turn data into information; that is, data that have been interpreted, understood and are useful to the recipient. Put formally, for your project, statistics is the systematic collection and analysis of numerical data, in order to investigate or discover relationships among phenomena so as to explain, predict and control their occurrence." (Reva B Brown & Mark Saunders, "Dealing with Statistics: What You Need to Know", 2008)

"Access to more information isn’t enough - the information needs to be correct, timely, and presented in a manner that enables the reader to learn from it. The current network is full of inaccurate, misleading, and biased information that often crowds out the valid information. People have not learned that 'popular' or 'available' information is not necessarily valid." (Gene Spafford, 2010) 

"We face danger whenever information growth outpaces our understanding of how to process it. The last forty years of human history imply that it can still take a long time to translate information into useful knowledge, and that if we are not careful, we may take a step back in the meantime." (Nate Silver, "The Signal and the Noise", 2012)

"Complexity has the propensity to overload systems, making the relevance of a particular piece of information not statistically significant. And when an array of mind-numbing factors is added into the equation, theory and models rarely conform to reality." (Lawrence K Samuels, "Defense of Chaos: The Chaology of Politics, Economics and Human Action", 2013)

"Complexity scientists concluded that there are just too many factors - both concordant and contrarian - to understand. And with so many potential gaps in information, almost nobody can see the whole picture. Complex systems have severe limits, not only to predictability but also to measurability. Some complexity theorists argue that modelling, while useful for thinking and for studying the complexities of the world, is a particularly poor tool for predicting what will happen." (Lawrence K Samuels, "Defense of Chaos: The Chaology of Politics, Economics and Human Action", 2013)

"One of the most powerful transformational catalysts is knowledge, new information, or logic that defies old mental models and ways of thinking" (Elizabeth Thornton, "The Objective Leader", 2015)

"The term data, unlike the related terms facts and evidence, does not connote truth. Data is descriptive, but data can be erroneous. We tend to distinguish data from information. Data is a primitive or atomic state (as in ‘raw data’). It becomes information only when it is presented in context, in a way that informs. This progression from data to information is not the only direction in which the relationship flows, however; information can also be broken down into pieces, stripped of context, and stored as data. This is the case with most of the data that’s stored in computer systems. Data that’s collected and stored directly by machines, such as sensors, becomes information only when it’s reconnected to its context."  (Stephen Few, "Signal: Understanding What Matters in a World of Noise", 2015)

🔭Data Science: Data Analysis (Just the Quotes)

"As in Mathematics, so in Natural Philosophy, the Investigation of difficult Things by the Method of Analysis, ought ever to precede the Method of Composition. This Analysis consists in making Experiments and Observations, and in drawing general Conclusions from them by Induction, and admitting of no Objections against the Conclusions but such as are taken from Experiments, or other certain Truths." (Sir Isaac Newton, "Opticks", 1704)

"The errors which arise from the absence of facts are far more numerous and more durable than those which result from unsound reasoning respecting true data." (Charles Babbage, "On the Economy of Machinery and Manufactures", 1832)

"In every branch of knowledge the progress is proportional to the amount of facts on which to build, and therefore to the facility of obtaining data." (James C Maxwell, [letter to Lewis Campbell] 1851)

"Not even the most subtle and skilled analysis can overcome completely the unreliability of basic data." (Roy D G Allen, "Statistics for Economists", 1951)

"The technical analysis of any large collection of data is a task for a highly trained and expensive man who knows the mathematical theory of statistics inside and out. Otherwise the outcome is likely to be a collection of drawings - quartered pies, cute little battleships, and tapering rows of sturdy soldiers in diversified uniforms - interesting enough in the colored Sunday supplement, but hardly the sort of thing from which to draw reliable inferences." (Eric T Bell, "Mathematics: Queen and Servant of Science", 1951)

"If data analysis is to be well done, much of it must be a matter of judgment, and ‘theory’ whether statistical or non-statistical, will have to guide, not command." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics, Vol. 33 (1), 1962)

"If one technique of data analysis were to be exalted above all others for its ability to be revealing to the mind in connection with each of many different models, there is little doubt which one would be chosen. The simple graph has brought more information to the data analyst’s mind than any other device. It specializes in providing indications of unexpected phenomena." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics Vol. 33 (1), 1962)

"The most important maxim for data analysis to heed, and one which many statisticians seem to have shunned is this: ‘Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.’ Data analysis must progress by approximate answers, at best, since its knowledge of what the problem really is will at best be approximate." (John W Tukey, "The Future of Data Analysis", Annals of Mathematical Statistics, Vol. 33, No. 1, 1962)

"The first step in data analysis is often an omnibus step. We dare not expect otherwise, but we equally dare not forget that this step, and that step, and other step, are all omnibus steps and that we owe the users of such techniques a deep and important obligation to develop ways, often varied and competitive, of replacing omnibus procedures by ones that are more sharply focused." (John W Tukey, "The Future of Processes of Data Analysis", 1965)

"The basic general intent of data analysis is simply stated: to seek through a body of data for interesting relationships and information and to exhibit the results in such a way as to make them recognizable to the data analyzer and recordable for posterity. Its creative task is to be productively descriptive, with as much attention as possible to previous knowledge, and thus to contribute to the mysterious process called insight." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Comparable objectives in data analysis are (l) to achieve more specific description of what is loosely known or suspected; (2) to find unanticipated aspects in the data, and to suggest unthought-of-models for the data's summarization and exposure; (3) to employ the data to assess the (always incomplete) adequacy of a contemplated model; (4) to provide both incentives and guidance for further analysis of the data; and (5) to keep the investigator usefully stimulated while he absorbs the feeling of his data and considers what to do next." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Data analysis must be iterative to be effective. [...] The iterative and interactive interplay of summarizing by fit and exposing by residuals is vital to effective data analysis. Summarizing and exposing are complementary and pervasive." (John W Tukey & Martin B Wilk, "Data Analysis and Statistics: An Expository Overview", 1966)

"Every student of the art of data analysis repeatedly needs to build upon his previous statistical knowledge and to reform that foundation through fresh insights and emphasis." (John W Tukey, "Data Analysis, Including Statistics", 1968)

"[...] bending the question to fit the analysis is to be shunned at all costs." (John W Tukey, "Analyzing Data: Sanctification or Detective Work?", 1969)

"Statistical methods are tools of scientific investigation. Scientific investigation is a controlled learning process in which various aspects of a problem are illuminated as the study proceeds. It can be thought of as a major iteration within which secondary iterations occur. The major iteration is that in which a tentative conjecture suggests an experiment, appropriate analysis of the data so generated leads to a modified conjecture, and this in turn leads to a new experiment, and so on." (George E P Box & George C Tjao, "Bayesian Inference in Statistical Analysis", 1973)

"Almost all efforts at data analysis seek, at some point, to generalize the results and extend the reach of the conclusions beyond a particular set of data. The inferential leap may be from past experiences to future ones, from a sample of a population to the whole population, or from a narrow range of a variable to a wider range. The real difficulty is in deciding when the extrapolation beyond the range of the variables is warranted and when it is merely naive. As usual, it is largely a matter of substantive judgment - or, as it is sometimes more delicately put, a matter of 'a priori nonstatistical considerations'." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"[…] it is not enough to say: 'There's error in the data and therefore the study must be terribly dubious'. A good critic and data analyst must do more: he or she must also show how the error in the measurement or the analysis affects the inferences made on the basis of that data and analysis." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"The use of statistical methods to analyze data does not make a study any more 'scientific', 'rigorous', or 'objective'. The purpose of quantitative analysis is not to sanctify a set of findings. Unfortunately, some studies, in the words of one critic, 'use statistics as a drunk uses a street lamp, for support rather than illumination'. Quantitative techniques will be more likely to illuminate if the data analyst is guided in methodological choices by a substantive understanding of the problem he or she is trying to learn about. Good procedures in data analysis involve techniques that help to (a) answer the substantive questions at hand, (b) squeeze all the relevant information out of the data, and (c) learn something new about the world." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"Typically, data analysis is messy, and little details clutter it. Not only confounding factors, but also deviant cases, minor problems in measurement, and ambiguous results lead to frustration and discouragement, so that more data are collected than analyzed. Neglecting or hiding the messy details of the data reduces the researcher's chances of discovering something new." (Edward R Tufte, "Data Analysis for Politics and Policy", 1974)

"[...] be wary of analysts that try to quantify the unquantifiable." (Ralph Keeney & Raiffa Howard, "Decisions with Multiple Objectives: Preferences and Value Trade-offs", 1976)

"[...] exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as for those we believe might be there. Except for its emphasis on graphs, its tools are secondary to its purpose." (John W Tukey, [comment] 1979)

"[...] any hope that we are smart enough to find even transiently optimum solutions to our data analysis problems is doomed to failure, and, indeed, if taken seriously, will mislead us in the allocation of effort, thus wasting both intellectual and computational effort." (John W Tukey, "Choosing Techniques for the Analysis of Data", 1981)

"The fact must be expressed as data, but there is a problem in that the correct data is difficult to catch. So that I always say 'When you see the data, doubt it!' 'When you see the measurement instrument, doubt it!' [...]For example, if the methods such as sampling, measurement, testing and chemical analysis methods were incorrect, data. […] to measure true characteristics and in an unavoidable case, using statistical sensory test and express them as data." (Kaoru Ishikawa, Annual Quality Congress Transactions, 1981)

"Exploratory data analysis, EDA, calls for a relatively free hand in exploring the data, together with dual obligations: (•) to look for all plausible alternatives and oddities - and a few implausible ones, (graphic techniques can be most helpful here) and (•) to remove each appearance that seems large enough to be meaningful - ordinarily by some form of fitting, adjustment, or standardization [...] so that what remains, the residuals, can be examined for further appearances." (John W Tukey, "Introduction to Styles of Data Analysis Techniques", 1982)

"A competent data analysis of an even moderately complex set of data is a thing of trials and retreats, of dead ends and branches." (John W Tukey, Computer Science and Statistics: Proceedings of the 14th Symposium on the Interface, 1983)

"Data in isolation are meaningless, a collection of numbers. Only in context of a theory do they assume significance […]" (George Greenstein, "Frozen Star", 1983)

"Iteration and experimentation are important for all of data analysis, including graphical data display. In many cases when we make a graph it is immediately clear that some aspect is inadequate and we regraph the data. In many other cases we make a graph, and all is well, but we get an idea for studying the data in a different way with a different graph; one successful graph often suggests another." (William S Cleveland, "The Elements of Graphing Data", 1985)

"There are some who argue that a graph is a success only if the important information in the data can be seen within a few seconds. While there is a place for rapidly-understood graphs, it is too limiting to make speed a requirement in science and technology, where the use of graphs ranges from, detailed, in-depth data analysis to quick presentation." (William S Cleveland, "The Elements of Graphing Data", 1985)

"A first analysis of experimental results should, I believe, invariably be conducted using flexible data analytical techniques - looking at graphs and simple statistics - that so far as possible allow the data to 'speak for themselves'. The unexpected phenomena that such a approach often uncovers can be of the greatest importance in shaping and sometimes redirecting the course of an ongoing investigation." (George Box, "Signal to Noise Ratios, Performance Criteria, and Transformations", Technometrics 30, 1988)

"Data analysis is an art practiced by individuals who are skilled at quantitative reasoning and have much experience in looking at numbers and detecting  patterns in data. Usually these individuals have some background in statistics." (David Lubinsky, Daryl Pregibon , "Data analysis as search", Journal of Econometrics Vol. 38 (1–2), 1988)

"Like a detective, a data analyst will experience many dead ends, retrace his steps, and explore many alternatives before settling on a single description of the evidence in front of him." (David Lubinsky & Daryl Pregibon , "Data analysis as search", Journal of Econometrics Vol. 38 (1–2), 1988)

"[…] data analysis in the context of basic mathematical concepts and skills. The ability to use and interpret simple graphical and numerical descriptions of data is the foundation of numeracy […] Meaningful data aid in replacing an emphasis on calculation by the exercise of judgement and a stress on interpreting and communicating results." (David S Moore, "Statistics for All: Why, What and How?", 1990)

"Data analysis is rarely as simple in practice as it appears in books. Like other statistical techniques, regression rests on certain assumptions and may produce unrealistic results if those assumptions are false. Furthermore it is not always obvious how to translate a research question into a regression model." (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"Data analysis typically begins with straight-line models because they are simplest, not because we believe reality is inherently linear. Theory or data may suggest otherwise [...]" (Lawrence C Hamilton, "Regression with Graphics: A second course in applied statistics", 1991)

"90 percent of all problems can be solved by using the techniques of data stratification, histograms, and control charts. Among the causes of nonconformance, only one-fifth or less are attributable to the workers." (Kaoru Ishikawa, The Quality Management Journal Vol. 1, 1993)

"Probabilistic inference is the classical paradigm for data analysis in science and technology. It rests on a foundation of randomness; variation in data is ascribed to a random process in which nature generates data according to a probability distribution. This leads to a codification of uncertainly by confidence intervals and hypothesis tests." (William S Cleveland, "Visualizing Data", 1993)

"Visualization is an approach to data analysis that stresses a penetrating look at the structure of data. No other approach conveys as much information. […] Conclusions spring from data when this information is combined with the prior knowledge of the subject under investigation." (William S Cleveland, "Visualizing Data", 1993)

"When the distributions of two or more groups of univariate data are skewed, it is common to have the spread increase monotonically with location. This behavior is monotone spread. Strictly speaking, monotone spread includes the case where the spread decreases monotonically with location, but such a decrease is much less common for raw data. Monotone spread, as with skewness, adds to the difficulty of data analysis. For example, it means that we cannot fit just location estimates to produce homogeneous residuals; we must fit spread estimates as well. Furthermore, the distributions cannot be compared by a number of standard methods of probabilistic inference that are based on an assumption of equal spreads; the standard t-test is one example. Fortunately, remedies for skewness can cure monotone spread as well." (William S Cleveland, "Visualizing Data", 1993)

"A careful and sophisticated analysis of the data is often quite useless if the statistician cannot communicate the essential features of the data to a client for whom statistics is an entirely foreign language." (Christopher J Wild, "Embracing the ‘Wider view’ of Statistics", The American Statistician 48, 1994)

"Science is not impressed with a conglomeration of data. It likes carefully constructed analysis of each problem." (Daniel E Koshland Jr, Science Vol. 263 (5144), [editorial] 1994)

"So we pour in data from the past to fuel the decision-making mechanisms created by our models, be they linear or nonlinear. But therein lies the logician's trap: past data from real life constitute a sequence of events rather than a set of independent observations, which is what the laws of probability demand. [...] It is in those outliers and imperfections that the wildness lurks." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996)

"Data are generally collected as a basis for action. However, unless potential signals are separated from probable noise, the actions taken may be totally inconsistent with the data. Thus, the proper use of data requires that you have simple and effective methods of analysis which will properly separate potential signals from probable noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"No matter what the data, and no matter how the values are arranged and presented, you must always use some method of analysis to come up with an interpretation of the data.
While every data set contains noise, some data sets may contain signals. Therefore, before you can detect a signal within any given data set, you must first filter out the noise." (Donald J Wheeler," Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"The purpose of analysis is insight. The best analysis is the simplest analysis which provides the needed insight." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Without meaningful data there can be no meaningful analysis. The interpretation of any data set must be based upon the context of those data." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Statistical analysis of data can only be performed within the context of selected assumptions, models, and/or prior distributions. A statistical analysis is actually the extraction of substantive information from data and assumptions. And herein lies the rub, understood well by Disraeli and others skeptical of our work: For given data, an analysis can usually be selected which will result in 'information' more favorable to the owner of the analysis then is objectively warranted." (Stephen B Vardeman & Max D Morris, "Statistics and Ethics: Some Advice for Young Statisticians", The American Statistician vol 57, 2003)

"Exploratory Data Analysis is more than just a collection of data-analysis techniques; it provides a philosophy of how to dissect a data set. It stresses the power of visualisation and aspects such as what to look for, how to look for it and how to interpret the information it contains. Most EDA techniques are graphical in nature, because the main aim of EDA is to explore data in an open-minded way. Using graphics, rather than calculations, keeps open possibilities of spotting interesting patterns or anomalies that would not be apparent with a calculation (where assumptions and decisions about the nature of the data tend to be made in advance)." (Alan Graham, "Developing Thinking in Statistics", 2006)

"It is the aim of all data analysis that a result is given in form of the best estimate of the true value. Only in simple cases is it possible to use the data value itself as result and thus as best estimate." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Put simply, statistics is a range of procedures for gathering, organizing, analyzing and presenting quantitative data. […] Essentially […], statistics is a scientific approach to analyzing numerical data in order to enable us to maximize our interpretation, understanding and use. This means that statistics helps us turn data into information; that is, data that have been interpreted, understood and are useful to the recipient. Put formally, for your project, statistics is the systematic collection and analysis of numerical data, in order to investigate or discover relationships among phenomena so as to explain, predict and control their occurrence." (Reva B Brown & Mark Saunders, "Dealing with Statistics: What You Need to Know", 2008)

"Data analysis is careful thinking about evidence." (Michael Milton, "Head First Data Analysis", 2009)

"Doing data analysis without explicitly defining your problem or goal is like heading out on a road trip without having decided on a destination." (Michael Milton, "Head First Data Analysis", 2009)

"The discrepancy between our mental models and the real world may be a major problem of our times; especially in view of the difficulty of collecting, analyzing, and making sense of the unbelievable amount of data to which we have access today." (Ugo Bardi, "The Limits to Growth Revisited", 2011)

"Data analysis is not generally thought of as being simple or easy, but it can be. The first step is to understand that the purpose of data analysis is to separate any signals that may be contained within the data from the noise in the data. Once you have filtered out the noise, anything left over will be your potential signals. The rest is just details." (Donald J Wheeler," Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"The four questions of data analysis are the questions of description, probability, inference, and homogeneity. Any data analyst needs to know how to organize and use these four questions in order to obtain meaningful and correct results. [...] 
THE DESCRIPTION QUESTION: Given a collection of numbers, are there arithmetic values that will summarize the information contained in those numbers in some meaningful way?
THE PROBABILITY QUESTION: Given a known universe, what can we say about samples drawn from this universe? [...] 
THE INFERENCE QUESTION: Given an unknown universe, and given a sample that is known to have been drawn from that unknown universe, and given that we know everything about the sample, what can we say about the unknown universe? [...] 
THE HOMOGENEITY QUESTION: Given a collection of observations, is it reasonable to assume that they came from one universe, or do they show evidence of having come from multiple universes?" (Donald J Wheeler," Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Each systems archetype embodies a particular theory about dynamic behavior that can serve as a starting point for selecting and formulating raw data into a coherent set of interrelationships. Once those relationships are made explicit and precise, the 'theory' of the archetype can then further guide us in our data-gathering process to test the causal relationships through direct observation, data analysis, or group deliberation." (Daniel H Kim, "Systems Archetypes as Dynamic Theories", The Systems Thinker Vol. 24 (1), 2013)

"A complete data analysis will involve the following steps: (i) Finding a good model to fit the signal based on the data. (ii) Finding a good model to fit the noise, based on the residuals from the model. (iii) Adjusting variances, test statistics, confidence intervals, and predictions, based on the model for the noise.(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"The random element in most data analysis is assumed to be white noise - normal errors independent of each other. In a time series, the errors are often linked so that independence cannot be assumed (the last examples). Modeling the nature of this dependence is the key to time series.(DeWayne R Derryberry, "Basic data analysis for time series with R", 2014)

"Statistics is an integral part of the quantitative approach to knowledge. The field of statistics is concerned with the scientific study of collecting, organizing, analyzing, and drawing conclusions from data." (Kandethody M Ramachandran & Chris P Tsokos, "Mathematical Statistics with Applications in R" 2nd Ed., 2015)

"Too often there is a disconnect between the people who run a study and those who do the data analysis. This is as predictable as it is unfortunate. If data are gathered with particular hypotheses in mind, too often they (the data) are passed on to someone who is tasked with testing those hypotheses and who has only marginal knowledge of the subject matter. Graphical displays, if prepared at all, are just summaries or tests of the assumptions underlying the tests being done. Broader displays, that have the potential of showing us things that we had not expected, are either not done at all, or their message is not able to be fully appreciated by the data analyst." (Howard Wainer, Comment, Journal of Computational and Graphical Statistics Vol. 20(1), 2011)

"The dialectical interplay of experiment and theory is a key driving force of modern science. Experimental data do only have meaning in the light of a particular model or at least a theoretical background. Reversely theoretical considerations may be logically consistent as well as intellectually elegant: Without experimental evidence they are a mere exercise of thought no matter how difficult they are. Data analysis is a connector between experiment and theory: Its techniques advise possibilities of model extraction as well as model testing with experimental data." (Achim Zielesny, "From Curve Fitting to Machine Learning" 2nd Ed., 2016)

"Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. 'Structure' has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Symmetries directly point to invariants, which pinpoint intrinsic properties of the data and of the background empirical domain of interest. As our data models change, so too do our perspectives on analysing data." (Fionn Murtagh, "Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics", 2018)

"[…] the data itself can lead to new questions too. In exploratory data analysis (EDA), for example, the data analyst discovers new questions based on the data. The process of looking at the data to address some of these questions generates incidental visualizations - odd patterns, outliers, or surprising correlations that are worth looking into further." (Danyel Fisher & Miriah Meyer, "Making Data Visual", 2018)

"Analysis is a two-step process that has an exploratory and an explanatory phase. In order to create a powerful data story, you must effectively transition from data discovery (when you’re finding insights) to data communication (when you’re explaining them to an audience). If you don’t properly traverse these two phases, you may end up with something that resembles a data story but doesn’t have the same effect. Yes, it may have numbers, charts, and annotations, but because it’s poorly formed, it won’t achieve the same results." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"While visuals are an essential part of data storytelling, data visualizations can serve a variety of purposes from analysis to communication to even art. Most data charts are designed to disseminate information in a visual manner. Only a subset of data compositions is focused on presenting specific insights as opposed to just general information. When most data compositions combine both visualizations and text, it can be difficult to discern whether a particular scenario falls into the realm of data storytelling or not." (Brent Dykes, "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals", 2019)

"If the data that go into the analysis are flawed, the specific technical details of the analysis don’t matter. One can obtain stupid results from bad data without any statistical trickery. And this is often how bullshit arguments are created, deliberately or otherwise. To catch this sort of bullshit, you don’t have to unpack the black box. All you have to do is think carefully about the data that went into the black box and the results that came out. Are the data unbiased, reasonable, and relevant to the problem at hand? Do the results pass basic plausibility checks? Do they support whatever conclusions are drawn?" (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"We all know that the numerical values on each side of an equation have to be the same. The key to dimensional analysis is that the units have to be the same as well. This provides a convenient way to keep careful track of units when making calculations in engineering and other quantitative disciplines, to make sure one is computing what one thinks one is computing. When an equation exists only for the sake of mathiness, dimensional analysis often makes no sense." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Overall [...] everyone also has a need to analyze data. The ability to analyze data is vital in its understanding of product launch success. Everyone needs the ability to find trends and patterns in the data and information. Everyone has a need to ‘discover or reveal (something) through detailed examination’, as our definition says. Not everyone needs to be a data scientist, but everyone needs to drive questions and analysis. Everyone needs to dig into the information to be successful with diagnostic analytics. This is one of the biggest keys of data literacy: analyzing data." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

[Murphy’s Laws of Analysis:] "(1) In any collection of data, the figures that are obviously correct contain errors. (2) It is customary for a decimal to be misplaced. (3) An error that can creep into a calculation, will. Also, it will always be in the direction that will cause the most damage to the calculation." (G C Deakly)

"We must include in any language with which we hope to describe complex data-processing situations the capability for describing data. We must also include a mechanism for determining the priorities to be applied to the data. These priorities are not fixed and are indicated in many cases by the data." (Grace Hopper) 

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.