SQL Troubles: data analysis

24 May 2026

🖍️Cheryl Cihon - Collected Quotes

"A combination of graphical and tabular presentations may be used to good advantage. The former illustrates most effectively qualitative characteristics (e.g., changes of data with time or sequence) while the latter is the best means to present quantitative information." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"Each systematic error associated with a given measurement process is always of the same sign and magnitude. It persists measurement after measurement. When its existence is established, such an error is called a bias, and reasonable effort should be made to correct for it. Sometimes the observed bias is the result of the concurrence of several biases that cannot or at least have not been individually identified. One of the purposes of statistical treatment of data is to decide whether an apparently erroneous result is real and indicates a bias or whether it could happen as the result of chance variability, even in a well-behaved measurement system. There can be, of course, biases that have not been identified as such. Also, there are limits to how well one can correct for known biases, and this inadequacy must be considered when limits of uncertainty are assigned to data." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"Essentially, the null hypothesis is that there is not a significant difference between two results. It will be seen that differences may have to be quite large in some instances before they are statistically significant, especially in the case of small data sets of high variability. Statistics will not say whether or not an apparent difference is real, but will only give the probability that it could have been as large as it is by chance alone. Often, the answer will be that there is no reason to believe a difference exists other than due to a chance occurrence, based on the statistical evidence available. Remember that this is not saying that there is no difference but that the evidence presented is insufficient to support the belief that the difference is not more than a random effect." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"Frequency distributions, commonly called histograms, are special kinds of bar charts that are used widely for displaying variability of scientific and technical information. Such displays may be used to demonstrate that a normal distribution is or is not achieved [...]. Generally, a minimum of 25 data points is required to prepare a good bar chart, and considerably more is highly desirable. The data are divided into groups bounded by cells of fixed limits. The number of cells chosen to cover the range of values for the data is somewhat arbitrary. If too few, a distribution can lack resolution; if too many, there can be numerous unpopulated cells in the case of small data sets. Trial and error may be used in a specific case to decide what is most effective." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"Nomographs are effective ways to graphically calculate various functionally related quantities. Nomographs are really graphical computational devices. They were once used widely in engineering situations when calculating was more laborious than at the present time, and they still can be useful when complex relationships are concerned. In brief, scales are laid out in which the scale intervals and placement of the lines are chosen by well-established procedures. A straight edge can then be used to interconnect independent variables so the corresponding values of dependent variables can be read." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"Pie charts are more comprehensible as the sectors are approximately equal. A feeling of relationship is lost as very small sectors are placed alongside very large ones. In any case, numerical values need to be inserted in the sectors or related to them by lines or arrows to provide numerical significance, since the eye is not a good quantitative judge of the relative areas of sectors. The total number of sectors used should be reasonably small. While not a hard and fast rule, a maximum of eight sectors is a reasonable number. Sectors may be homogeneous or consist of conglomerates of several items. The information contained in a sector may be displayed as a separate pie chart. This is an effective way to handle conglomerates." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"The inevitability of variability complicates the evaluation and use of data. It must be recognized that many uses require data quality that may be difficult to achieve. There are minimum quality standards required for every measurement situation (sometimes called data quality objectives). These standards should be established in advance and both the producer and the user must be able to determine whether they have been met. The only way that this can be accomplished is to attain statistical control of the measurement process and to apply valid statistical procedures in the analysis of the data." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"The quantitative accuracy of what is measured is an obvious indicator of data quality. Because of inescapable variability, data will always have some degree of uncertainty. When measurement plans are properly made and adequately executed, it is possible to assign quantitative limits of uncertainty to measured values." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"The second type of uncertainty results from random causes that produce fluctuations in both sign and magnitude, the latter within well-defined limits, however. In the long run, the random error averages out to zero. The random error accounts for the variability of individual measurements and it will be shown that it can be statistically characterized by what is called a standard deviation. This term is thus a measure of the dispersion of the data around a mean or average value. When the value of the standard deviation is small, the data cluster closely around the mean; when it is large, the spread is greater." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"The use of tables is perhaps the most common method for presentation of data. The format will vary, depending on what information is needed to be conveyed. Even a cursory perusal of the scientific literature will reveal many examples of both good and poor tables. A good table is simply one that presents data in an easily understandable manner. Tables should be relatively simple in order to promote understanding and the columns should have a clear relationship to each other. Column titles should be as brief as possible, consistent with clarity. Footnotes may be needed in some cases to provide further explanation of the headings." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

"Variability is inevitable in a measurement process. The operation of a measurement process does not produce one number but a variety of numbers. Each time it is applied to a measurement situation it can be expected to produce a slightly different number or sets of numbers. The means of sets of numbers will differ among themselves, but to a lesser degree than the individual values. One must distinguish between natural variability and instability. Gross instability can arise from many sources, including lack of control of the process. Failure to control steps that introduce bias also can introduce variability. Thus, any variability in calibration, done to minimize bias, can produce variability of measured values." (Cheryl Cihon & John K Taylor, "Statistical Techniques for Data Analysis" 2nd. ed., 2005)

09 August 2025

🧭Business Intelligence: Perspectives (Part 33: Data Lifecycle for Analytics)

Business Intelligence Series

In the context of BI, Analytics and other data-related topics, the various parties usually talk about data ingestion, preparation, storage, analysis and visualization, often ignoring processes like data generation, collection, and interpretation. It’s also true that a broader discussion may shift the attention unnecessarily, though it’s important to increase people’s awareness of data’s full lifecycle. Otherwise, many data solutions become a mix of castles in the air and houses of cards waiting for the next gust of wind to blow them away.

Data is generated continuously by organizations, their customers, vendors, and third parties, as part of a complex network of processes, systems and integrations that extend beyond their intended boundaries. Independently of their type, scope and various other characteristics, all processes consume and generate data at a rapid pace that steadily exceeds organizations’ capabilities to make good use of it.

There are also scenarios in which the data must be collected via surveys, interviews, forms, measurements or direct observations, or whatever other processes are used to elicit the aspects of interest. The volume and other characteristics of data generated in this way may depend on the goals and objectives in scope, as well as on the methods, procedures and even the methodologies used.

Data ingestion is the process of importing data from the various sources into a central or intermediary repository for storage, processing, analysis and visualization. The repository can be a data mart, warehouse, lakehouse, data lake or any other intermediary or final destination for the data. Moreover, data can have different levels of quality with respect to its intended usage.

Data storage refers to the systems and approaches used to securely retain, organize, and access data throughout its journey within the various layers of the infrastructure. It focuses on where and how data is stored, regardless of whether that’s done on-premises, in the cloud or across hybrid environments.

Data preparation is the process of transforming the data into a form close to what is intended for analysis and visualization. It may involve data aggregation, enrichment, transposition and other operations that facilitate further steps. It’s probably the most important step in a data project given that the final outcome can have an important impact on data analysis and visualization, facilitating or impeding the respective processes.
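The operations mentioned above can be sketched in a few lines of Python. The example is only an illustration: the sales rows, the region lookup and all names are hypothetical, and real projects would typically use SQL or a dataframe library for the same steps.

```python
from collections import defaultdict

# Hypothetical raw rows, as they might arrive after ingestion.
sales = [
    {"customer": "A", "month": "2025-01", "amount": 100.0},
    {"customer": "B", "month": "2025-01", "amount": 50.0},
    {"customer": "A", "month": "2025-02", "amount": 70.0},
    {"customer": "C", "month": "2025-02", "amount": 30.0},
]
# Hypothetical lookup used for enrichment.
regions = {"A": "North", "B": "South", "C": "North"}

# Enrichment: add an attribute the source rows do not carry.
enriched = [{**row, "region": regions[row["customer"]]} for row in sales]

# Aggregation: total amount per (region, month).
totals = defaultdict(float)
for row in enriched:
    totals[(row["region"], row["month"])] += row["amount"]

# Transposition: one row per region, one column per month.
months = sorted({row["month"] for row in enriched})
pivot = {}
for (region, month), amount in totals.items():
    pivot.setdefault(region, {m: 0.0 for m in months})[month] = amount

for region in sorted(pivot):
    print(region, pivot[region])
```

The pivoted form is closer to what a report or chart expects, which is the point of the preparation step.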

Data analysis consists of a multitude of processes that attempt to harness value from data in its various forms of aggregation. The ultimate purpose is to infer meaningful information and knowledge from the data, augmented as insights. The road from raw data to these targeted outcomes is a tedious one, where recipes can both help and impede. Expecting value from any pile of data can easily become a costly illusion when data, processes and their usage are poorly understood and harnessed.

Data visualization is the means of presenting data and its characteristics in the form of figures, diagrams and other forms of representation that facilitate data’s navigation, perception and understanding for various purposes. Usually, the final purpose is fact-checking, decision-making, problem-solving, etc., though there is a multitude of steps in between. Especially in these areas, good and poor practices are mixed together.

Data interpretation is the attempt to draw meaningful conclusions from the data, information and knowledge gained mainly from data analysis and visualization. It is often subjective, as it depends on people’s understanding of the various facts considered. The inferences made in the process can be a matter of gut feeling or of mature analysis. It’s about sense-making, contextualization, critical thinking, pattern recognition, internalization and externalization, and other similar cognitive processes.

17 July 2025

🤖〽️Prompt Engineering: Copilot Unabridged (Part 54: Will AI Make Data Analysis Obsolete?)

Prompt Engineering Series

Prompt: "write a blogpost of 600 words on how AI may start making data analysis obsolete"

Introduction

Artificial intelligence (AI) is revolutionizing industries, automating tasks, and reshaping how businesses process information. One of the most debated topics in the data-driven world is whether AI will eventually make traditional data analysis obsolete. While AI is undeniably transforming the field, the reality is more nuanced than a simple replacement scenario.

The Rise of AI in Data Analysis

AI-powered analytics tools have demonstrated impressive capabilities in processing vast amounts of data, identifying patterns, and generating insights. These tools can:

Automate data cleaning and preprocessing, reducing the time analysts spend on tedious tasks.
Detect trends and anomalies faster than human analysts.
Generate predictive models that anticipate future outcomes based on historical data.
Provide real-time insights, allowing businesses to make quicker decisions.

AI-driven automation is particularly useful for repetitive tasks, such as sorting and structuring data, enabling analysts to focus on higher-level problem-solving.

How AI is Changing the Role of Data Analysts

Rather than making data analysts obsolete, AI is shifting their responsibilities. Analysts are increasingly becoming AI supervisors, guiding AI-generated insights, ensuring accuracy, and refining AI-driven solutions. Instead of manually analyzing every dataset, analysts are leveraging AI to enhance productivity and streamline workflows.

AI is also democratizing data analysis by enabling non-experts to generate insights using natural language queries. Low-code and no-code platforms powered by AI allow users to extract meaningful information without extensive technical knowledge. While this reduces the barrier to entry, it does not eliminate the need for skilled analysts who understand data integrity, business context, and strategic decision-making.

Limitations of AI in Data Analysis

Despite its advancements, AI still faces significant limitations in data analysis:

Lack of Contextual Understanding: AI can identify correlations, but it struggles with interpreting causation and business context. Human analysts bring intuition, industry expertise, and strategic thinking that AI cannot replicate.
Error-Prone Insights: AI-generated insights are not always reliable. Bias in training data, incorrect assumptions, and flawed algorithms can lead to misleading conclusions. Analysts play a crucial role in validating AI-generated findings.
Ethical and Security Concerns: AI-driven analytics must adhere to strict regulations regarding data privacy and security. Businesses need human oversight to ensure compliance with laws such as GDPR.
Complex Decision-Making: Large-scale business decisions require a combination of data-driven insights and human judgment. AI can assist in analysis, but human expertise is essential for interpreting results and making strategic choices.

The Future of Data Analysis in an AI-Driven World

While AI is transforming data analysis, it is unlikely to make analysts obsolete. Instead, the role of data professionals will evolve into a more strategic and supervisory position. Analysts will focus on guiding AI, ensuring data quality, and solving complex problems that require human intuition.

AI will continue to enhance productivity, automate repetitive tasks, and democratize data analysis, but human analysts will remain essential for innovation, security, and ethical decision-making. The future of data analysis is not about replacement - it’s about collaboration between AI and human intelligence.

Conclusion

AI is revolutionizing data analysis, but it is not eliminating the need for human analysts. Instead, it is reshaping the industry, making data processing more efficient and accessible while requiring human oversight for creativity, security, and complex problem-solving. Rather than fearing obsolescence, data analysts should embrace AI as a powerful tool that enhances their capabilities and expands the possibilities of data-driven decision-making.

Disclaimer: The whole text was generated by Copilot (under Windows 10) at the first attempt. This is just an experiment to evaluate the feature's ability to answer standard general questions, regardless of whether they are correctly or incorrectly posed. Moreover, the answers may reflect hallucinations and other types of inconsistent or incorrect reasoning.

15 July 2025

🤖〽️Prompt Engineering: Copilot Unabridged (Part 53: The Future of Business Intelligence - Will AI Make It Obsolete?)

Prompt Engineering Series

Prompt: "write a blogpost of 600 words on how AI may start making business intelligence obsolete"

Introduction

Business intelligence (BI) has long been the backbone of data-driven decision-making, helping organizations analyze trends, optimize operations, and gain competitive advantages. However, as artificial intelligence (AI) continues to evolve, many wonder whether traditional BI tools and methodologies will become obsolete. AI’s ability to process vast amounts of data, generate insights autonomously, and adapt in real time is reshaping the landscape of business analytics. But does this mean BI will disappear entirely, or will it simply evolve?

The Shift from Traditional BI to AI-Driven Analytics

Traditional BI relies on structured data, dashboards, and human interpretation to extract meaningful insights. Analysts and business leaders use BI tools to generate reports, visualize trends, and make informed decisions. However, AI is introducing a new paradigm - one where data analysis is automated, predictive, and adaptive.

AI-driven analytics can:

Process unstructured data from sources like social media, emails, and customer interactions.
Identify patterns and correlations that human analysts might overlook.
Provide real-time insights without requiring manual report generation.
Predict future trends using machine learning models.

These capabilities suggest that AI is not just enhancing BI - it is fundamentally transforming it.

Why AI Might Replace Traditional BI Tools

Several factors indicate that AI could make traditional BI tools obsolete:

Automation of Data Analysis: AI eliminates the need for manual data processing, allowing businesses to generate insights instantly. Traditional BI tools require human intervention to clean, structure, and interpret data, whereas AI can automate these processes.
Predictive and Prescriptive Analytics: While BI focuses on historical data, AI-driven analytics predict future trends and prescribe actions. Businesses can move beyond reactive decision-making and adopt proactive strategies based on AI-generated forecasts.
Natural Language Processing (NLP) for Data Queries: AI-powered systems enable users to ask questions in natural language rather than navigating complex dashboards. This makes data analysis more accessible to non-technical users, reducing reliance on BI specialists.
Continuous Learning and Adaptation: AI models improve over time, refining their predictions and insights based on new data. Traditional BI tools require manual updates and adjustments, whereas AI evolves autonomously.

Challenges and Limitations of AI in Business Intelligence

Despite AI’s advancements, there are reasons why BI may not become entirely obsolete:

Data Governance and Compliance: AI-driven analytics must adhere to strict regulations regarding data privacy and security. Businesses need human oversight to ensure compliance with laws such as GDPR.
Interpretability and Trust: AI-generated insights can sometimes be opaque, making it difficult for business leaders to trust automated recommendations. Traditional BI tools provide transparency in data analysis.
Human Expertise in Decision-Making: AI can generate insights, but human intuition and strategic thinking remain essential for complex business decisions. AI should complement, not replace, human expertise.

The Future: AI-Augmented Business Intelligence

Rather than making BI obsolete, AI is likely to augment and enhance business intelligence. The future of BI will involve AI-powered automation, predictive analytics, and real-time decision-making, but human oversight will remain crucial.

Organizations that embrace AI-driven BI will gain a competitive edge, leveraging automation while maintaining strategic control. The key is to integrate AI as a collaborative tool rather than a complete replacement for traditional BI methodologies.

Conclusion

AI is revolutionizing business intelligence, but it is unlikely to make it entirely obsolete. Instead, BI will evolve into a more automated, predictive, and adaptive system powered by AI. Businesses that integrate AI-driven analytics will benefit from faster insights, improved decision-making, and enhanced efficiency.

The future of AI is not about replacement - it’s about transformation. AI will redefine how businesses analyze data, but human expertise will remain essential in shaping strategic decisions.

15 February 2025

🧭Business Intelligence: Perspectives (Part 27: A Tale of Two Cities II)

Business Intelligence Series

There’s a saying that applies to many contexts ranging from software engineering to data analysis and visualization related solutions: "fools rush in where angels fear to tread" [1]. Much earlier, an adage attributed to Confucius provides a similar perspective: "do not try to rush things; ignore matters of minor advantage". Ignoring such advice, there’s the drive in rapid prototyping to jump in with both feet without first checking how solid the ground is, often even without having adequate experience in the field. That’s understandable to some degree – people want to see progress and value fast, without building a foundation or getting an understanding of what’s happening or what’s possible, often ignoring the full extent of the problems.

A prototype helps to bring the requirements closer to what one intends to achieve, though, as practice often shows, bridging the gap between the initial steps and the final solution requires many iterations, sometimes too many for the solution to be cost-effective. There’s almost always a tradeoff between costs and quality, and between time and scope. Sooner or later, one must compromise somewhere in between even if the solution is not optimal. The fuzzier the requirements and what’s achievable with a set of data, the harder it gets to find the sweet spot.

Even if people understand the steps, constraints and further aspects of a process relatively easily, making sense of the data generated by it and using that data to optimize the process can take considerable effort. There’s a chain of tradeoffs and constraints that applies to a given situation in each context, which makes it challenging to always find optimal solutions. Moreover, locally optimal solutions don’t necessarily provide the optimum effect when one looks at the broader context of the problems. Furthermore, even if one has brought a process under control, it doesn’t necessarily mean that the process works efficiently.

This is the broader context in which data analysis and visualization topics need to be placed to build useful solutions and to make a noticeable difference in one’s job. Especially when the data and processes look numb, one needs to find the perspectives that lead to useful information and knowledge. It’s not realistic to expect to find new insight in any set of data. As experience often proves, insight is rarer than gold nuggets. Probably, the most important aspect in gold mining is to know where to look, though it also requires luck, research, the proper use of tools, effort, and probably much more.

One of the problems in working with data is that data is usually analyzed and visualized in aggregates at different levels, often without identifying and depicting the factors that determine why data take certain shapes. Even if a well-suited set of dimensions is defined for data analysis, data are usually still considered in aggregate. Being able to switch between aggregates and details is essential for understanding the data, or at least for understanding what’s happening in the various processes.
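A minimal Python sketch of why the details matter: in the made-up example below, two plants show the same average delivery delay, and only the detail lines reveal that they behave differently. The data and the functions `aggregate` and `drill_down` are hypothetical.

```python
from statistics import mean

# Hypothetical order lines: (plant, delivery delay in days).
lines = [("P1", 2), ("P1", 2), ("P1", 2), ("P2", 0), ("P2", 0), ("P2", 6)]


def aggregate(rows):
    """Average delay per plant: the aggregated view."""
    plants = sorted({plant for plant, _ in rows})
    return {p: mean(d for q, d in rows if q == p) for p in plants}


def drill_down(rows, plant):
    """The detail values behind the aggregate of one plant."""
    return [delay for p, delay in rows if p == plant]


print(aggregate(lines))
for plant in ("P1", "P2"):
    print(plant, drill_down(lines, plant))
```

In the aggregate both plants look identical, while the drill-down shows that one plant is consistently slightly late and the other is mostly on time with one large outlier.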

There is one aspect of data modeling, analysis and visualization that’s typically ignored in BI initiatives – from a process perspective, some data is usually not available, and approximating the respective values is often far from the optimal solution. Of course, there’s often a tradeoff between effort and value, though the actual value can be quantified only after gathering enough data for a thorough first analysis. It may also happen that the only benefit is a deeper understanding of certain aspects of the processes or of the business. Occasionally, this price may look high, though searching for cost-effective solutions is part of the job!

References:
[1] Alexander Pope (cca. 1711) An Essay on Criticism

06 August 2024

🧭Business Intelligence: Perspectives (Part 16: On the Cusps of Complexity)

Business Intelligence Series

We live in a complex world, which makes it difficult to model and work with the complex models that attempt to represent it. Thus, we try to simplify it to the degree that it becomes processable and understandable for us, while further simplification is needed when we try to depict it by digital means that make it processable by machines and, in turn, by us. Whenever we simplify something, we lose some aspects, which might be acceptable in many cases but can create issues in a number of other ways.

Each layer of simplification results in a model that addresses some parts of the world while ignoring others, which restricts the model’s usability, sometimes to the degree that it becomes unusable. The more one moves toward the extremes of oversimplification or complexification, the higher the chances for models to become unusable.

This aspect is also relevant for the business processes we deal with. Many processes are oversimplified to the degree that we track only the entry and exit points and the quantitative aspects we are interested in. In theory, this information should be enough for answering some business questions, though it might be insufficient when one dives deeper into the processes. One can try to approximate, however there are high chances that such approximations deviate too much from the values approximated, which can lead to strange outcomes.

Therefore, when a date or other values are important, organizations consider adding more fields to reflect the implemented process with higher accuracy. Unfortunately, unless we save a history of all the important changes in the data, it becomes challenging to derive the snapshots we need for our analyses. Moreover, it is even more challenging to obtain consistent snapshots. There are systems which attempt to obtain such snapshots through the implementation of the processes, though this approach also involves some complexity and other challenges.
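To make the idea of deriving snapshots from a history of changes more concrete, here is a minimal Python sketch. The structure of the change log (record id, field, new value, date of change) and the function `snapshot` are assumptions made for illustration; actual systems log changes in their own ways.

```python
from datetime import date

# Hypothetical change log: (order id, field, new value, date of change).
history = [
    (1, "quantity", 10, date(2024, 1, 5)),
    (1, "price", 2.5, date(2024, 1, 5)),
    (1, "quantity", 8, date(2024, 2, 1)),
    (2, "quantity", 3, date(2024, 2, 10)),
]


def snapshot(changes, as_of):
    """Rebuild the state of every record as it was at the end of `as_of`."""
    state = {}
    # Replay the changes in chronological order, ignoring later ones.
    for key, field, value, changed_on in sorted(changes, key=lambda c: c[3]):
        if changed_on <= as_of:
            state.setdefault(key, {})[field] = value
    return state


print(snapshot(history, date(2024, 1, 31)))
print(snapshot(history, date(2024, 2, 28)))
```

Without the log, only the latest values would be available, and the state at the end of January could not be reconstructed.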

Looking at the way business processes are implemented (see ERP, CRM and other similar systems), the systems track the created, modified and a few other dates that allow only limited perspectives. The fields typically provide the perspectives we need for data analysis. For many processes, it would be interesting to track other events and maybe other values taken in between.

There is theoretical potential in tracking more detailed data, but also a complexity that’s difficult to transpose into useful information about the processes themselves. Despite tracking more data and the effort involved in such activities, processes can still behave like black boxes, especially when we have little or no information about the processes implemented in Information Systems.

There’s another important aspect - even if systems provide similar implementations of similar processes, the behavior of users can make an important difference. The best example is the behavior of people entering the relevant data only when a process closes and ignoring the steps happening in between (dates, price or quantity changes).

There is a lot of missing data/information not tracked by such a system, especially in what concerns users’ behavior. It’s true that such behavior can be tracked to some degree, though that happens only when data are modified physically. One can suppose that there are many activities happening outside of the system.

The data gathered represents only the projection of certain events, which might not represent the processes or users’ behavior accurately and completely. We have the illusion of transparency, though we work with black boxes. There can be a lot of effort happening outside of these borders.

Fortunately, we can handle oversimplified processes and data maintenance, though one can but wonder how many important things can be found beyond the oversimplifications we work with, respectively what we miss in the process.

18 February 2024

🧭Business Intelligence: A Software Engineer's Perspective (Part III: More of a One-Man Show)

Business Intelligence Series

Probably, in some organizations stories are still told about a hero who knew so much about the business and was so technically proficient that he/she was able to provide data-driven answers to most business questions. Unfortunately, the times of solo performances are long gone - the world moves too fast, there are too many questions looking for an answer, many of them requiring a solution before the problem is actually defined, a whole infrastructure is needed to harness the potential of technologies and data, the volume of knowledge required grows exponentially, etc.

One approach to handling the gap between the initial and the required knowledge in solving data-based problems is to build all the required knowledge in one person, either on the business or the technical side. More commonly, organizations hire a data analyst and build the knowledge in the respective resource, and the approach has great chances to work until the volume of work exceeds a person's limits. The data analyst is then forced to request that the workload be prioritized, which might work on certain occasions, while on others one needs to compromise on quality and/or do overtime, with all the issues deriving from this.

There are also situations in which the complexity of the problem exceeds a person's ability to handle it, and that's not necessarily a matter of intelligence but of knowhow. Some organizations respond with complexity to complexity, while others are more creative and break the complexity into manageable pieces. In both cases, more resources are needed to cover the knowledge and resource gap. Hiring more data analysts can get the work done though it's not a recipe for success. The more diverse the team, the higher the chances to succeed, though again it's a matter of creativity and of covering the knowledge gaps. Sometimes, it's more productive to use the resources already available in the organization, though this can involve other challenges.

Even if much of the knowledge gets documented, as soon as the data analyst leaves the organization a void is created until a similar resource is able to fill it. Organizations can better cope with these challenges if they disseminate the knowledge among data professionals and within the business. The more resources are involved, the higher the level of retention and the higher the chances of reusing the knowledge. However, the more people are involved, the higher the costs, especially those associated with the waste of effort.

Organizations can compromise by choosing 1-2 resources from each department to be involved in knowledge dissemination, ideally people with an affinity for data and technology. They become data citizens, people who use data, data processing and visualization to build solutions that support their job. Data citizens are expected to act as showmen in their knowledge domain and do their magic whenever such requirements arise.

Having a whole team of data citizens opens new opportunities for organizations, though besides domain knowledge and data literacy such resources will also need technical knowledge. Unfortunately, many people will reach their limits in this area. Besides the learning effort, understanding what good architecture, design and techniques mean is not for everybody, and here's where the concept of citizen data analyst or citizen scientist breaks down, independently of the tools used.

A data citizen's effort works best in data discovery, exploration and visualization scenarios, where the rapid creation of prototypes reduces the time from idea to solution. However, the results are personal solutions that need to be validated by a technical person, with pieces of the solutions possibly redesigned and moved until enterprise solutions result.

13 February 2024

🧭Business Intelligence: A One-Man Show (Part IV: Data Roles between Past and Future)

Business Intelligence Series

Databases nowadays are highly secure, reliable and available to a degree that reduces the involvement of DBAs to a minimum. Still, the more databases and servers are available in an organization, and the older they are, the bigger the need for dedicated resources to manage them. The number of DBAs involved tends to be proportional to the volume of work required by the database infrastructure. However, if the infrastructure is in the cloud, managed by the cloud providers, it's enough to have a person in the middle who manages the communication between cloud provider(s) and the organization. The person doesn't even need to be a DBA, even if some knowledge in the field is usually recommended.

The requirement for a Data Architect comes when there are several systems in place and there are multiple projects to integrate or build around the respective systems. It's also a question of what drives the respective requirement - is it the knowledge of data architectures, the supervision of changes, and/or the review of technical documents? The requirement is thus driven by the projects in progress and those waiting in the pipeline. Conversely, if all the systems are in the cloud, their integration is standardized or doesn't involve much architectural knowledge, the role becomes obsolete or at least not mandatory.

The Data Engineer role is a bit more challenging to define because it appeared in the context of cloud-based data architectures. It seems to be related to the movement of data via ETL/ELT pipelines and to data processing and preparation for the various needs. Data modeling or data presentation knowledge isn't mandatory, even if ideal. The role seems to overlap with the one of a Data Warehouse professional, be it a simple architect or developer. The role's know-how also depends on the tools involved, because it's one thing to build a solution based on a standard SQL Server, and another to use dedicated layers and architectures for the various purposes. The number of engineers should be proportional to the number of data entities involved.

Conversely, the existence of solutions that move and process the data as needed can reduce the volume of work. Moreover, the use of AI-driven tools like Copilot might shift the focus from data to prompt engineering.

The Data Analyst role is kind of a Cinderella - depending on the case, it can involve everything from requirements elicitation to report writing and the interpretation of results, respectively from data collection and data modeling to data visualization. If you have a special wish related to your data, just add it to the role! The number of analysts should be related to the number of issues existing in the organization where the collection and processing of data could make a difference. Conversely, the Data Citizen, even if it's not a role but a desirable state, could in theory absorb the Data Analyst role.

The Data Scientist is supposed to reveal the gems of knowledge hidden in the data by using Machine Learning, Statistics and other magical tools. The more data available, the higher the chances of finding something, even if probably statistically insignificant or incorrect. The role makes sense mainly in the context of big data, even if some opportunities might be available at smaller scales. The number of scientists depends on the number of projects focused on the big questions. Here again, one talks about the citizen Data Scientist.

The Information Designer role seems to be more about data visualization and presentation. It makes sense in organizations that rely heavily on visual content. All the other organizations can rely on the default settings of data visualization tools, independently of whether AI is involved or not.

18 April 2023

📊Graphical Representation: Graphics We Live By I (The Analytics Marathon)

Graphical Representation Series

In a diagram adapted from an older article [1], Brent Dykes, the author of "Effective Data Storytelling" [2], draws a parallel between Data Analytics and marathon running, considering that an organization must pass through the depicted milestones, the percentages representing how many organizations reach the respective milestones:

Data Analytics Marathon [1]

It's a nice visualization and the metaphor makes sense, given that running a marathon requires a long-term strategy to address the gaps between the current and the targeted physical/mental form and skillset, both for approaching a series of marathons and for each course individually. Similarly, implementing a Data Analytics initiative requires a Data Strategy that addresses the gaps between the current and the targeted state, as well as the many projects run to reach the organization's goals.

It makes sense, doesn't it? On the other side, the devil lies in the details, and frankly the diagram raises several questions when it is compared with the practices and processes existing in organizations. This doesn't mean that the diagram is wrong, just that it doesn't seem to reflect the reality entirely.

The percentages represent the author's perception of how many organizations reach the respective milestones, probably in a repeatable manner (as there are several projects). Thus, only 10% have a data strategy, 100% collect data, 80% of them prepare the data, while at the opposite side only 15% communicate insight, respectively 5% act on information.

Considering only the milestones, the diagram looks like a funnel and a capability maturity model (CMM). Typically, CMMs are more complex than this, evolving with technologies' capabilities. All the mentioned milestones have a set of capabilities that increase in complexity and that usually help differentiate organizations' maturity. Therefore, the model seems too simple for an actual categorization.

Typically, data collection has a specific scope, limited to surveys, interviews and/or research. However, the definition can be extended to the storage of data within organizations. Thus, data collection as the gathering of raw data is mainly done as part of their value-supporting processes, and given the degree of digitization, one can suppose that most organizations gather data for various purposes, even if maybe only a small part of it is digitized.

Even if many organizations build data warehouses, marts, lakehouses, meshes or whatever architecture might be en vogue these days, an important percentage of the reporting needs are covered by standard reports or reporting tools that directly access the source systems without data preparation or even data visualization. The first important question is what is understood by data analytics. Is it only the use of machine learning and statistical analysis? Is it limited to finding patterns and insights, or does it also include what is typically considered under the Business Intelligence umbrella?

Pragmatically thinking, Data Analytics should consider BI capabilities as well, as it's an extension of the current infrastructure to include analytic capabilities. On the other side, Data Warehousing and BI are considered together by DAMA as part of their Data Management methodology. Moreover, organizations may have a Data Strategy and a BI strategy, respectively a Data Analytics strategy, as they might have different goals, challenges and bodies to support them. To make it even more complicated, an organization might even consider all these important topics as part of the Data or even Information Governance, or consider BI or Analytics without Data Management.

So, a Data Strategy might or might not address Data Analytics at all. It's a matter of management philosophy, organizational structure, politics and other factors. Probably, having any strategy related to data should count. Even if a written and communicated data-related strategy is recommended for all medium to big organizations, only a small percentage of them have one, while small organizations might ignore the topic completely.

At least in the past, data analysis and its various subcomponents were performed before preparing and visualizing the data, or at least in parallel with data visualization. Frankly, it's a strange succession of steps. Or does it refer to exploratory data analysis (EDA) from a statistical perspective, which requires statistical experience to model and interpret the facts? Moreover, data exploration and discovery usually happen in the early stages.

The most puzzling step is the last one - what did the author intend with it? Ideally, data should be actionable, at least that's what one says about KPIs, OKRs and other metrics. Does it make sense to extend Data Analytics into the decision-making process? Where do a data professional's responsibilities end and what are those boundaries? Or does it refer to the actions that need to be performed by data professionals?

The natural step after communicating insight is for the management to take action and provide feedback. Furthermore, the decisions taken have an impact on the artifacts built, and a reevaluation of the business problem, assumptions and further components is needed. The many steps of analytics projects are iterative, some iterations affecting the Data Strategy as well. The diagram shows the process as linear, which is not the case.

For sure there's an interface between Data Analytics and Decision-Making and the processes associated with them, however there should be clear boundaries. E.g., it's a data professional's responsibility to make sure that the data/information is actionable and eventually advise upon it, though whether the entitled people act on it is a management topic. Not acting upon information is also a decision. Overstepping boundaries can put the data professional into a strange situation in which he becomes responsible and eventually accountable for an action not taken, which is unrealistic.

The final question - is the last mile representative of the analytical process? The challenge is not the analysis and communication of data but making sure that the feedback processes work and the changes are addressed correspondingly, that value is created continuously from the data analytics infrastructure, and that data-related risks and opportunities are addressed as soon as they are recognized.

Like any model, a diagram doesn't need to be correct to be useful, and might not even be wrong in the right context and argumentation. A data analytics CMM might allow better estimates and comparison between organizations, though it can easily become more complex to use. Between the two models lies probably a better solution for modeling the data analytics process.

Resources:
[1] Brent Dykes (2022) "Data Analytics Marathon: Why Your Organization Must Focus On The Finish", Forbes
[2] Brent Dykes (2019) "Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals"

13 June 2020

🧭☯Business Intelligence: Self-Service BI (The Good, the Bad and the Ugly)

Self-Service BI (SSBI) is a form of Business Intelligence (BI) in which the users are enabled and empowered to explore and analyze the data, respectively build reports and visualizations on their own, with minimal IT support.

The Good: Modern SSBI tools like Power BI, Tableau or Qlik Sense provide easy-to-use and rich functionality for data preparation, exploration, discovery, integration, modelling, visualization, and analysis. Moreover, they integrate the advances made in graphics, data storage and processing (e.g. in-memory processing, parallel processing), which allows addressing most data requirements. With just a few drag-and-drops users can display details, aggregate data, and identify trends and correlations in the data. Slice-and-dice or passthrough features allow navigating the data across dimensions and different levels of detail. In addition, the tools can leverage the existing data models available in data warehouses, data marts and other types of data repositories, including the rich set of open data available on the web.

With the right infrastructure, knowledge and skills, users can better understand and harness the business data, use them to address business questions, and make faster and smarter decisions rooted in data. SSBI offers the potential of increasing the value data have for the organization, while improving the time to value for data products (data models, reports, visualizations).

The Bad: In the 90s, products like MS Excel or Access allowed users to build personal solutions to address gaps existing in processes and reporting. In some cases, the personal solutions gained in importance and started to be used by more users, to the degree that they became essential for the business. Thus, these islands of data and knowledge started to become a nightmare for the IT department, as they were supposed to be kept alive and backed up. In addition, issues like security of data, inefficient data processing, duplication of data and effort, and different versions of truth urged the business to consolidate such solutions into standardized solutions.

Without an adequate strategy and a certain control over the outcomes of the SSBI initiatives, organizations risk reaching the same deplorable state, with SSBI initiatives having the potential to bring more damage than the issues they can solve. Insufficient data quality and integration, unrealistic expectations, the communication problems between business and IT, as well as insufficient training and support have the potential of making SSBI’s adoption more difficult.

The investment in adequate SSBI tool(s) might be small compared with the further changes that need to be made within the technical and logistical BI infrastructure. In addition, even if the role of IT is minimized, it doesn’t mean that IT needs to be left out of the picture. IT is still the owner of the IT infrastructure, and it still needs to oversee the self-service processes and the flow of data, information and knowledge within the organization. From infrastructure to skillset, there are aspects of SSBI that need to be addressed accordingly. The BI professional can’t be replaced entirely, though the scope of his work may shift to address new types of challenges.

Not understanding that SSBI initiatives are iterative and explorative in nature, and require time to bring value, can put unnecessary pressure on those taking part in them. Abandoning SSBI initiatives without attempting to address the issues and steer them in the right direction hinders an organization's and its employees’ potential to grow, with all the implications deriving from it.

The Ugly: Despite the benefits SSBI can bring, its adoption within organizations remains low. Whether it’s the business’ confidence in its own forces, or the inherent technical or logistical challenges, SSBI follows the BI trend of being a promise that seldom reaches its potential.

25 December 2018

🔭Data Science: Data Scientists (Just the quotes)

"[...] be wary of analysts that try to quantify the unquantifiable." (Ralph Keeney & Raiffa Howard, "Decisions with Multiple Objectives: Preferences and Value Trade-offs", 1976)

"Most people like to believe something is or is not true. Great scientists tolerate ambiguity very well. They believe the theory enough to go ahead; they doubt it enough to notice the errors and faults so they can step forward and create the new replacement theory. If you believe too much you'll never notice the flaws; if you doubt too much you won't get started. It requires a lovely balance." (Richard W Hamming, "You and Your Research", 1986)

"Many new data scientists tend to rush past it to get their data into a minimally acceptable state, only to discover that the data has major quality issues after they apply their (potentially computationally intensive) algorithm and get a nonsense answer as output. (Sandy Ryza, "Advanced Analytics with Spark: Patterns for Learning from Data at Scale", 2009)

"Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: 'there’s a lot of data, what can you make from it?'" (Mike Loukides, "What Is Data Science?", 2011)

"As data scientists, we prefer to interact with the raw data. We know how to import it, transform it, mash it up with other data sources, and visualize it. Most of your customers can’t do that. One of the biggest challenges of developing a data product is figuring out how to give data back to the user. Giving back too much data in a way that’s overwhelming and paralyzing is 'data vomit'. It’s natural to build the product that you would want, but it’s very easy to overestimate the abilities of your users. The product you want may not be the product they want." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"In an emergency, a data product that just produces more data is of little use. Data scientists now have the predictive tools to build products that increase the common good, but they need to be aware that building the models is not enough if they do not also produce optimized, implementable outcomes." (Jeremy Howard et al, "Designing Great Data Products", 2012)

"Smart data scientists don’t just solve big, hard problems; they also have an instinct for making big problems small." (Dhanurjay Patil, "Data Jujitsu: The Art of Turning Data into Product", 2012)

"More generally, a data scientist is someone who knows how to extract meaning from and interpret data, which requires both tools and methods from statistics and machine learning, as well as being human. She spends a lot of time in the process of collecting, cleaning, and munging data, because data is never clean. This process requires persistence, statistics, and software engineering skills - skills that are also necessary for understanding biases in the data, and for debugging logging output from code. Once she gets the data into shape, a crucial part is exploratory data analysis, which combines visualization and data sense. She’ll find patterns, build models, and algorithms - some with the intention of understanding product usage and the overall health of the product, and others to serve as prototypes that ultimately get baked back into the product. She may design experiments, and she is a critical part of data-driven decision making. She’ll communicate with team members, engineers, and leadership in clear language and with data visualizations so that even if her colleagues are not immersed in the data themselves, they will understand the implications." (Rachel Schutt, "Doing Data Science: Straight Talk from the Frontline", 2013)

"Unfortunately, creating an objective function that matches the true goal of the data mining is usually impossible, so data scientists often choose based on faith and experience." (Foster Provost, "Data Science for Business", 2013)

"[...] a data scientist role goes beyond the collection and reporting on data; it must involve looking at a business The role of a data scientist goes beyond the collection and reporting on data. application or process from multiple vantage points and determining what the main questions and follow-ups are, as well as recommending the most appropriate ways to employ the data at hand." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"In terms of characteristics, a data scientist has an inquisitive mind and is prepared to explore and ask questions, examine assumptions and analyse processes, test hypotheses and try out solutions and, based on evidence, communicate informed conclusions, recommendations and caveats to stakeholders and decision makers." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"Repeated observations of the same phenomenon do not always produce the same results, due to random noise or error. Sampling errors result when our observations capture unrepresentative circumstances, like measuring rush hour traffic on weekends as well as during the work week. Measurement errors reflect the limits of precision inherent in any sensing device. The notion of signal to noise ratio captures the degree to which a series of observations reflects a quantity of interest as opposed to data variance. As data scientists, we care about changes in the signal instead of the noise, and such variance often makes this problem surprisingly difficult." (Steven S Skiena, "The Data Science Design Manual", 2017)

"Data scientists should have some domain expertise. Most data science projects begin with a real-world, domain-specific problem and the need to design a data-driven solution to this problem. As a result, it is important for a data scientist to have enough domain expertise that they understand the problem, why it is important, an dhow a data science solution to the problem might fit into an organization’s processes. This domain expertise guides the data scientist as she works toward identifying an optimized solution." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"A data scientist should be able to wrangle, mung, manipulate, and consolidate datasets before performing calculations on that data that help us to understand it. Analysis is a broad term, but it's clear that the end result is knowledge of your dataset that you didn't have before you started, no matter how basic or complex. [...] A data scientist usually has to be able to apply statistical, mathematical, and machine learning models to data in order to explain it or perform some sort of prediction." (Andrew P McMahon, "Machine Learning Engineering with Python", 2021)

"Data scientists are advanced in their technical skills. They like to do coding, statistics, and so forth. In its purest form, data science is where an individual uses the scientific method on data." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"The ideal data scientist is a multi-disciplinary person, persistent in pursuing the solution." (Anil Maheshwari, "Data Analytics Made Accessible", 2021)

"Overall [...] everyone also has a need to analyze data. The ability to analyze data is vital in its understanding of product launch success. Everyone needs the ability to find trends and patterns in the data and information. Everyone has a need to ‘discover or reveal (something) through detailed examination’, as our definition says. Not everyone needs to be a data scientist, but everyone needs to drive questions and analysis. Everyone needs to dig into the information to be successful with diagnostic analytics. This is one of the biggest keys of data literacy: analyzing data." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"A data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. Data scientists not only are adept at working with data, but appreciate data itself as a first-class product." (Hillary Mason)

"A data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician." (Josh Blumenstock) [attributed]

"All businesses could use a garden where Data Scientists plant seeds of possibility and water them with collaboration." (Damian Mingle)

"Data scientist (noun): Person who is better at statistics than any software engineer and better at software engineering than any statistician." (Josh Wills)

"Data Scientists should recall innovation often times is not providing fancy algorithms, but rather value to the customer." (Damian Mingle)

"Data Scientists should refuse to be defined by someone else's vision of what's possible." (Damian Mingle)

24 December 2018

🔭Data Science: Statistics (Just the Quotes)

"There are two aspects of statistics that are continually mixed, the method and the science. Statistics are used as a method, whenever we measure something, for example, the size of a district, the number of inhabitants of a country, the quantity or price of certain commodities, etc. […] There is, moreover, a science of statistics. It consists of knowing how to gather numbers, combine them and calculate them, in the best way to lead to certain results. But this is, strictly speaking, a branch of mathematics." (Alphonse P de Candolle, "Considerations on Crime Statistics", 1833)

"A judicious man looks at Statistics, not to get knowledge, but to save himself from having ignorance foisted on him." (Thomas Carlyle, "Chartism", 1840)

"What constitutes the well-being of a man? Many things; of which the wages he gets, and the bread he buys with them, are but one preliminary item. Grant, however, that the
wages were the whole; that once knowing the wages and the price of bread, we know all ; then what are the wages? Statistic Inquiry, in its present unguided condition, cannot
tell. The average rate of day's wages is not correctly ascertained for any portion of this country; not only not for half-centuries, it is not even ascertained anywhere for decades
or years: far from instituting comparisons with the past, the present itself is unknown to us." (Thomas Carlyle, "Chartism", 1840)

"Statistics has then for its object that of presenting a faithful representation of a state at a determined epoch." (Adolphe Quetelet, 1849)

"Most statistical arguments depend upon a few figures picked out at random." (William S Jevons, [letter to Richard Hutton] 1863)

"[Statistics] are the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the Science of man." (Sir Francis Galton, "Natural Inheritance", 1889)

"[…] statistics is the science of the measurement of the social organism, regarded as a whole, in all its manifestations." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"Statistics may rightly be called the science of averages. […] Great numbers and the averages resulting from them, such as we always obtain in measuring social phenomena, have great inertia. […] It is this constancy of great numbers that makes statistical measurement possible. It is to great numbers that statistical measurement chiefly applies." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"Statistics may, for instance, be called the science of counting. Counting appears at first sight to be a very simple operation, which any one can perform or which can be done automatically; but, as a matter of fact, when we come to large numbers, e.g., the population of the United Kingdom, counting is by no means easy, or within the power of an individual; limits of time and place alone prevent it being so carried out, and in no way can absolute accuracy be obtained when the numbers surpass certain limits." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"Statistics may be defined as numerical statements of facts by means of which large aggregates are analyzed, the relations of individual units to their groups are ascertained, comparisons are made between groups, and continuous records are maintained for comparative purposes." (Melvin T Copeland. "Statistical Methods" [in: Harvard Business Studies, Vol. III, Ed. by Melvin T Copeland, 1917])

"Statistics may be regarded as (i) the study of populations, (ii) as the study of variation, and (iii) as the study of methods of the reduction of data." (Sir Ronald A Fisher, "Statistical Methods for Research Worker", 1925)

"The conception of statistics as the study of variation is the natural outcome of viewing the subject as the study of populations; for a population of individuals in all respects identical is completely described by a description of anyone individual, together with the number in the group. The populations which are the object of statistical study always display variations in one or more respects. To speak of statistics as the study of variation also serves to emphasise the contrast between the aims of modern statisticians and those of their predecessors." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"The statistical examination of a body of data is thus logically similar to the general alternation of inductive and deductive methods throughout the sciences. A hypothesis is conceived and defined with all necessary exactitude; its logical consequences are ascertained by a deductive argument; these consequences are compared with the available observations; if these are completely in, accord with the deductions, the hypothesis is justified at least until fresh and more stringent observations are available." (Sir Ronald A Fisher, "Statistical Methods for Research Workers", 1925)

"Statistics is a scientific discipline concerned with collection, analysis, and interpretation of data obtained from observation or experiment. The subject has a coherent structure based on the theory of Probability and includes many different procedures which contribute to research and development throughout the whole of Science and Technology." (Egon Pearson, 1936)

"All statistical analysis in business must aim at the control of action. The possible conclusions are: 1. Certain action must be taken. 2. No action is required. 3. Certain tendencies must be watched. 4. The analysis is not significant and either (a) certain further facts are required, or (b) there are no indications that further facts should be obtained." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1938)

"[Statistics] is both a science and an art. It is a science in that its methods are basically systematic and have general application; and an art in that their successful application depends to a considerable degree on the skill and special experience of the statistician, and on his knowledge of the field of application, e.g. economics." (Leonard H C Tippett, "Statistics", 1943)

"Statistics is the branch of scientific method which deals with the data obtained by counting or measuring the properties of populations of natural phenomena. In this definition 'natural phenomena' includes all the happenings of the external world, whether human or not " (Sir Maurice G Kendall, "Advanced Theory of Statistics", Vol. 1, 1943)

"To some people, statistics is ‘quartered pies, cute little battleships and tapering rows of sturdy soldiers in diversified uniforms’. To others, it is columns and columns of numerical facts. Many regard it as a branch of economics. The beginning student of the subject considers it to be largely mathematics." (The Editors, "Statistics, The Physical Sciences and Engineering", The American Statistician, Vol. 2, No. 4, 1948)

"For the most part, Statistics is a method of investigation that is used when other methods are of no avail; it is often a last resort and a forlorn hope. A statistical analysis, properly conducted, is a delicate dissection of uncertainties, a surgery of suppositions. The surgeon must guard carefully against false incisions with his scalpel. Very often he has to sew up the patient as inoperable. The public knows too little about the statistician as a conscientious and skilled servant of true science." (Michael J Moroney, "Facts from Figures", 1951)

"Statistics is the name for that science and art which deals with uncertain inferences - which uses numbers to find out something about nature and experience." (Warren Weaver, 1952)

"Statistics is the fundamental and most important part of inductive logic. It is both an art and a science, and it deals with the collection, the tabulation, the analysis and interpretation of quantitative and qualitative measurements. It is concerned with the classifying and determining of actual attributes as well as the making of estimates and the testing of various hypotheses by which probable, or expected, values are obtained. It is one of the means of carrying on scientific research in order to ascertain the laws of behavior of things - be they animate or inanimate. Statistics is the technique of the Scientific Method." (Bruce D Greenschields & Frank M Weida, "Statistics with Applications to Highway Traffic Analyses", 1952)

"In brief, the greatest care must be exercised in using any statistical data, especially when it has been collected by another agency. At all times, the statistician who uses published data must ask himself, by whom were the data collected, how and for what purpose?" (Alfred R Ilersic, "Statistics", 1959)

"Poor statistics may be attributed to a number of causes. There are the mistakes which arise in the course of collecting the data, and there are those which occur when those data are being converted into manageable form for publication. Still later, mistakes arise because the conclusions drawn from the published data are wrong. The real trouble with errors which arise during the course of collecting the data is that they are the hardest to detect." (Alfred R Ilersic, "Statistics", 1959)

"The statistics themselves prove nothing; nor are they at any time a substitute for logical thinking. There are […] many simple but not always obvious snags in the data to contend with. Variations in even the simplest of figures may conceal a compound of influences which have to be taken into account before any conclusions are drawn from the data." (Alfred R Ilersic, "Statistics", 1959)

"Many people use statistics as a drunkard uses a street lamp - for support rather than illumination. It is not enough to avoid outright falsehood; one must be on the alert to detect possible distortion of truth. One can hardly pick up a newspaper without seeing some sensational headline based on scanty or doubtful data." (Anna C Rogers, "Graphic Charts Handbook", 1961)

"[Statistics] is concerned with things we can count. In so far as things, persons, are unique or ill-defi ned, statistics are meaningless and statisticians silenced; in so far as things are similar and definite - so many workers over 25, so many nuts and bolts made during December - they can be counted and new statistical facts are born." (Maurice S Bartlett, "Essays on Probability and Statistics", 1962)

"Statistics is the branch of scientific method which deals with the data obtained by counting or measuring the properties of populations of natural phenomena." (Sir Maurice G Kendall & Alan Stuart, "The Advanced Theory of Statistics", 1963)

"Statistics may be defined as the discipline concerned with the treatment of numerical data derived from groups of individuals." (Peter Armitage, "Statistical Methods in Medical Research", 1971)

"We provisionally define statistics as the study of how information should be employed to reflect on, and give guidance for action in, a practical situation involving uncertainty." (Vic Barnett, "Comparative Statistical Inference" 2nd Ed., 1982)

"Statistics is a tool. In experimental science you plan and carry out experiments, and then analyse and interpret the results. To do this you use statistical arguments and calculations. Like any other tool - an oscilloscope, for example, or a spectrometer, or even a humble spanner - you can use it delicately or clumsily, skillfully or ineptly. The more you know about it and understand how it works, the better you will be able to use it and the more useful it will be." (Roger Barlow, "Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences", 1989)

"Sometimes the proprietors of a bad model claim that parts of it are facts, not just beliefs. Evaluation then amounts to determining if facts support the claims, and disciplines like statistics have tools for this task. The difficulty of using statistical tools will vary depending on the problem." (James S Hodges, "Six (or So) Things You Can Do with a Bad Model", 1991)

"The science of statistics may be described as exploring, analyzing and summarizing data; designing or choosing appropriate ways of collecting data and extracting information from them; and communicating that information. Statistics also involves constructing and testing models for describing chance phenomena. These models can be used as a basis for making inferences and drawing conclusions and, finally, perhaps for making decisions." (Fergus Daly et al, "Elements of Statistics", 1995)

"Statistics is a general intellectual method that applies wherever data, variation, and chance appear. It is a fundamental method because data, variation and chance are omnipresent in modern life. It is an independent discipline with its own core ideas rather than, for example, a branch of mathematics. […] Statistics offers general, fundamental, and independent ways of thinking." (David S Moore, "Statistics among the Liberal Arts", Journal of the American Statistical Association, 1998)

"Statistics is the branch of mathematics that uses observations and measurements called data to analyze, summarize, make inferences, and draw conclusions based on the data gathered." (Allan G Bluman, "Probability Demystified", 2005)

"Sometimes the most important fit statistic you can get is ‘convergence not met’ - it can tell you something is wrong with your model." (Oliver Schabenberger, "Applied Statistics in Agriculture Conference", 2006)

"Put simply, statistics is a range of procedures for gathering, organizing, analyzing and presenting quantitative data. […] Essentially […], statistics is a scientific approach to analyzing numerical data in order to enable us to maximize our interpretation, understanding and use. This means that statistics helps us turn data into information; that is, data that have been interpreted, understood and are useful to the recipient. Put formally, for your project, statistics is the systematic collection and analysis of numerical data, in order to investigate or discover relationships among phenomena so as to explain, predict and control their occurrence." (Reva B Brown & Mark Saunders, "Dealing with Statistics: What You Need to Know", 2008)

"Statistics is the art of learning from data. It is concerned with the collection of data, their subsequent description, and their analysis, which often leads to the drawing of conclusions." (Sheldon M Ross, "Introductory Statistics" 3rd Ed., 2009)

"What is so unconventional about the statistical way of thinking? First, statisticians do not care much for the popular concept of the statistical average; instead, they fixate on any deviation from the average. They worry about how large these variations are, how frequently they occur, and why they exist. [...] Second, variability does not need to be explained by reasonable causes, despite our natural desire for a rational explanation of everything; statisticians are frequently just as happy to pore over patterns of correlation. [...] Third, statisticians are constantly looking out for missed nuances: a statistical average for all groups may well hide vital differences that exist between these groups. Ignoring group differences when they are present frequently portends inequitable treatment. [...] Fourth, decisions based on statistics can be calibrated to strike a balance between two types of errors. Predictably, decision makers have an incentive to focus exclusively on minimizing any mistake that could bring about public humiliation, but statisticians point out that because of this bias, their decisions will aggravate other errors, which are unnoticed but serious. [...] Finally, statisticians follow a specific protocol known as statistical testing when deciding whether the evidence fits the crime, so to speak. Unlike some of us, they don’t believe in miracles. In other words, if the most unusual coincidence must be contrived to explain the inexplicable, they prefer leaving the crime unsolved." (Kaiser Fung, "Numbers Rule the World", 2010)

"Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions." (Ron Larson & Betsy Farber, "Elementary Statistics: Picturing the World" 5th Ed., 2011)

"Statistics is the discipline of using data samples to support claims about populations." (Allen B Downey, "Think Stats: Probability and Statistics for Programmers", 2011)

"[… ] statistics is about understanding the role that variability plays in drawing conclusions based on data. […] Statistics is not about numbers; it is about data - numbers in context. It is the context that makes a problem meaningful and something worth considering." (Roxy Peck et al, "Introduction to Statistics and Data Analysis" 4th Ed., 2012)

"Statistics is the scientific discipline that provides methods to help us make sense of data. […] The field of statistics teaches us how to make intelligent judgments and informed decisions in the presence of uncertainty and variation." (Roxy Peck & Jay L Devore, "Statistics: The Exploration and Analysis of Data" 7th Ed, 2012)

"[…] statistics is a method of pursuing truth. At a minimum, statistics can tell you the likelihood that your hunch is true in this time and place and with these sorts of people. This type of pursuit of truth, especially in the form of an event’s future likelihood, is the essence of psychology, of science, and of human evolution." (Arthhur Aron et al, "Statistics for Phsychology" 6th Ed., 2012)

"Statistics is the scientific discipline that provides methods to help us make sense of data. Statistical methods, used intelligently, offer a set of powerful tools for gaining insight into the world around us." (Roxy Peck et al, "Introduction to Statistics and Data Analysis" 4th Ed., 2012)

"The four questions of data analysis are the questions of description, probability, inference, and homogeneity. [...] Descriptive statistics are built on the assumption that we can use a single value to characterize a single property for a single universe. […] Probability theory is focused on what happens to samples drawn from a known universe. If the data happen to come from different sources, then there are multiple universes with different probability models. [...] Statistical inference assumes that you have a sample that is known to have come from one universe." (Donald J Wheeler," Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Statistics is the art and science of designing studies and analyzing the data that those studies produce. Its ultimate goal is translating data into knowledge and understanding of the world around us. In short, statistics is the art and science of learning from data." (Alan Agresti & Christine Franklin, "Statistics: The Art and Science of Learning from Data" 3rd Ed., 2013)

"Statistics is a science that helps us make decisions and draw conclusions in the presence of variability." (Douglas C Montgomery & George C Runger, "Applied Statistics and Probability for Engineers" 6th Ed., 2014)

"Statistics is an integral part of the quantitative approach to knowledge. The field of statistics is concerned with the scientific study of collecting, organizing, analyzing, and drawing conclusions from data." (Kandethody M Ramachandran & Chris P Tsokos, "Mathematical Statistics with Applications in R" 2nd Ed., 2015)

"Statistics can be defined as a collection of techniques used when planning a data collection, and when subsequently analyzing and presenting data." (Birger S Madsen, "Statistics for Non-Statisticians", 2016)

"Statistics is the science of collecting, organizing, and interpreting numerical facts, which we call data. […] Statistics is the science of learning from data." (Moore McCabe & Alwan Craig, "The Practice of Statistics for Business and Economics" 4th Ed., 2016)

"Statistics is the science of collecting, organizing, summarizing, and analyzing information to draw conclusions or answer questions. In addition, statistics is about providing a measure of confidence in any conclusions." (Michael Sullivan, "Statistics: Informed Decisions Using Data", 5th Ed., 2017)

"Estimates based on data are often uncertain. If the data were intended to tell us something about a wider population (like a poll of voting intentions before an election), or about the future, then we need to acknowledge that uncertainty. This is a double challenge for data visualization: it has to be calculated in some meaningful way and then shown on top of the data or statistics without making it all too cluttered." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"I believe that the backlash against statistics is due to four primary reasons. The first, and easiest for most people to relate to, is that even the most basic concepts of descriptive and inferential statistics can be difficult to grasp and even harder to explain. […] The second cause for vitriol is that even well-intentioned experts misapply the tools and techniques of statistics far too often, myself included. Statistical pitfalls are numerous and tough to avoid. When we can't trust the experts to get it right, there's a temptation to throw the baby out with the bathwater. The third reason behind all the hate is that those with an agenda can easily craft statistics to lie when they communicate with us […] And finally, the fourth cause is that often statistics can be perceived as cold and detached, and they can fail to communicate the human element of an issue." (Ben Jones, "Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations", 2020)

"Ideally, a decision maker or a forecaster will combine the outside view and the inside view - or, similarly, statistics plus personal experience. But it’s much better to start with the statistical view, the outside view, and then modify it in the light of personal experience than it is to go the other way around. If you start with the inside view you have no real frame of reference, no sense of scale - and can easily come up with a probability that is ten times too large, or ten times too small." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"If we don’t understand the statistics, we’re likely to be badly mistaken about the way the world is. It is all too easy to convince ourselves that whatever we’ve seen with our own eyes is the whole truth; it isn’t. Understanding causation is tough even with good statistics, but hopeless without them. [...] And yet, if we understand only the statistics, we understand little. We need to be curious about the world that we see, hear, touch, and smell, as well as the world we can examine through a spreadsheet." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"The contradiction between what we see with our own eyes and what the statistics claim can be very real. […] The truth is more complicated. Our personal experiences should not be dismissed along with our feelings, at least not without further thought. Sometimes the statistics give us a vastly better way to understand the world; sometimes they mislead us. We need to be wise enough to figure out when the statistics are in conflict with everyday experience - and in those cases, which to believe." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"The whole discipline of statistics is built on measuring or counting things. […] it is important to understand what is being measured or counted, and how. It is surprising how rarely we do this. Over the years, as I found myself trying to lead people out of statistical mazes week after week, I came to realize that many of the problems I encountered were because people had taken a wrong turn right at the start. They had dived into the mathematics of a statistical claim - asking about sampling errors and margins of error, debating if the number is rising or falling, believing, doubting, analyzing, dissecting - without taking the ti- me to understand the first and most obvious fact: What is being measured, or counted? What definition is being used?" (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

More quotes on "Statistics" at the-web-of-knowledge.blogspot.com.
