Showing posts with label analytics. Show all posts
Showing posts with label analytics. Show all posts

18 April 2023

📊Graphical Representation: Graphics We Live By I (The Analytics Marathon)

Graphical Representation
Graphical Representation Series

In a diagram adapted from an older article [1], Brent Dykes, the author of "Effective Data Storytelling" [2], makes a parallel between Data Analytics and marathon running, considering that an organization must pass through the depicted milestones, the percentages representing how many organizations reach the respective milestones:



It's a nice visualization and the metaphor makes sense given that running a marathon requires a long-term strategy to address the gaps between the current and targeted physical/mental form and skillset required to run a marathon, respectively for approaching a set of marathons and each course individually. Similarly, implementing a Data Analytics initiative requires a Data Strategy supposed to address the gaps existing between current and targeted state of art, respectively the many projects run to reach organization's goals. 

It makes sense, isn't it? On the other side the devil lies in details and frankly the diagram raises several questions when is compared with practices and processes existing in organizations. This doesn't mean that the diagram is wrong, just that it doesn't seem to reflect entirely the reality. 

The percentages represent author's perception of how many organizations reach the respective milestones, probably in an repeatable manner (as there are several projects). Thus, only 10% have a data strategy, 100% collect data, 80% of them prepare the data, while at the opposite side only 15% communicate insight, respectively 5% act on information.

Considering only the milestones the diagram looks like a funnel and a capability maturity model (CMM). Typically, the CMMs are more complex than this, evolving with technologies' capabilities. All the mentioned milestones have a set of capabilities that increase in complexity and that usually help differentiated organization's maturity. Therefore, the model seems too simple for an actual categorization.  

Typically, data collection has a specific scope resuming to surveys, interviews and/or research. However, the definition can be extended to the storage of data within organizations. Thus, data collection as the gathering of raw data is mainly done as part of their value supporting processes, and given the degree of digitization of data, one can suppose that most organizations gather data for the different purposes, even if only a small part are maybe digitized.

Even if many organizations build data warehouses, marts, lakehouses, mashes or whatever architecture might be en-vogue these days, an important percentage of the reporting needs are covered by standard reports or reporting tools that access directly the source systems without data preparation or even data visualization. The first important question is what is understood by data analytics? Is it only the use of machine learning and statistical analysis? Does it resume only to pattern and insight finding or does it includes also what is typically considered under the Business Intelligence umbrella? 

Pragmatically thinking, Data Analytics should consider BI capabilities as well as its an extension of the current infrastructure to consider analytic capabilities. On the other side Data Warehousing and BI are considered together by DAMA as part of their Data Management methodology. Moreover, organizations may have a Data Strategy and a BI strategy, respectively a Data Analytics strategy as they might have different goals, challenges and bodies to support them. To make it even more complicated, an organization might even consider all these important topics as part of the Data or even Information Governance, or consider BI or Analytics without Data Management. 

So, a Data Strategy might or might not address Data Analytics at all. It's a matter of management philosophy, organizational structure, politics and other factors. Probably, having a strayegy related to data should count. Even if a written and communicated data-related strategy is recommended for all medium to big organizations, only a small percentage of them have one, while small organizations might ignore the topic completely.

At least in the past, data analysis and its various subcomponents was performed before preparing and visualizing the data, or at least in parallel with data visualization. Frankly, it's a strange succession of steps. Or does it refers to exploratory data analysis (EDA) from a statistical perspective, which requires statistical experience to model and interpret the facts? Moreover, data exploration and discovery happen usually in the early stages.

The most puzzling step is the last one - what does the author intended with it? Ideally, data should be actionable, at least that's what one says about KPIs, OKRs and other metrics. Does it make sense to extend Data Analytics into the decision-making process? Where does a data professional's responsibilities end and which are those boundaries? Or does it refer to the actions that need to be performed by data professionals? 

The natural step after communicating insight is for the management to take action and provide feedback. Furthermore, the decisions taken have impact on the artifacts built and a reevaluation of the business problem, assumptions and further components is needed. The many steps of analytics projects are iterative, some iterations affecting the Data Strategy as well. The diagram shows the process as linear, which is not the case.

For sure there's an interface between Data Analytics and Decision-Making and the processes associated with them, however there should be clear boundaries. E.g., it's a data professional's responsibility to make sure that the data/information is actionable and eventually advise upon it, though whether the entitled people act on it is a management topic. Not acting upon an information is also a decision. Overstepping boundaries can put the data professional into a strange situation in which he becomes responsible and eventually accountable for an action not taken, which is utopic.

The final question - is the last mile representative for the analytical process? The challenge is not the analysis and communication of data but of making sure that the feedback processes work and the changes are addressed correspondingly, that value is created continuously from the data analytics infrastructure, that data-related risks and opportunities are addressed as soon they are recognized. 

As any model, a diagram doesn't need to be correct to be useful and might not be even wrong in the right context and argumentation. A data analytics CMM might allow better estimates and comparison between organizations, though it can easily become more complex to use. Between the two models lies probably a better solution for modeling the data analytics process.

Resources:
[1] Brent Dykes (2022) "Data Analytics Marathon: Why Your Organization Must Focus On The Finish", Forbes (link)
[2] Brent Dykes (2019) Effective Data Storytelling: How to Drive Change with Data, Narrative and Visuals (link)

20 March 2021

🧭Business Intelligence: New Technologies, Old Challenges (Part II - ETL vs. ELT)

 

Business Intelligence

Data lakes and similar cloud-based repositories drove the requirement of loading the raw data before performing any transformations on the data. At least that’s the approach the new wave of ELT (Extract, Load, Transform) technologies use to handle analytical and data integration workloads, which is probably recommendable for the mentioned cloud-based contexts. However, ELT technologies are especially relevant when is needed to handle data with high velocity, variance, validity or different value of truth (aka big data). This because they allow processing the workloads over architectures that can be scaled with workloads’ demands.

This is probably the most important aspect, even if there can be further advantages, like using built-in connectors to a wide range of sources or implementing complex data flow controls. The ETL (Extract, Transform, Load) tools have the same capabilities, maybe reduced to certain data sources, though their newer versions seem to bridge the gap.

One of the most stressed advantages of ELT is the possibility of having all the (business) data in the repository, though these are not technological advantages. The same can be obtained via ETL tools, even if this might involve upon case a bigger effort, effort depending on the functionality existing in each tool. It’s true that ETL solutions have a narrower scope by loading a subset of the available data, or that transformations are made before loading the data, though this depends on the scope considered while building the data warehouse or data mart, respectively the design of ETL packages, and both are a matter of choice, choices that can be traced back to business requirements or technical best practices.

Some of the advantages seen are context-dependent – the context in which the technologies are put, respectively the problems are solved. It is often imputed to ETL solutions that the available data are already prepared (aggregated, converted) and new requirements will drive additional effort. On the other side, in ELT-based solutions all the data are made available and eventually further transformed, but also here the level of transformations made depends on specific requirements. Independently of the approach used, the data are still available if needed, respectively involve certain effort for further processing.

Building usable and reliable data models is dependent on good design, and in the design process reside the most important challenges. In theory, some think that in ETL scenarios the design is done beforehand though that’s not necessarily true. One can pull the raw data from the source and build the data models in the target repositories.

Data conversion and cleaning is needed under both approaches. In some scenarios is ideal to do this upfront, minimizing the effect these processes have on data’s usage, while in other scenarios it’s helpful to address them later in the process, with the risk that each project will address them differently. This can become an issue and should be ideally addressed by design (e.g. by building an intermediate layer) or at least organizationally (e.g. enforcing best practices).

Advancing that ELT is better just because the data are true (being in raw form) can be taken only as a marketing slogan. The degree of truth data has depends on the way data reflects business’ processes and the way data are maintained, while their quality is judged entirely on their intended use. Even if raw data allow more flexibility in handling the various requests, the challenges involved in processing can be neglected only under the consequences that follow from this.

Looking at the analytics and data integration cloud-based technologies, they seem to allow both approaches, thus building optimal solutions relying on professionals’ wisdom of making appropriate choices.

Previous Post <<||>>Next Post

11 March 2021

💠🗒️Microsoft Azure: Azure Data Factory [Notes]

Azure Data Factory - Concept Map

Acronyms:
Azure Data Factory (ADF)
Continuous Integration/Continuous Deployment (CI/CD)
Extract Load Transform (ELT)
Extract Transform Load (ETL)
Independent Software Vendors (ISVs)
Operations Management Suite (OMS)
pay-as-you-go (PAYG)
SQL Server Integration Services (SSIS)

Resources:
[1] Microsoft (2020) "Microsoft Business Intelligence and Information Management: Design Guidance", by Rod College
[2] Microsoft (2021) Azure Data Factory [source]
[3] Microsoft (2018) Azure Data Factory: Data Integration in the Cloud [source]
[4] Microsoft (2021) Integrate data with Azure Data Factory or Azure Synapse Pipeline [source]
[10] Coursera (2021) Data Processing with Azure [source]
[11] Sudhir Rawat & Abhishek Narain (2019) "Understanding Azure Data Factory: Operationalizing Big Data and Advanced Analytics Solutions"

03 January 2020

🗄️Data Management: Data Literacy (Part I: A Second Language)

Data Management

At the Gartner Data & Analytics Summit that took place in 2018 in Grapevine, Texas, it was reiterated the importance of data literacy for taking advantage of the emergence of data analytics, artificial intelligence (AI) and machine learning (ML) technologies. Gartner expected then that by 2020, 80% of organizations will initiate deliberate competency development in the field of data literacy [1] – or how they put it – learning to ‘speak data’ as a ‘second language’.

Data literacy is typically defined as the ability to read, work with, analyze, and argue with data. Sure, these form the blocks of data literacy, though what I’m missing from this definition is the ability to understand the data, even if understanding should be the outcome of reading, and the ability to put data into the context of business problems, even if the analyzes of data could involve this later aspect too.

Understanding has several aspects: understanding the data structures available within an organization, understanding the problems with data (including quality, governance, privacy and security), respectively understanding how the data are linked to the business processes. These aspects go beyond the simple ability included in the above definition, which from my perspective doesn’t include the particularities of an organization (data structure, data quality and processes) – the business component. This is reflected in one of the problems often met in the BI/data analytics industry – the solutions developed by the various service providers don’t reflect organizations’ needs, one of the causes being the inability to understand the business on segments or holistically.  

Putting data into context means being able to use the respective data in answering stringent business problems. A business problem needs to be first correctly defined and this requires a deep understanding of the business. Then one needs to identify the data that could help finding the answers to the problem, respectively of building one or more models that would allow elaborating further theories and performing further simulations. This is an ongoing process in which the models built are further enhanced, when possible, or replaced by better ones.

Probably the comparison with a second language is only partially true. One can learn a second language and argue in the respective language, though it doesn’t mean that the argumentations will be correct or constructive as long the person can’t do the same in the native language. Moreover, one can have such abilities in the native or a secondary language, but not be able do the same in what concerns the data, as different skillsets are involved. This aspect can make quite a difference in a business scenario. One must be able also to philosophize, think critically, as well to understand the forms of communication and their rules in respect to data.

To philosophize means being able to understand the causality and further relations existing within the business and think critically about them. Being able to communicate means more than being able to argue – it means being able to use effectively the communication tools – communication channels, as well the methods of representing data, information and knowledge. In extremis one might even go beyond the basic statistical tools, stepping thus in what statistical literacy is about. In fact, the difference between the two types of literacy became thinner, the difference residing in the accent put on their specific aspects.

These are the areas which probably many professionals lack. Data literacy should be the aim, however this takes time and is a continuous iterative process that can take years to reach maturity. It’s important for organizations to start addressing these aspects, progress in small increments and learn from the experience accumulated.

Previous Post <<||>> Next Post

References:
[1] Gartner (2018) How data and analytics leaders learn to master information as a second language, by Christy Pettey (link

29 November 2019

🧭Business Intelligence: Perspectives (Part V: Data Soup - From BI to Analytics)

Business Intelligence Series
Business Intelligence Series

The days when everything was reduced to simple terminology like reports or queries are gone. One can see it in the market trends related to reporting or data, as well in the jargon soup the IT people use on the daily basis – Business Intelligence (BI), Data Mining (DM), Analytics, Data Science, Data Warehousing (DW), Machine Learning (ML), Artificial Intelligence (AI) and so on. What’s more confusing for the users and other spectators is the easiness with which all these concepts are used, sometimes interchangeably, and often it feels like nothing makes sense.

BI is used nowadays to refer to the technologies, architectures, methodologies, processes and practices used to transform data into what is desired as meaningful and useful information.  From its early beginnings in the 60s, the intelligence from Business Intelligence (BI) refers to the ability to apprehend the interrelationships of the facts to be processed (aka data) in such a way as to guide action towards a desired goal.

The main purpose of BI was and is to guide actions and provide a solid basis for decision making, aspect not necessarily reflected in the way organizations use their BI infrastructure. Except basic operational/tactical/strategic reports and metrics that reflect to a higher or lower degree organizations’ goals, BI often fails to provide the expected value. The causes are multiple ranging from an organizations maturity in devising a strategy and dividing it into SMART goals and objectives, to the misuse of technologies for the wrong purposes.

Despite the basic data analysis techniques, the rich visualizations and navigation functionality, BI fails often to deliver by itself more than ordinary and already known information. Information becomes valuable when it brings novelty, when it can be easily transformed into knowledge, or even better, when knowledge is extracted directly. To address the limitations of the BI a series of techniques appeared in parallel and coined in the 90s as Data Mining.

Mining is the process of obtaining something valuable from a resource. What DM tries to achieve as process is the extraction of knowledge in form patterns from the data by categorizing, clustering, identifying dependencies or anomalies. When compared with data analysis, the main characteristics of DM is the fact that is used to test models and hypotheses, and that it uses a set of semiautomatic and automatic out-of-the-box statistics packages, AI or predictive algorithms with applicability in different areas – Web,  text, speech, business processes, etc.

DM proved to be useful by allowing to build models rooted in historical data, models which allowed predicting outcome or behavior, however the models are pretty basic and there’s always a threshold beyond which they can’t go. Furthermore, the costs of preparing the data and of the needed infrastructure seem to be high compared with the benefits data mining provides. There are scenarios in which DM proves to bring benefit, while in others it raises more challenges than can solve. Privacy, security, misuse of information and the blind use of techniques without understanding the data or the models behind, are just some of such challenges.  

Information seems too common, while knowledge can become expensive to obtain. The middle way between the two found its future into another buzzword – analyticsthe systematic analysis of data or statistics using specific mathematical methods. Analytics combine the agility of data analysis techniques with the power of predictive and prescriptive techniques used in DM in discovering patterns into the data. Analytics attempts to identify why it happens by using a chain of inferences resulted from data’s analyzing and understanding. From another perspective analytics seems to be a rebranded and slightly enhanced version of BI.

31 December 2018

🔭Data Science: Big Data (Just the Quotes)

"If we gather more and more data and establish more and more associations, however, we will not finally find that we know something. We will simply end up having more and more data and larger sets of correlations." (Kenneth N Waltz, "Theory of International Politics Source: Theory of International Politics", 1979)

“There are those who try to generalize, synthesize, and build models, and there are those who believe nothing and constantly call for more data. The tension between these two groups is a healthy one; science develops mainly because of the model builders, yet they need the second group to keep them honest.” (Andrew Miall, “Principles of Sedimentary Basin Analysis”, 1984)

"Big data can change the way social science is performed, but will not replace statistical common sense." (Thomas Landsall-Welfare, "Nowcasting the mood of the nation", Significance 9(4), 2012)

"Big Data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it." (Edd Wilder-James, "What is big data?", 2012) [source]

"The secret to getting the most from Big Data isn’t found in huge server farms or massive parallel computing or in-memory algorithms. Instead, it’s in the almighty pencil." (Matt Ariker, "The One Tool You Need To Make Big Data Work: The Pencil", 2012)

"Big data is the most disruptive force this industry has seen since the introduction of the relational database." (Jeffrey Needham, "Disruptive Possibilities: How Big Data Changes Everything", 2013)

"No subjective metric can escape strategic gaming [...] The possibility of mischief is bottomless. Fighting ratings is fruitless, as they satisfy a very human need. If one scheme is beaten down, another will take its place and wear its flaws. Big Data just deepens the danger. The more complex the rating formulas, the more numerous the opportunities there are to dress up the numbers. The larger the data sets, the harder it is to audit them." (Kaiser Fung, "Numbersense: How To Use Big Data To Your Advantage", 2013)

"There is convincing evidence that data-driven decision-making and big data technologies substantially improve business performance. Data science supports data-driven decision-making - and sometimes conducts such decision-making automatically - and depends upon technologies for 'big data' storage and engineering, but its principles are separate." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"Our needs going forward will be best served by how we make use of not just this data but all data. We live in an era of Big Data. The world has seen an explosion of information in the past decades, so much so that people and institutions now struggle to keep pace. In fact, one of the reasons for the attachment to the simplicity of our indicators may be an inverse reaction to the sheer and bewildering volume of information most of us are bombarded by on a daily basis. […] The lesson for a world of Big Data is that in an environment with excessive information, people may gravitate toward answers that simplify reality rather than embrace the sheer complexity of it." (Zachary Karabell, "The Leading Indicators: A short history of the numbers that rule our world", 2014)

"The other buzzword that epitomizes a bias toward substitution is 'big data'. Today’s companies have an insatiable appetite for data, mistakenly believing that more data always creates more value. But big data is usually dumb data. Computers can find patterns that elude humans, but they don’t know how to compare patterns from different sources or how to interpret complex behaviors. Actionable insights can only come from a human analyst (or the kind of generalized artificial intelligence that exists only in science fiction)." (Peter Thiel & Blake Masters, "Zero to One: Notes on Startups, or How to Build the Future", 2014)

"We have let ourselves become enchanted by big data only because we exoticize technology. We’re impressed with small feats accomplished by computers alone, but we ignore big achievements from complementarity because the human contribution makes them less uncanny. Watson, Deep Blue, and ever-better machine learning algorithms are cool. But the most valuable companies in the future won’t ask what problems can be solved with computers alone. Instead, they’ll ask: how can computers help humans solve hard problems?" (Peter Thiel & Blake Masters, "Zero to One: Notes on Startups, or How to Build the Future", 2014)

"As business leaders we need to understand that lack of data is not the issue. Most businesses have more than enough data to use constructively; we just don't know how to use it. The reality is that most businesses are already data rich, but insight poor." (Bernard Marr, Big Data: Using SMART Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance, 2015)

"Big data is based on the feedback economy where the Internet of Things places sensors on more and more equipment. More and more data is being generated as medical records are digitized, more stores have loyalty cards to track consumer purchases, and people are wearing health-tracking devices. Generally, big data is more about looking at behavior, rather than monitoring transactions, which is the domain of traditional relational databases. As the cost of storage is dropping, companies track more and more data to look for patterns and build predictive models." (Neil Dunlop, "Big Data", 2015)

"Big Data often seems like a meaningless buzz phrase to older database professionals who have been experiencing exponential growth in database volumes since time immemorial. There has never been a moment in the history of database management systems when the increasing volume of data has not been remarkable." (Guy Harrison, "Next Generation Databases: NoSQL, NewSQL, and Big Data", 2015)

"Dimensionality reduction is essential for coping with big data - like the data coming in through your senses every second. A picture may be worth a thousand words, but it’s also a million times more costly to process and remember. [...] A common complaint about big data is that the more data you have, the easier it is to find spurious patterns in it. This may be true if the data is just a huge set of disconnected entities, but if they’re interrelated, the picture changes." (Pedro Domingos, "The Master Algorithm", 2015)

"Science’s predictions are more trustworthy, but they are limited to what we can systematically observe and tractably model. Big data and machine learning greatly expand that scope. Some everyday things can be predicted by the unaided mind, from catching a ball to carrying on a conversation. Some things, try as we might, are just unpredictable. For the vast middle ground between the two, there’s machine learning." (Pedro Domingos, "The Master Algorithm", 2015)

"The human side of analytics is the biggest challenge to implementing big data." (Paul Gibbons, "The Science of Successful Organizational Change", 2015)

"To make progress, every field of science needs to have data commensurate with the complexity of the phenomena it studies. [...] With big data and machine learning, you can understand much more complex phenomena than before. In most fields, scientists have traditionally used only very limited kinds of models, like linear regression, where the curve you fit to the data is always a straight line. Unfortunately, most phenomena in the world are nonlinear. [...] Machine learning opens up a vast new world of nonlinear models." (Pedro Domingos, "The Master Algorithm", 2015)

"Underfitting is when a model doesn’t take into account enough information to accurately model real life. For example, if we observed only two points on an exponential curve, we would probably assert that there is a linear relationship there. But there may not be a pattern, because there are only two points to reference. [...] It seems that the best way to mitigate underfitting a model is to give it more information, but this actually can be a problem as well. More data can mean more noise and more problems. Using too much data and too complex of a model will yield something that works for that particular data set and nothing else." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"We are moving slowly into an era where Big Data is the starting point, not the end." (Pearl Zhu, "Digital Master: Debunk the Myths of Enterprise Digital Maturity", 2015)

"A popular misconception holds that the era of Big Data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data, and minimize bias. Even in a Big Data project, predictive models are typically developed and piloted with samples." (Peter C Bruce & Andrew G Bruce, "Statistics for Data Scientists: 50 Essential Concepts", 2016)

"Big data is, in a nutshell, large amounts of data that can be gathered up and analyzed to determine whether any patterns emerge and to make better decisions." (Daniel Covington, Analytics: Data Science, Data Analysis and Predictive Analytics for Business, 2016)

"Big Data processes codify the past. They do not invent the future. Doing that requires moral imagination, and that’s something only humans can provide. We have to explicitly embed better values into our algorithms, creating Big Data models that follow our ethical lead. Sometimes that will mean putting fairness ahead of profit." (Cathy O'Neil, "Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy", 2016)

"While Big Data, when managed wisely, can provide important insights, many of them will be disruptive. After all, it aims to find patterns that are invisible to human eyes. The challenge for data scientists is to understand the ecosystems they are wading into and to present not just the problems but also their possible solutions." (Cathy O'Neil, "Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy", 2016)

"Big Data allows us to meaningfully zoom in on small segments of a dataset to gain new insights on who we are." (Seth Stephens-Davidowitz, "Everybody Lies: What the Internet Can Tell Us About Who We Really Are", 2017)

"Effects without an understanding of the causes behind them, on the other hand, are just bunches of data points floating in the ether, offering nothing useful by themselves. Big Data is information, equivalent to the patterns of light that fall onto the eye. Big Data is like the history of stimuli that our eyes have responded to. And as we discussed earlier, stimuli are themselves meaningless because they could mean anything. The same is true for Big Data, unless something transformative is brought to all those data sets… understanding." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"The term [Big Data] simply refers to sets of data so immense that they require new methods of mathematical analysis, and numerous servers. Big Data - and, more accurately, the capacity to collect it - has changed the way companies conduct business and governments look at problems, since the belief wildly trumpeted in the media is that this vast repository of information will yield deep insights that were previously out of reach." (Beau Lotto, "Deviate: The Science of Seeing Differently", 2017)

"There are other problems with Big Data. In any large data set, there are bound to be inconsistencies, misclassifications, missing data - in other words, errors, blunders, and possibly lies. These problems with individual items occur in any data set, but they are often hidden in a large mass of numbers even when these numbers are generated out of computer interactions." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Just as they did thirty years ago, machine learning programs (including those with deep neural networks) operate almost entirely in an associational mode. They are driven by a stream of observations to which they attempt to fit a function, in much the same way that a statistician tries to fit a line to a collection of points. Deep neural networks have added many more layers to the complexity of the fitted function, but raw data still drives the fitting process. They continue to improve in accuracy as more data are fitted, but they do not benefit from the 'super-evolutionary speedup'."  (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"One of the biggest myths is the belief that data science is an autonomous process that we can let loose on our data to find the answers to our problems. In reality, data science requires skilled human oversight throughout the different stages of the process. [...] The second big myth of data science is that every data science project needs big data and needs to use deep learning. In general, having more data helps, but having the right data is the more important requirement. [...] A third data science myth is that modern data science software is easy to use, and so data science is easy to do. [...] The last myth about data science [...] is the belief that data science pays for itself quickly. The truth of this belief depends on the context of the organization. Adopting data science can require significant investment in terms of developing data infrastructure and hiring staff with data science expertise. Furthermore, data science will not give positive results on every project." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Apart from the technical challenge of working with the data itself, visualization in big data is different because showing the individual observations is just not an option. But visualization is essential here: for analysis to work well, we have to be assured that patterns and errors in the data have been spotted and understood. That is only possible by visualization with big data, because nobody can look over the data in a table or spreadsheet." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"With the growing availability of massive data sets and user-friendly analysis software, it might be thought that there is less need for training in statistical methods. This would be naïve in the extreme. Far from freeing us from the need for statistical skills, bigger data and the rise in the number and complexity of scientific studies makes it even more difficult to draw appropriate conclusions. More data means that we need to be even more aware of what the evidence is actually worth." (David Spiegelhalter, "The Art of Statistics: Learning from Data", 2019)

"Big data is revolutionizing the world around us, and it is easy to feel alienated by tales of computers handing down decisions made in ways we don’t understand. I think we’re right to be concerned. Modern data analytics can produce some miraculous results, but big data is often less trustworthy than small data. Small data can typically be scrutinized; big data tends to be locked away in the vaults of Silicon Valley. The simple statistical tools used to analyze small datasets are usually easy to check; pattern-recognizing algorithms can all too easily be mysterious and commercially sensitive black boxes." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Making big data work is harder than it seems. Statisticians have spent the past two hundred years figuring out what traps lie in wait when we try to understand the world through data. The data are bigger, faster, and cheaper these days, but we must not pretend that the traps have all been made safe. They have not." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"Many people have strong intuitions about whether they would rather have a vital decision about them made by algorithms or humans. Some people are touchingly impressed by the capabilities of the algorithms; others have far too much faith in human judgment. The truth is that sometimes the algorithms will do better than the humans, and sometimes they won’t. If we want to avoid the problems and unlock the promise of big data, we’re going to need to assess the performance of the algorithms on a case-by-case basis. All too often, this is much harder than it should be. […] So the problem is not the algorithms, or the big datasets. The problem is a lack of scrutiny, transparency, and debate." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"The problem is the hype, the notion that something magical will emerge if only we can accumulate data on a large enough scale. We just need to be reminded: Big data is not better; it’s just bigger. And it certainly doesn’t speak for itself." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"[...] the focus on Big Data AI seems to be an excuse to put forth a number of vague and hand-waving theories, where the actual details and the ultimate success of neuroscience is handed over to quasi- mythological claims about the powers of large datasets and inductive computation. Where humans fail to illuminate a complicated domain with testable theory, machine learning and big data supposedly can step in and render traditional concerns about finding robust theories. This seems to be the logic of Data Brain efforts today. (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

"We live on islands surrounded by seas of data. Some call it 'big data'. In these seas live various species of observable phenomena. Ideas, hypotheses, explanations, and graphics also roam in the seas of data and can clarify the waters or allow unsupported species to die. These creatures thrive on visual explanation and scientific proof. Over time new varieties of graphical species arise, prompted by new problems and inner visions of the fishers in the seas of data." (Michael Friendly & Howard Wainer, "A History of Data Visualization and Graphic Communication", 2021)

"Visualizations can remove the background noise from enormous sets of data so that only the most important points stand out to the intended audience. This is particularly important in the era of big data. The more data there is, the more chance for noise and outliers to interfere with the core concepts of the data set." (Kate Strachnyi, "ColorWise: A Data Storyteller’s Guide to the Intentional Use of Color", 2023)

"Visualisation is fundamentally limited by the number of pixels you can pump to a screen. If you have big data, you have way more data than pixels, so you have to summarise your data. Statistics gives you lots of really good tools for this." (Hadley Wickham)

10 February 2018

🔬Data Science: Data Mining (Definitions)

"The non-trivial extraction of implicit, previously unknown, and potentially useful information from data" (Frawley et al., "Knowledge discovery in databases: An overview", 1991)

"Data mining is the efficient discovery of valuable, nonobvious information from a large collection of data." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Data mining is the process of examining large amounts of aggregated data. The objective of data mining is to either predict what may happen based on trends or patterns in the data or to discover interesting correlations in the data." (Microsoft Corporation, "Microsoft SQL Server 7.0 Data Warehouse Training Kit", 2000)

"A data-driven approach to analysis and prediction by applying sophisticated techniques and algorithms to discover knowledge." (Paulraj Ponniah, "Data Warehousing Fundamentals", 2001)

"A class of undirected queries, often against the most atomic data, that seek to find unexpected patterns in the data. The most valuable results from data mining are clustering, classifying, estimating, predicting, and finding things that occur together. There are many kinds of tools that play a role in data mining. The principal tools include decision trees, neural networks, memory- and cased-based reasoning tools, visualization tools, genetic algorithms, fuzzy logic, and classical statistics. Generally, data mining is a client of the data warehouse." (Ralph Kimball & Margy Ross, "The Data Warehouse Toolkit" 2nd Ed., 2002)

"The discovery of information hidden within data." (William A Giovinazzo, "Internet-Enabled Business Intelligence", 2002)

"the process of extracting valid, authentic, and actionable information from large databases." (Seth Paul et al. "Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis", 2002)

"Advanced analysis or data mining is the analysis of detailed data to detect patterns, behaviors, and relationships in data that were previously only partially known or at times totally unknown." (Margaret Y Chu, "Blissful Data", 2004)

"Analysis of detail data to discover relationships, patterns, or associations between values." (Margaret Y Chu, "Blissful Data ", 2004)

"An information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques, and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling" 2nd Ed., 2005)

"the process of analyzing large amounts of data in search of previously undiscovered business patterns." (William H Inmon, "Building the Data Warehouse", 2005)

"A type of advanced analysis used to determine certain patterns within data. Data mining is most often associated with predictive analysis based on historical detail, and the generation of models for further analysis and query." (Jill Dyché & Evan Levy, "Customer Data Integration", 2006)

"Refers to the process of identifying nontrivial facts, patterns and relationships from large databases. The databases have often been put together for a different purpose from the data mining exercise." (Glenn J Myatt, "Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining", 2006)

"Data mining is the process of discovering implicit patterns in data stored in data warehouse and using those patterns for business advantage such as predicting future trends." (S. Sumathi & S. Esakkirajan, "Fundamentals of Relational Database Management Systems", 2007)

"Digging through data (usually in a data warehouse or data mart) to identify interesting patterns." (Rod Stephens, "Beginning Database Design Solutions", 2008)

"Intelligently analyzing data to extract hidden trends, patterns, and information. Commonly used by statisticians, data analysts and Management Information Systems communities." (Craig F Smith & H Peter Alesso, "Thinking on the Web: Berners-Lee, Gödel and Turing", 2008)

"The process of extracting valid, authentic, and actionable information from large databases." (Darril Gibson, "MCITP SQL Server 2005 Database Developer All-in-One Exam Guide", 2008)

"The process of retrieving relevant data to make intelligent decisions." (Robert D Schneider & Darril Gibson, "Microsoft SQL Server 2008 All-in-One Desk Reference For Dummies", 2008)

"A process that minimally has four stages: (1) data preparation that may involve 'data cleaning' and even 'data transformation', (2) initial exploration of the data, (3) model building or pattern identification, and (4) deployment, which means subjecting new data to the 'model' to predict outcomes of cases found in the new data." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"Automatically searching large volumes of data for patterns or associations." (Mark Olive, "SHARE: A European Healthgrid Roadmap", 2009)

"The use of machine learning algorithms to find faint patterns of relationship between data elements in large, noisy, and messy data sets, which can lead to actions to increase benefit in some form (diagnosis, profit, detection, etc.)." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"A data-driven approach to analysis and prediction by applying sophisticated techniques and algorithms to discover knowledge." (Paulraj Ponniah, "Data Warehousing Fundamentals for IT Professionals", 2010) 

"A way of extracting knowledge from a database by searching for correlations in the data and presenting promising hypotheses to the user for analysis and consideration." (Toby J Teorey, "Database Modeling and Design" 4th Ed., 2010)

"The process of using mathematical algorithms (usually implemented in computer software) to attempt to transform raw data into information that is not otherwise visible (for example, creating a query to forecast sales for the future based on sales from the past)." (Ken Withee, "Microsoft Business Intelligence For Dummies", 2010)

"A process that employs automated tools to analyze data in a data warehouse and other sources and to proactively identify possible relationships and anomalies." (Carlos Coronel et al, "Database Systems: Design, Implementation, and Management" 9th Ed., 2011)

"Process of analyzing data from different perspectives and summarizing it into useful information (e.g., information that can be used to increase revenue, cuts costs, or both)." (Linda Volonino & Efraim Turban, "Information Technology for Management" 8th Ed., 2011)

"The process of sifting through large amounts of data using pattern recognition, fuzzy logic, and other knowledge discovery statistical techniques to identify previously unknown, unsuspected, and potentially meaningful data content relationships and trends." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

"Data mining, a branch of computer science, is the process of extracting patterns from large data sets by combining statistical analysis and artificial intelligence with database management. Data mining is seen as an increasingly important tool by modern business to transform data into business intelligence giving an informational advantage." (T T Wong & Loretta K W Sze, "A Neuro-Fuzzy Partner Selection System for Business Social Networks", 2012)

"Field of analytics with structured data. The model inference process minimally has four stages: data preparation, involving data cleaning, transformation and selection; initial exploration of the data; model building or pattern identification; and deployment, putting new data through the model to obtain their predicted outcomes." (Gary Miner et al, "Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications", 2012)

"The process of identifying commercially useful patterns or relationships in databases or other computer repositories through the use of advanced statistical tools." (Microsoft, "SQL Server 2012 Glossary", 2012)

"The process of exploring and analyzing large amounts of data to find patterns." (Marcia Kaufman et al, "Big Data For Dummies", 2013)

"An umbrella term for analytic techniques that facilitate fast pattern discovery and model building, particularly with large datasets." (Meta S Brown, "Data Mining For Dummies", 2014)

"Analysis of large quantities of data to find patterns such as groups of records, unusual records, and dependencies" (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

"The practice of analyzing big data using mathematical models to develop insights, usually including machine learning algorithms as opposed to statistical methods."(Brenda L Dietrich et al, "Analytics Across the Enterprise", 2014)

"Data mining is the analysis of data for relationships that have not previously been discovered." (Piyush K Shukla & Madhuvan Dixit, "Big Data: An Emerging Field of Data Engineering", Handbook of Research on Security Considerations in Cloud Computing, 2015)

"A methodology used by organizations to better understand their customers, products, markets, or any other phase of the business." (Adam Gordon, "Official (ISC)2 Guide to the CISSP CBK" 4th Ed., 2015)

"Extracting information from a database to zero in on certain facts or summarize a large amount of data." (Faithe Wempen, "Computing Fundamentals: Introduction to Computers", 2015)

"It refers to the process of identifying and extracting patterns in large data sets based on artificial intelligence, machine learning, and statistical techniques." (Hamid R Arabnia et al, "Application of Big Data for National Security", 2015)

"The process of exploring and analyzing large amounts of data to find patterns." (Judith S Hurwitz, "Cognitive Computing and Big Data Analytics", 2015)

"Term used to describe analyzing large amounts of data to find patterns, correlations, and similarities." (Brittany Bullard, "Style and Statistics", 2016)

"The process of extracting meaningful knowledge from large volumes of data contained in data warehouses." (K  N Krishnaswamy et al, "Management Research Methodology: Integration of Principles, Methods and Techniques", 2016)

"A class of analytical applications that help users search for hidden patterns in a data set. Data mining is a process of analyzing large amounts of data to identify data–content relationships. Data mining is one tool used in decision support special studies. This process is also known as data surfing or knowledge discovery." (Daniel J Power & Ciara Heavin, "Decision Support, Analytics, and Business Intelligence" 3rd Ed., 2017)

"The process of collecting, searching through, and analyzing a large amount of data in a database to discover patterns or relationships." (Jonathan Ferrar et al, "The Power of People: Learn How Successful Organizations Use Workforce Analytics To Improve Business Performance", 2017)

"Data mining involves finding meaningful patterns and deriving insights from large data sets. It is closely related to analytics. Data mining uses statistics, machine learning, and artificial intelligence techniques to derive meaningful patterns." (Amar Sahay, "Business Analytics" Vol. I, 2018)

"The analysis of the data held in data warehouses in order to produce new and useful information." (Shon Harris & Fernando Maymi, "CISSP All-in-One Exam Guide" 8th Ed., 2018)

"The process of collecting critical business information from a data source, correlating the information, and uncovering associations, patterns, and trends." (Sybase, "Open Server Server-Library/C Reference Manual", 2019)

"The process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems." (Dmitry Korzun et al, "Semantic Methods for Data Mining in Smart Spaces", 2019)

"A technique using software tools geared for the user who typically does not know exactly what he's searching for, but is looking for particular patterns or trends. Data mining is the process of sifting through large amounts of data to produce data content relationships. It can predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. This is also known as data surfing." (Information Management)

"An analytical process that attempts to find correlations or patterns in large data sets for the purpose of data or knowledge discovery." (NIST SP 800-53)

"Extracting previously unknown information from databases and using that data for important business decisions, in many cases helping to create new insights." (Solutions Review)

"is the process of collecting data, aggregating it according to type and sorting through it to identify patterns and predict future trends." (Accenture)

"the process of analyzing large batches of data to find patterns and instances of statistical significance. By utilizing software to look for patterns in large batches of data, businesses can learn more about their customers and develop more effective strategies for acquisition, as well as increase sales and decrease overall costs." (Insight Software)

"The process of identifying commercially useful patterns or relationships in databases or other computer repositories through the use of advanced statistical tools." (Microsoft)

"The process of pulling actionable insight out of a set of data and putting it to good use. This includes everything from cleaning and organizing the data; to analyzing it to find meaningful patterns and connections; to communicating those connections in a way that helps decision-makers improve their product or organization." (KDnuggets)

"Data mining is the process of analyzing hidden patterns of data according to different perspectives for categorization into useful information, which is collected and assembled in common areas, such as data warehouses, for efficient analysis, data mining algorithms, facilitating business decision making and other information requirements to ultimately cut costs and increase revenue. Data mining is also known as data discovery and knowledge discovery." (Techopedia)

"Data mining is an automated analytical method that lets companies extract usable information from massive sets of raw data. Data mining combines several branches of computer science and analytics, relying on intelligent methods to uncover patterns and insights in large sets of information." (Sisense) [source]

"Data mining is the process of analyzing data from different sources and summarizing it into relevant information that can be used to help increase revenue and decrease costs. Its primary purpose is to find correlations or patterns among dozens of fields in large databases." (Logi Analytics) [source]

"Data mining is the process of analyzing massive volumes of data to discover business intelligence that helps companies solve problems, mitigate risks, and seize new opportunities." (Talend) [source]

"Data Mining is the process of collecting data, aggregating it according to type and sorting through it to identify patterns and predict future trends." (Accenture)

"Data mining is the process of discovering meaningful correlations, patterns and trends by sifting through large amounts of data stored in repositories. Data mining employs pattern recognition technologies, as well as statistical and mathematical techniques." (Gartner)

"Data mining is the process of extracting relevant patterns, deviations and relationships within large data sets to predict outcomes and glean insights. Through it, companies convert big data into actionable information, relying upon statistical analysis, machine learning and computer science." (snowflake) [source]

"Data mining is the work of analyzing business information in order to discover patterns and create predictive models that can validate new business insights. […] Unlike data analytics, in which discovery goals are often not known or well defined at the outset, data mining efforts are usually driven by a specific absence of information that can’t be satisfied through standard data queries or reports. Data mining yields information from which predictive models can be derived and then tested, leading to a greater understanding of the marketplace." (Informatica) [source]

02 December 2015

🪙Business Intelligence: Analytics (Just the Quotes)

"Data are essential, but performance improvements and competitive advantage arise from analytics models that allow managers to predict and optimize outcomes. More important, the most effective approach to building a model rarely starts with the data; instead it originates with identifying the business opportunity and determining how the model can improve performance." (Dominic Barton & David Court, "Making Advanced Analytics Work for You", 2012) 

"Even with simple and usable models, most organizations will need to upgrade their analytical skills and literacy. Managers must come to view analytics as central to solving problems and identifying opportunities - to make it part of the fabric of daily operations." (Dominic Barton & David Court, "Making Advanced Analytics Work for You", 2012)

"There is another important distinction pertaining to mining data: the difference between (1) mining the data to find patterns and build models, and (2) using the results of data mining. Students often confuse these two processes when studying data science, and managers sometimes confuse them when discussing business analytics. The use of data mining results should influence and inform the data mining process itself, but the two should be kept distinct." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"It is important to remember that predictive data analytics models built using machine learning techniques are tools that we can use to help make better decisions within an organization and are not an end in themselves. It is paramount that, when tasked with creating a predictive model, we fully understand the business problem that this model is being constructed to address and ensure that it does address it." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"Machine learning takes many different forms and goes by many different names: pattern recognition, statistical modeling, data mining, knowledge discovery, predictive analytics, data science, adaptive systems, self-organizing systems, and more. Each of these is used by different communities and has different associations. Some have a long half-life, some less so." (Pedro Domingos, "The Master Algorithm", 2015)

"The human side of analytics is the biggest challenge to implementing big data." (Paul Gibbons, "The Science of Successful Organizational Change", 2015)

"One important thing to bear in mind about the outputs of data science and analytics is that in the vast majority of cases they do not uncover hidden patterns or relationships as if by magic, and in the case of predictive analytics they do not tell us exactly what will happen in the future. Instead, they enable us to forecast what may come. In other words, once we have carried out some modelling there is still a lot of work to do to make sense out of the results obtained, taking into account the constraints and assumptions in the model, as well as considering what an acceptable level of reliability is in each scenario." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017)

"One of the biggest truths about the real–time analytics is that nothing is actually real–time; it's a myth. In reality, it's close to real–time. Depending upon the performance and ability of a solution and the reduction of operational latencies, the analytics could be close to real–time, but, while day-by-day we are bridging the gap between real–time and near–real–time, it's practically impossible to eliminate the gap due to computational, operational, and network latencies." (Shilpi Saxena & Saurabh Gupta, "Practical Real-time Data Processing and Analytics", 2017)

"The tension between bias and variance, simplicity and complexity, or underfitting and overfitting is an area in the data science and analytics process that can be closer to a craft than a fixed rule. The main challenge is that not only is each dataset different, but also there are data points that we have not yet seen at the moment of constructing the model. Instead, we are interested in building a strategy that enables us to tell something about data from the sample used in building the model." (Jesús Rogel-Salazar, "Data Science and Analytics with Python", 2017) 

"Big data is revolutionizing the world around us, and it is easy to feel alienated by tales of computers handing down decisions made in ways we don’t understand. I think we’re right to be concerned. Modern data analytics can produce some miraculous results, but big data is often less trustworthy than small data. Small data can typically be scrutinized; big data tends to be locked away in the vaults of Silicon Valley. The simple statistical tools used to analyze small datasets are usually easy to check; pattern-recognizing algorithms can all too easily be mysterious and commercially sensitive black boxes." (Tim Harford, "The Data Detective: Ten easy rules to make sense of statistics", 2020)

"For advanced analytics, a well-designed data pipeline is a prerequisite, so a large part of your focus should be on automation. This is also the most difficult work. To be successful, you need to stitch everything together." (Piethein Strengholt, "Data Management at Scale: Best Practices for Enterprise Architecture", 2020)

"Data literacy is not a change in an individual’s abilities, talents, or skills within their careers, but more of an enhancement and empowerment of the individual to succeed with data. When it comes to data and analytics succeeding in an organization’s culture, the increase in the workforces’ skills with data literacy will help individuals to succeed with the strategy laid in front of them. In this way, organizations are not trying to run large change management programs; the process is more of an evolution and strengthening of individual’s talents with data. When we help individuals do more with data, we in turn help the organization’s culture do more with data." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"In the world of data and analytics, people get enamored by the nice, shiny object. We are pulled around by the wind of the latest technology, but in so doing we are pulled away from the sound and intelligent path that can lead us to data and analytical success. The data and analytical world is full of examples of overhyped technology or processes, thinking this thing will solve all of the data and analytical needs for an individual or organization. Such topics include big data or data science. These two were pushed into our minds and down our throats so incessantly over the past decade that they are somewhat of a myth, or people finally saw the light. In reality, both have a place and do matter, but they are not the only solution to your data and analytical needs. Unfortunately, though, organizations bit into them, thinking they would solve everything, and were left at the alter, if you will, when it came time for the marriage of data and analytical success with tools." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Pure data science is the use of data to test, hypothesize, utilize statistics and more, to predict, model, build algorithms, and so forth. This is the technical part of the puzzle. We need this within each organization. By having it, we can utilize the power that these technical aspects bring to data and analytics. Then, with the power to communicate effectively, the analysis can flow throughout the needed parts of an organization." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.