Showing posts with label data models. Show all posts
Showing posts with label data models. Show all posts

16 October 2024

🧭💹Business Intelligence: Perspectives (Part XVIII: There’s More to Noise)

Business Intelligence Series
Business Intelligence Series

Visualizations should be built with an audience's characteristics in mind! Upon case, it might be sufficient to show only values or labels of importance (minima, maxima, inflexion points, exceptions, trends), while other times it might be needed to show all or most of the values to provide an accurate extended perspective. It even might be useful to allow users switching between the different perspectives to reduce the clutter when navigating the data or look at the patterns revealed by the clutter. 

In data-based storytelling are typically shown the points, labels and further elements that support the story, the aspects the readers should focus on, though this approach limits the navigability and users’ overall experience. The audience should be able to compare magnitudes and make inferences based on what is shown, and the accurate decoding shouldn’t be taken as given, especially when the audience can associate different meanings to what’s available and what’s missing. 

In decision-making, selecting only some well-chosen values or perspectives to show might increase the chances for a decision to be made, though is this equitable? Cherry-picking may be justified by the purpose, though is in general not a recommended practice! What is not shown can be as important as what is shown, and people should be aware of the implications!

One person’s noise can be another person’s signal. Patterns in the noise can provide more insight compared with the trends revealed in the "unnoisy" data shown! Probably such scenarios are rare, though it’s worth investigating what hides behind the noise. The choice of scale, the use of special types of visualizations or the building of models can reveal more. If it’s not possible to identify automatically such scenarios using the standard software, the users should have the possibility of changing the scale and perspective as seems fit. 

Identifying patterns in what seems random can prove to be a challenge no matter the context and the experience in the field. Occasionally, one might need to go beyond the general methods available and statistical packages can help when used intelligently. However, a presenter’s challenge is to find a plausible narrative around the findings and communicate it further adequately. Additional capabilities must be available to confirm the hypotheses framed and other aspects related to this approach.

It's ideal to build data models and a set of visualizations around them. Most probable some noise may be removed in the process, while other noise will be further investigated. However, this should be done through adjustable visual filters because what is removed can be important as well. Rare events do occur, probably more often than we are aware and they may remain hidden until we find the right perspective that takes them into consideration. 

Probably, some of the noise can be explained by special events that don’t need to be that rare. The challenge is to identify those parameters, associations, models and perspectives that reveal such insights. One’s gut feeling and experience can help in this direction, though novel scenarios can surprise us as well.

Not in every set of data one can find patterns, respectively a story trying to come out. Whether we can identify something worth revealing depends also on the data available at our disposal, respectively on whether the chosen data allow identifying significant patterns. Occasionally, the focus might be too narrow, too wide or too shallow. It’s important to look behind the obvious, to look at data from different perspectives, even if the data seems dull. It’s ideal to have the tools and knowledge needed to explore such cases and here the exposure to other real-life similar scenarios is probably critical!

06 May 2024

🧭🏭Business Intelligence: Microsoft Fabric (Part III: The Metrics Layer [new feature])

Introduction

One of the announcements of this year's Microsoft Fabric Community first conference was the introduction of a metrics layer in Fabric which "allows organizations to create standardized business metrics, that are rooted in measures and are discoverable and intended for reuse" [1]. As it seems, the information content provided at the conference was kept to a minimum given that the feature is still in private preview, though several webcasts start to catch up on the topic (see [2], [4]). Moreover, as part of their show, the Explicit Measures (@PowerBITips) hosts had Carly Newsome as invitee, the manager of the project, who unveiled more details about the project and the feature, details which became the main source for the information below. 

The idea of a metric layer or metric store is not new, data professionals occasionally refer to their structure(s) of metrics as such. The terms gained weight in their modern conception relatively recently in 2021-2022 (see [5], [6], [7], [8], [10]). Within the modern data stack, a metrics layer or metric store is an abstraction layer available between the data store(s) and end users. It allows to centrally define, store, and manage business metrics. Thus, it allows us to standardize and enforce a single source of truth (SSoT), respectively solve several issues existing in the data stacks. As Benn Stancil earlier remarked, the metrics layer is one of the missing pieces from the modern data stack (see [10]).

Microsoft's Solution

Microsoft's business case for metrics layer's implementation is based on three main ideas (1) duplicate measures contribute to poor data quality, (2) complex data models hinder self-service, (3) reduce data silos in Power BI. In Microsoft's conception the metric layer provides several benefits: consistent definitions and descriptions, easy management via management views, searchable and discoverable metrics, respectively assure trust through indicators. 

For this feature's implementation Microsoft introduces a new Fabric Item called a metric set that allows to group several (business) metrics together as part of a mini-model that can be tailored to the needs of a subset of end-users and accessed by them via the standard tools already available. The metric set becomes thus a mini-model. Such mini-models allow to break down and reduce the overall complexity of semantic models, while being easy to evolve and consume. The challenge will become then on how to break down existing and future semantic models into nonoverlapping mini-models, creating in extremis a partition (see the Lego metaphor for data products). The idea of mini-models is not new, [12] advocating the idea of using a Master Model, a technique for creating derivative tabular models based on a single tabular solution.

A (business) metric is a way to elevate the measures from the various semantic models existing in the organization within the mini-model defined by the metric set. A metric can be reused in other fabric artifacts - currently in new reports on the Power BI service, respectively in notebooks by copying the code. Reusing metrics in other measures can mean that one can chain metrics and the changes made will be further propagated downstream. 

The Metrics Layer in Microsoft Fabric (adapted diagram)
The Metrics Layer in Microsoft Fabric (adapted diagram)

Every metric is tied to the original semantic model which allows thus to track how a metric is used across the solutions and, looking forward to Purview, to identify data's lineage. A measure is related to a "table", the source from which the measure came from.

Users' Perspective

The Metrics Layer feature is available in Microsoft Fabric service for Power BI within the Metrics menu element next to Scorecards. One starts by creating a metric set in an existing workspace, an operation which creates the actual artifact, to which the individual metrics are added. To create a metric, a user with build permissions can navigate through the semantic models across different workspaces he/she has access to, pick a measure from one of them and elevate it to a metric, copying in the process its measure's definition and description. In this way the metric will always point back to the measure from the semantic model, while the metrics thus created are considered as a related collection and can be shared around accordingly. 

Once a metric is added to the metric set, one can add in edit mode dimensions to it (e.g. Date, Category, Product Id, etc.). One can then further explore a metric's output and add filters (e.g. concentrate on only one product or category) point from which one can slice-and-dice the data as needed.

There is a panel where one can see where the metric has been used (e.g. in reports, scorecards, and other integrations), when was last time refreshed, respectively how many times was used. Thus, one has the most important information in one place, which is great for developers as well as for the users. Probably, other metadata will be added, such as whether an increase in the metric would be favorable or unfavorable (like in Tableau Pulse, see [13]) or maybe levels of criticality, an unit of measure, or maybe its type - simple metric, performance indicator (PI), result indicator (RI), KPI, KRI etc.

Metrics can be persisted to the OneLake by saving their output to a delta table into the lakehouse. As demonstrated in the presentation(s), with just a copy-paste and a small piece of code one can materialize the data into a lakehouse delta table, from where the data can be reused as needed. Hopefully, the process will be further automated. 

One can consume metrics and metrics sets also in Power BI Desktop, where a new menu element called Metric sets was added under the OneLake data hub, which can be used to connect to a metric set from a Semantic model and select the metrics needed for the project. 

Tapping into the available Power BI solutions is done via an integration feature based on Sempy fabric package, a dataframe for storage and propagation of Power BI metadata which is part of the python-based semantic Link in Fabric [11].

Further Thoughts

When dealing with a new feature, a natural idea comes to mind: what challenges does the feature involve, respectively how can it be misused? Given that the metrics layer can be built within a workspace and that it can tap into the existing measures, this means that one can built on the existing infrastructure. However, this can imply restructuring, refactoring, moving, and testing a lot of code in the process, hopefully with minimal implications for the solutions already available. Whether the process is as simple as imagined is another story. As misusage, in extremis, data professionals might start building everything as metrics, though the danger might come when the data is persisted unnecessarily. 

From a data mesh's perspective, a metric set is associated with a domain, though there will be metrics and data common to multiple domains. Moreover, a mini-model has the potential of becoming a data product. Distributing the logic across multiple workspaces and domains can add further challenges, especially in what concerns the synchronization and implemented of requirements in a way that doesn't lead to bottlenecks. But this is a general challenge for the development team(s). 

The feature will probably suffer further changes until is released in public review (probably by September or the end of the year). I subscribe to other data professionals' opinion that the feature was for long needed and that can have an important impact on the solutions built. 

Previous Post <<||>> Next Post

Resources:
[1] Microsoft Fabric Blog (2024) Announcements from the Microsoft Fabric Community Conference (link)
[2] Power BI Tips (2024) Explicit Measures Ep. 236: Metrics Hub, Hot New Feature with Carly Newsome (link)
[3] Power BI Tips (2024) Introducing Fabric Metrics Layer / Power Metrics Hub [with Carly Newsome] (link)
[4] KratosBI (2024) Fabric Fridays: Metrics Layer Conspiracy Theories #40 (link)
[5] Chris Webb's BI Blog (2022) Is Power BI A Semantic Layer? (link)
[6] The Data Stack Show (2022) TDSS 95: How the Metrics Layer Bridges the Gap Between Data & Business with Nick Handel of Transform (link)
[7] Sundeep Teki (2022) The Metric Layer & how it fits into the Modern Data Stack (link)
[8] Nick Handel (2021) A brief history of the metrics store (link)
[9] Aurimas (2022) The Jungle of Metrics Layers and its Invisible Elephant (link)
[10] Benn Stancil (2021) The missing piece of the modern data stack (link)
[11] Microsoft Learn (2024) Sempy fabric Package (link)
[12] Michael Kovalsky (2019) Master Model: Creating Derivative Tabular Models (link)
[13] Christina Obry (2023) The Power of a Metrics Layer - and How Your Organization Can Benefit From It (link
[14] KratosBI (2024) Introducing the Metrics Layer in #MicrosoftFabric with Carly Newsome [link]

25 March 2024

📊R Language: Regression Analysis with Simulated & Real Data

Before doing regression on a real dataset, one can use as minimum a set of simulated data to test the steps (code adapted after [1]):

# define the model with simulated data
n <- 100
x <- c(1:n)
error <- rnorm(n,0,10)
y <- 1+2*x+error
fit <- lm(y~x)

# plotting the values
plot(x, y, ylab="1+2*x+error")
lines(x, fit$fitted.values)

#using anova (analysis of variance)
anova(fit)

In the first step is created the data model, while in the second the data are plotted, while in the third the analysis of variance is run. For the y variable, can be used any linear function that represents a line in the plane. 

rnorm() function generates multivariate normal random variates based on the parameters given, therefore the output will vary between the runs of the above code. The bigger the value of the third parameter, the more dispersed the data is.

To test the code on real data, one can use the Sleuth3 library with the data from [2] (see RPubs):

install.packages ("Sleuth3")
library("Sleuth3")

Let's look at the data from the first case, which represent an experiment concerning the effects of intrinsic and extrinsic motivation on creativity run by the psychologist Teresa Amabile (see [2]):

attach(case0101)
case0101
summary(case0101)  

The regression can be applied to all the data:

# case 0101 (all data)
x <- c(1:47)
y <- case0101$Score
fit <- lm(y~x)
plot(x, y, ylab="Score")
lines(x, fit$fitted.values)

Though, a more appropriate analysis should be based on each questionnaire:

# case 0101 (extrinsic vs intrinsic treatments)
extrinsic <- subset(case0101, Treatment %in% "Extrinsic")
intrinsic <- subset(case0101, Treatment %in% "Intrinsic")

par(mfrow = c(1,2)) #1x2 matrix display
x <- c(1:length(extrinsic$Score))
y <- extrinsic$Score
fit <- lm(y~x)
plot(x, y, ylab="Extrinsic Score")
lines(x, fit$fitted.values)

x <- c(1:length(intrinsic$Score))
y <- intrinsic$Score
fit <- lm(y~x)
plot(x, y, ylab="Intrinsic Score")
lines(x, fit$fitted.values)

title("Extrinsic vs. Intrinsic Motivation on Creativity", line = -2, outer = TRUE)

And, here's the output:

Case 0101 Extrinsic vs. Intrinsic Motivation on Creativity

Happy coding!

Previous Post <<||>> Next Post

21 October 2023

🧊Data Warehousing: Architecture V (Dynamics 365, the Data Lakehouse and the Medallion Architecture)

Data Warehousing
Data Warehousing Series

An IT architecture is built and functions under a set of constraints that derive from architecture’s components. Usually, if we want flexibility or to change something in one area, this might have an impact in another area. This rule applies to the usage of the medallion architecture as well! 

In Data Warehousing the medallion architecture considers a multilayered approach in building a single source of truth, each layer denoting the quality of data stored in the lakehouse [1]. For the moment are defined 3 layers - bronze for raw data, silver for validated data, and gold for enriched data. The concept seems sound considering that a Data Lake contains all types of raw data of different quality that needs to be validated and prepared for reporting or other purposes.

On the other side there are systems like Dynamics 365 that synchronize the data in near-real-time to the Data Lake through various mechanisms at table and/or data entity level (think of data entities as views on top of other tables or views). The databases behind are relational and in theory the data should be of proper quality as needed by business.

The greatest benefit of serverless SQL pool is that it can be used to build near-real-time data analytics solutions on top of the files existing in the Data Lake and the mechanism is quite simple. On top of such files are built external tables in serverless SQL pool, tables that reflect the data model from the source systems. The external tables can be called as any other tables from the various database objects (views, stored procedures and table-valued functions). Thus, can be built an enterprise data model with dimensions, fact-like and mart-like entities on top of the synchronized filed from the Data Lake. The Data Lakehouse (= Data Warehouse + Data Lake) thus created can be used for (enterprise) reporting and other purposes.

As long as there are no special requirements for data processing (e.g. flattening hierarchies, complex data processing, high-performance, data cleaning) this approach allows to report the data from the data sources in near-real time (10-30 minutes), which can prove to be useful for operational and tactical reporting. Tapping into this model via standard Power BI and paginated reports is quite easy. 

Now, if it's to use the data medallion approach and rely on pipelines to process the data, unless one is able to process the data in near-real-time or something compared with it, a considerable delay will be introduced, delay that can span from a couple of hours to one day. It's also true that having the data prepared as needed by the reports can increase the performance considerably as compared to processing the logic at runtime. There are advantages and disadvantages to both approaches. 

Probably, the most important scenario that needs to be handled is that of integrating the data from different sources. If unique mappings between values exist, unique references are available in one system to the records from the other system, respectively when a unique logic can be identified, the data integration can be handled in serverless SQL pool.

Unfortunately, when compared to on-premise or Azure SQL functionality, the serverless SQL pool has important constraints - it's not possible to use scalar UDFs, tables, recursive CTEs, etc. So, one needs to work around these limitations and in some cases use the Spark pool or pipelines. So, at least for exceptions and maybe for strategic reporting a medallion architecture can make sense and be used in parallel. However, imposing it on all the data can reduce flexibility!

Bottom line: consider the architecture against your requirements!

Previous Post <<||>>> Next Post

[1] What is the medallion lakehouse architecture?
https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion

20 March 2021

🧭Business Intelligence: New Technologies, Old Challenges (Part II - ETL vs. ELT)

 

Business Intelligence

Data lakes and similar cloud-based repositories drove the requirement of loading the raw data before performing any transformations on the data. At least that’s the approach the new wave of ELT (Extract, Load, Transform) technologies use to handle analytical and data integration workloads, which is probably recommendable for the mentioned cloud-based contexts. However, ELT technologies are especially relevant when is needed to handle data with high velocity, variance, validity or different value of truth (aka big data). This because they allow processing the workloads over architectures that can be scaled with workloads’ demands.

This is probably the most important aspect, even if there can be further advantages, like using built-in connectors to a wide range of sources or implementing complex data flow controls. The ETL (Extract, Transform, Load) tools have the same capabilities, maybe reduced to certain data sources, though their newer versions seem to bridge the gap.

One of the most stressed advantages of ELT is the possibility of having all the (business) data in the repository, though these are not technological advantages. The same can be obtained via ETL tools, even if this might involve upon case a bigger effort, effort depending on the functionality existing in each tool. It’s true that ETL solutions have a narrower scope by loading a subset of the available data, or that transformations are made before loading the data, though this depends on the scope considered while building the data warehouse or data mart, respectively the design of ETL packages, and both are a matter of choice, choices that can be traced back to business requirements or technical best practices.

Some of the advantages seen are context-dependent – the context in which the technologies are put, respectively the problems are solved. It is often imputed to ETL solutions that the available data are already prepared (aggregated, converted) and new requirements will drive additional effort. On the other side, in ELT-based solutions all the data are made available and eventually further transformed, but also here the level of transformations made depends on specific requirements. Independently of the approach used, the data are still available if needed, respectively involve certain effort for further processing.

Building usable and reliable data models is dependent on good design, and in the design process reside the most important challenges. In theory, some think that in ETL scenarios the design is done beforehand though that’s not necessarily true. One can pull the raw data from the source and build the data models in the target repositories.

Data conversion and cleaning is needed under both approaches. In some scenarios is ideal to do this upfront, minimizing the effect these processes have on data’s usage, while in other scenarios it’s helpful to address them later in the process, with the risk that each project will address them differently. This can become an issue and should be ideally addressed by design (e.g. by building an intermediate layer) or at least organizationally (e.g. enforcing best practices).

Advancing that ELT is better just because the data are true (being in raw form) can be taken only as a marketing slogan. The degree of truth data has depends on the way data reflects business’ processes and the way data are maintained, while their quality is judged entirely on their intended use. Even if raw data allow more flexibility in handling the various requests, the challenges involved in processing can be neglected only under the consequences that follow from this.

Looking at the analytics and data integration cloud-based technologies, they seem to allow both approaches, thus building optimal solutions relying on professionals’ wisdom of making appropriate choices.

Previous Post <<||>>Next Post

21 May 2020

📦Data Migrations (DM): In-house Built Solutions (Part II: The Import Layer)

Data Migration
Data Migrations Series

A data migration involves the mapping between two data models at (data) entity level, where the entity is a data abstraction modelling a business entity (e.g. Products, Vendors, Customers, Sales Orders, Purchase Orders, etc.). Thus, the Products business entity from the source will be migrated to a similar entity into the target. Ideally, the work would be simplified if the two models would provide direct access to the data through entities. Unfortunately, this is seldom the case, the entities being normalized and thus broken into several tables, with important structural differences. 

Therefore, the first step within a DM is identifying the business entities that make its scope from source and target, and providing a mapping between their attributes which will define how the data will flow between source and target. 

In theory, the source entity could be defined directly into the source with the help of views, if they are not already available. The problem with this approach is that the base data change, fact that can easily lead to inconsistencies between the various steps of the migration. For example, records are added, deleted, inactivated, or certain attributes are changed, fact that can easily make troubleshooting and validation a nightmare. 

The easiest way to address this is by assuring that the data will change only when actually needed. Is needed thus to create a snapshot of the data and work with it. Snapshots can become however costly for the performance of the source database, as they involve an additional maintenance overhead. Another solution is to make the snapshot via a backup or by copying the data via ETL functionality into another database (aka migration database). Considering that the data in scope make a small subset, a backup is usually costly as storage space and time, and is not always possible to take a backup when needed. 

An ETL-based solution for this provides an acceptable performance and is flexible enough to address all important types of requirements. The data can be accessed directly from the source (pull mechanism) or, when the direct access is an issue, they could be pushed to the migration database (push mechanism) or made available for load to a given location, then imported it into the migration database (hybrid mechanism). There’s also the possibility to integrate the migration database when a publisher/subscriber mechanism is in place, however such solutions raise other types of issues. 

One can import the tables 1:1 from the source for the entities in scope, attempt directly to model the entity within the ETL jobs or find a solution in between (e.g. import the base tables while considering joining also the dropdown tables). The latter seems to provide the best approach because it minimizes the numbers of tables to be imported while still reflecting the data structures from the source. An entity-based import addresses the first but not the second aspect, though depending on the requests it can work as well. 

In Data Warehousing (DW) there’s the practice to load the data into staging tables with no constraints on them, and only when the load is complete to move the data into the base tables which will be used as source for the further processing. This approach assures that the data are loaded completely and that the unavailability of the base tables is limited. In contrast to DW solutions is ideally not to perform any transformations on the data, as they should reflect the quality characteristics from the source. It's ideal to keep the data extraction, respectively the ETL jobs as simple as possible and resist building the migration logic already into this layer. 

19 May 2020

📦Data Migrations (DM): In-house Built Solutions (Part I - An Introduction)

Data Migration
Data Migrations Series

A Data Migration (DM) is the move of all or a subset of the data available from one or more system(s) into other system(s). For ERP systems a DM this can be achieved by having a database in between which allows the import and combination of data from various sources, the further processing of data, respectively the preparation of the data as per target system(s)’ needs. 

At a deeper lever is needed to model the entities from source and the target systems at level that allows easier processing, which resumes to removing and adding data, making mapping and transformations between the input and output models. In other words, a set of entities is transformed in another set of entities, while at a much lower level attributes from the source system must be mapped to the target system. 

SQL scripting provides enough flexibility in modelling the source or target system(s), in changing the data and logic on the fly, in building additional functionality, in reusing logic or cleaning the data. It works with big amounts of structured data, while for unstructured data one can still combine the power of SQL with ad-hoc or even NoSQL processing. It all sounds easy, isn’t it? Then from where comes the complexity people talk about?

Knowing SQL and the features of a (relational) database (the how) is only a small part from all is needed. Building a DM solution implies the knowledge of the source data models and of the data behind (the what), knowing the target (where), how and when to import the data, why the data must be made available, in what way the data has impact on the target system(s), respectively who needs to be involved. The more of these aspects are known, the higher the chances for the DM to succeed. 

When there’s a gap in the needed knowledge, then the knowledge must be acquired from the business and/or consultants, by investing time in data discovery or similar activities, in experimenting and testing. The more uncertainty, the more iterations are needed to achieve the degree of certainty (aka quality) needed. At minimum one needs at least 3-4 iterations to migrate the data – to build the concept, to test fully the migration, respectively to get the sign-off in a test environment and doing the migration into the production. 

Another type of complexity is derived from the interdependencies existing between the DM and vital activities like system parameterization and data cleaning. System’s parametrization defines how a system behaves, an important number of the parametrization values being reflected also in the prepared data for the DM. Wrong parametrization values can be trapped in the process during the various iterations, however poor data quality will reflect though the system after Go-Live and will haunt the business on the long term, unless addressed correctly into the project. 

All these aspects are reflected to some degree also in DM’s architecture (e.g. building restart points, including data validation and automatic cleaning) and process (e.g. sequencing of steps). From all the effort involved in DM, maybe 20-30% needs to be invested in building the solution. These 20-30% can be reduced in theory by using specialized DM tools, though the advantage provided by such tools can be illusory, especially when further limitations are involved. On the other side such tools may provide functionality that address the other 70-80%. 

As usual in IT, there’s a trade-off between advantages and disadvantages. An in-house build DM solution may provide the needed flexibility in the detriment of a small time investment. Whether it pays-off is something one has to prove by walking this path. 

03 January 2020

🗄️Data Management: Data Literacy (Part I: A Second Language)

Data Management

At the Gartner Data & Analytics Summit that took place in 2018 in Grapevine, Texas, it was reiterated the importance of data literacy for taking advantage of the emergence of data analytics, artificial intelligence (AI) and machine learning (ML) technologies. Gartner expected then that by 2020, 80% of organizations will initiate deliberate competency development in the field of data literacy [1] – or how they put it – learning to ‘speak data’ as a ‘second language’.

Data literacy is typically defined as the ability to read, work with, analyze, and argue with data. Sure, these form the blocks of data literacy, though what I’m missing from this definition is the ability to understand the data, even if understanding should be the outcome of reading, and the ability to put data into the context of business problems, even if the analyzes of data could involve this later aspect too.

Understanding has several aspects: understanding the data structures available within an organization, understanding the problems with data (including quality, governance, privacy and security), respectively understanding how the data are linked to the business processes. These aspects go beyond the simple ability included in the above definition, which from my perspective doesn’t include the particularities of an organization (data structure, data quality and processes) – the business component. This is reflected in one of the problems often met in the BI/data analytics industry – the solutions developed by the various service providers don’t reflect organizations’ needs, one of the causes being the inability to understand the business on segments or holistically.  

Putting data into context means being able to use the respective data in answering stringent business problems. A business problem needs to be first correctly defined and this requires a deep understanding of the business. Then one needs to identify the data that could help finding the answers to the problem, respectively of building one or more models that would allow elaborating further theories and performing further simulations. This is an ongoing process in which the models built are further enhanced, when possible, or replaced by better ones.

Probably the comparison with a second language is only partially true. One can learn a second language and argue in the respective language, though it doesn’t mean that the argumentations will be correct or constructive as long the person can’t do the same in the native language. Moreover, one can have such abilities in the native or a secondary language, but not be able do the same in what concerns the data, as different skillsets are involved. This aspect can make quite a difference in a business scenario. One must be able also to philosophize, think critically, as well to understand the forms of communication and their rules in respect to data.

To philosophize means being able to understand the causality and further relations existing within the business and think critically about them. Being able to communicate means more than being able to argue – it means being able to use effectively the communication tools – communication channels, as well the methods of representing data, information and knowledge. In extremis one might even go beyond the basic statistical tools, stepping thus in what statistical literacy is about. In fact, the difference between the two types of literacy became thinner, the difference residing in the accent put on their specific aspects.

These are the areas which probably many professionals lack. Data literacy should be the aim, however this takes time and is a continuous iterative process that can take years to reach maturity. It’s important for organizations to start addressing these aspects, progress in small increments and learn from the experience accumulated.

Previous Post <<||>> Next Post

References:
[1] Gartner (2018) How data and analytics leaders learn to master information as a second language, by Christy Pettey (link

29 November 2019

🧭Business Intelligence: Perspectives (Part V: Data Soup - From BI to Analytics)

Business Intelligence Series
Business Intelligence Series

The days when everything was reduced to simple terminology like reports or queries are gone. One can see it in the market trends related to reporting or data, as well in the jargon soup the IT people use on the daily basis – Business Intelligence (BI), Data Mining (DM), Analytics, Data Science, Data Warehousing (DW), Machine Learning (ML), Artificial Intelligence (AI) and so on. What’s more confusing for the users and other spectators is the easiness with which all these concepts are used, sometimes interchangeably, and often it feels like nothing makes sense.

BI is used nowadays to refer to the technologies, architectures, methodologies, processes and practices used to transform data into what is desired as meaningful and useful information.  From its early beginnings in the 60s, the intelligence from Business Intelligence (BI) refers to the ability to apprehend the interrelationships of the facts to be processed (aka data) in such a way as to guide action towards a desired goal.

The main purpose of BI was and is to guide actions and provide a solid basis for decision making, aspect not necessarily reflected in the way organizations use their BI infrastructure. Except basic operational/tactical/strategic reports and metrics that reflect to a higher or lower degree organizations’ goals, BI often fails to provide the expected value. The causes are multiple ranging from an organizations maturity in devising a strategy and dividing it into SMART goals and objectives, to the misuse of technologies for the wrong purposes.

Despite the basic data analysis techniques, the rich visualizations and navigation functionality, BI fails often to deliver by itself more than ordinary and already known information. Information becomes valuable when it brings novelty, when it can be easily transformed into knowledge, or even better, when knowledge is extracted directly. To address the limitations of the BI a series of techniques appeared in parallel and coined in the 90s as Data Mining.

Mining is the process of obtaining something valuable from a resource. What DM tries to achieve as process is the extraction of knowledge in form patterns from the data by categorizing, clustering, identifying dependencies or anomalies. When compared with data analysis, the main characteristics of DM is the fact that is used to test models and hypotheses, and that it uses a set of semiautomatic and automatic out-of-the-box statistics packages, AI or predictive algorithms with applicability in different areas – Web,  text, speech, business processes, etc.

DM proved to be useful by allowing to build models rooted in historical data, models which allowed predicting outcome or behavior, however the models are pretty basic and there’s always a threshold beyond which they can’t go. Furthermore, the costs of preparing the data and of the needed infrastructure seem to be high compared with the benefits data mining provides. There are scenarios in which DM proves to bring benefit, while in others it raises more challenges than can solve. Privacy, security, misuse of information and the blind use of techniques without understanding the data or the models behind, are just some of such challenges.  

Information seems too common, while knowledge can become expensive to obtain. The middle way between the two found its future into another buzzword – analyticsthe systematic analysis of data or statistics using specific mathematical methods. Analytics combine the agility of data analysis techniques with the power of predictive and prescriptive techniques used in DM in discovering patterns into the data. Analytics attempts to identify why it happens by using a chain of inferences resulted from data’s analyzing and understanding. From another perspective analytics seems to be a rebranded and slightly enhanced version of BI.

01 July 2019

🧱IT: Data Models (Definitions)

"A system data model is a collection of the information being addressed by a specific system or function such as a billing system, data warehouse, or data mart. The system model is an electronic representation of the information needed by that system. It is independent of any specific technology or DBMS environment." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"The business data model, sometimes known as the logical data model, describes the major things ('entities') of interest to the company and the relationships between pairs of these entities. It is an abstraction or representation of the data in a given business environment, and it provides the benefits cited for any model. It helps people envision how the information in the business relates to other information in the business ('how the parts fit together')." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

"The technology data model is a collection of the specific information being addressed by a particular system and implemented on a specific platform." (Claudia Imhoff et al, "Mastering Data Warehouse Design", 2003)

[enterprise data model:] "A high-level, enterprise-wide framework that describes the subject areas, sources, business dimensions, business rules, and semantics of an enterprise." (Sharon Allen & Evan Terry, "Beginning Relational Data Modeling 2nd Ed.", 2005)

[canonical data model:] "The definition of a standard organization view of a particular information subject. To be practical, canonical data models include a mapping back to each application view of the same subject." (David Lyle & John G Schmidt, "Lean Integration", 2010)

[hierarchical data model:] "A data model that represents data in a tree-like structure of only one-to-many relationships, where each entity may have a ‘many’ side when related to a parent, and a ‘one’ side when related to a child." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[Enterprise Data Model (EDM):] "A conceptual data model or logical data model providing a common consistent view of shared data across the enterprise, however that is defined, at a point in time. It is common to use the term to mean a high-level, simplified data model, but that is a question of abstraction for presentation." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[network data model:] "A representation of objects and their participation in one or more owner-member sets. In such a model, a both owners and members may participate in multiple sets, affecting a network of objects and relationships." (DAMA International, "The DAMA Dictionary of Data Management", 2011)

[enterprise data model:] "A single data model that comprehensively describes the data needs of the entire organization." (Craig S Mullins, "Database Administration", 2012)

[generic data model:] "A data model of an industry, rather than of a specific company; a generic data model can be used as a template that can be customized for a given company within the industry that has been modeled." (Daniel Linstedt & W H Inmon, "Data Architecture: A Primer for the Data Scientist", 2014)

[corporate data model:] "A data model at the corporate level for the core business objects and their relationship with each other, which is based on the core business object model." (Boris Otto & Hubert Österle, "Corporate Data Quality", 2015)

[Enterprise data model:] "The enterprise data model in many literatures and viewpoints is considered to be a single, standalone unified artifact that describes all data entities and data attributes and their relationships across the enterprise. In most cases this model is combined with the ambition to consolidate all data in an enterprise data warehouse (EDW)." (Piethein Strengholt, "Data Management at Scale", 2020)

14 January 2019

🔬Data Science: Evolutionary Algorithm (Definitions)

"An Evolutionary Algorithm (EA) is a general class of fitting or maximization techniques. They all maintain a pool of structures or models that can be mutated and evolve. At every stage in the algorithm, each model is graded and the better models are allowed to reproduce or mutate for the next round. Some techniques allow the successful models to crossbreed. They are all motivated by the biologic process of evolution. Some techniques are asexual (so, there is no crossbreeding between techniques) while others are bisexual, allowing successful models to swap ''genetic' information. The asexual models allow a wide variety of different models to compete, while sexual methods require that the models share a common 'genetic' code." (William J Raynor Jr., "The International Dictionary of Artificial Intelligence", 1999)

"Meta-heuristic optimization approach inspired by natural evolution, which begins with potential solution models, then iteratively applies algorithms to find the fittest models from the set to serve as inputs to the next iteration, ultimately leading to a sub-optimal solution which is close to the optimal one." (Gilles Lebrun et al, "EA Multi-Model Selection for SVM", 2009)

"Evolutionary algorithms are search methods that can be used for solving optimization problems. They mimic working principles from natural evolution by employing a population–based approach, labeling each individual of the population with a fitness and including elements of random, albeit the random is directed through a selection process." (Ivan Zelinka & Hendrik Richter, "Evolutionary Algorithms for Chaos Researchers", Studies in Computational Intelligence Vol. 267, 2010)

"Population-based optimization algorithms in which each member of the population represents a candidate solution. In an iterative process the population members evolve and are then evaluated by a fitness function. Genetic Algorithms and Particle Swarm Optimization are examples of evolutionary algorithms." (Efstathios Kirkos, "Composite Classifiers for Bankruptcy Prediction", 2014)

"A collective term for all variants of (probabilistic) optimization and approximation algorithms that are inspired by Darwinian evolution. Optimal states are approximated by successive improvements based on the variation-selection paradigm. Thereby, the variation operators produce genetic diversity and the selection directs the evolutionary search." (Harish Garg, "A Hybrid GA-GSA Algorithm for Optimizing the Performance of an Industrial System by Utilizing Uncertain Data", 2015)

24 December 2018

🔭Data Science: Data Mining (Just the Quotes)

"Data mining is the efficient discovery of valuable, nonobvious information from a large collection of data. […] Data mining centers on the automated discovery of new facts and relationships in data. The idea is that the raw material is the business data, and the data mining algorithm is the excavator, sifting through the vast quantities of raw data looking for the valuable nuggets of business information." (Joseph P Bigus,"Data Mining with Neural Networks: Solving business problems from application development to decision support", 1996)

"Data mining is more of an art than a science. No one can tell you exactly how to choose columns to include in your data mining models. There are no hard and fast rules you can follow in deciding which columns either help or hinder the final model. For this reason, it is important that you understand how the data behaves before beginning to mine it. The best way to achieve this level of understanding is to see how the data is distributed across columns and how the different columns relate to one another. This is the process of exploring the data." (Seth Paul et al. "Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis", 2002)

"Things are changing. Statisticians now recognize that computer scientists are making novel contributions while computer scientists now recognize the generality of statistical theory and methodology. Clever data mining algorithms are more scalable than statisticians ever thought possible. Formal statistical theory is more pervasive than computer scientists had realized." (Larry A Wasserman, "All of Statistics: A concise course in statistical inference", 2004)

"Most mainstream data-mining techniques ignore the fact that real-world datasets are combinations of underlying data, and build single models from them. If such datasets can first be separated into the components that underlie them, we might expect that the quality of the models will improve significantly. (David Skillicorn, "Understanding Complex Datasets: Data Mining with Matrix Decompositions", 2007)

"The name ‘data mining’ derives from the metaphor of data as something that is large, contains far too much detail to be used as it is, but contains nuggets of useful information that can have value. So data mining can be defined as the extraction of the valuable information and actionable knowledge that is implicit in large amounts of data. (David Skillicorn, "Understanding Complex Datasets: Data Mining with Matrix Decompositions", 2007)

"Compared to traditional statistical studies, which are often hindsight, the field of data mining finds patterns and classifications that look toward and even predict the future. In summary, data mining can (1) provide a more complete understanding of data by finding patterns previously not seen and (2) make models that predict, thus enabling people to make better decisions, take action, and therefore mold future events." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"Traditional statistical studies use past information to determine a future state of a system (often called prediction), whereas data mining studies use past information to construct patterns based not solely on the input data, but also on the logical consequences of those data. This process is also called prediction, but it contains a vital element missing in statistical analysis: the ability to provide an orderly expression of what might be in the future, compared to what was in the past (based on the assumptions of the statistical method)." (Robert Nisbet et al, "Handbook of statistical analysis and data mining applications", 2009)

"The difference between human dynamics and data mining boils down to this: Data mining predicts our behaviors based on records of our patterns of activity; we don't even have to understand the origins of the patterns exploited by the algorithm. Students of human dynamics, on the other hand, seek to develop models and theories to explain why, when, and where we do the things we do with some regularity." (Albert-László Barabási, "Bursts: The Hidden Pattern Behind Everything We Do", 2010)

"Data mining is a craft. As with many crafts, there is a well-defined process that can help to increase the likelihood of a successful result. This process is a crucial conceptual tool for thinking about data science projects. [...] data mining is an exploratory undertaking closer to research and development than it is to engineering." (Foster Provost, "Data Science for Business", 2013)

"There is another important distinction pertaining to mining data: the difference between (1) mining the data to find patterns and build models, and (2) using the results of data mining. Students often confuse these two processes when studying data science, and managers sometimes confuse them when discussing business analytics. The use of data mining results should influence and inform the data mining process itself, but the two should be kept distinct." (Foster Provost & Tom Fawcett, "Data Science for Business", 2013)

"Unfortunately, creating an objective function that matches the true goal of the data mining is usually impossible, so data scientists often choose based on faith and experience." (Foster Provost, "Data Science for Business", 2013)

"Data Mining is the art and science of discovering useful innovative patterns from data. (Anil K. Maheshwari, "Business Intelligence and Data Mining", 2015)

"Machine learning takes many different forms and goes by many different names: pattern recognition, statistical modeling, data mining, knowledge discovery, predictive analytics, data science, adaptive systems, self-organizing systems, and more. Each of these is used by different communities and has different associations. Some have a long half-life, some less so." (Pedro Domingos, "The Master Algorithm", 2015)

"Today we routinely learn models with millions of parameters, enough to give each elephant in the world his own distinctive wiggle. It’s even been said that data mining means 'torturing the data until it confesses'." (Pedro Domingos, "The Master Algorithm", 2015)

"Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. 'Structure' has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Symmetries directly point to invariants, which pinpoint intrinsic properties of the data and of the background empirical domain of interest. As our data models change, so too do our perspectives on analysing data." (Fionn Murtagh, "Data Science Foundations: Geometry and Topology of Complex Hierarchic Systems and Big Data Analytics", 2018)

"The goal of data science is to improve decision making by basing decisions on insights extracted from large data sets. As a field of activity, data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. It is closely related to the fields of data mining and machine learning, but it is broader in scope." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

23 December 2018

🔭Data Science: Machine Learning (Just the Quotes)

"[…] an obvious difference between our best classifiers and human learning is the number of examples required in tasks such as object detection. […] the difficulty of a learning task depends on the size of the required hypothesis space. This complexity determines in turn how many training examples are needed to achieve a given level of generalization error. Thus the complexity of the hypothesis space sets the speed limit and the sample complexity for learning." (Tomaso Poggio & Steve Smale, "The Mathematics of Learning: Dealing with Data", Notices of the AMS, 2003)

"[…] learning techniques are similar to fitting a multivariate function to a certain number of measurement data. The key point, as we just mentioned, is that the fitting should be predictive in the same way that fitting experimental data from an experiment in physics can in principle uncover the underlying physical law, which is then used in a predictive way. In this sense, learning is also a principled method for distilling predictive and therefore scientific 'theories' from the data." (Tomaso Poggio & Steve Smale, "The Mathematics of Learning: Dealing with Data", Notices of the AMS, 2003)

"Much of machine learning is concerned with devising different models, and different algorithms to fit them. We can use methods such as cross validation to empirically choose the best method for our particular problem. However, there is no universally best model - this is sometimes called the no free lunch theorem. The reason for this is that a set of assumptions that works well in one domain may work poorly in another." (Kevin P Murphy, "Machine Learning: A Probabilistic Perspective", 2012)

"We have let ourselves become enchanted by big data only because we exoticize technology. We’re impressed with small feats accomplished by computers alone, but we ignore big achievements from complementarity because the human contribution makes them less uncanny. Watson, Deep Blue, and ever-better machine learning algorithms are cool. But the most valuable companies in the future won’t ask what problems can be solved with computers alone. Instead, they’ll ask: how can computers help humans solve hard problems?" (Peter Thiel & Blake Masters, "Zero to One: Notes on Startups, or How to Build the Future", 2014)

"A good proxy for complexity in a machine learning model is how fast it takes to train it." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"In machine learning, knowledge is often in the form of statistical models, because most knowledge is statistical [...] Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump." (Pedro Domingos, "The Master Algorithm", 2015)

"It is important to remember that predictive data analytics models built using machine learning techniques are tools that we can use to help make better decisions within an organization and are not an end in themselves. It is paramount that, when tasked with creating a predictive model, we fully understand the business problem that this model is being constructed to address and ensure that it does address it." (John D Kelleher et al, "Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, worked examples, and case studies", 2015)

"Learning theory claims that a machine learning algorithm can generalize well from a finite training set of examples. This seems to contradict some basic principles of logic. Inductive reasoning, or inferring general rules from a limited set of examples, is not logically valid. To logically infer a rule describing every member of a set, one must have information about every member of that set." (Ian Goodfellow et al, "Deep Learning", 2015)

"Machine learning is a science and requires an objective approach to problems. Just like the scientific method, test-driven development can aid in solving a problem. The reason that TDD and the scientific method are so similar is because of these three shared characteristics: Both propose that the solution is logical and valid. Both share results through documentation and work over time. Both work in feedback loops." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"Machine learning is the intersection between theoretically sound computer science and practically noisy data. Essentially, it’s about machines making sense out of data in much the same way that humans do." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"Machine learning is well suited for the unpredictable future, because most algorithms learn from new information. But as new information is found, it can also come in unstable forms, and new issues can arise that weren’t thought of before. We don’t know what we don’t know. When processing new information, it’s sometimes hard to tell whether our model is working." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"Machine learning takes many different forms and goes by many different names: pattern recognition, statistical modeling, data mining, knowledge discovery, predictive analytics, data science, adaptive systems, self-organizing systems, and more. Each of these is used by different communities and has different associations. Some have a long half-life, some less so." (Pedro Domingos, "The Master Algorithm", 2015)

"Precision and recall are ways of monitoring the power of the machine learning implementation. Precision is a metric that monitors the percentage of true positives. […] Recall is the ratio of true positives to true positive plus false negatives." (Matthew Kirk, "Thoughtful Machine Learning", 2015)

"Science’s predictions are more trustworthy, but they are limited to what we can systematically observe and tractably model. Big data and machine learning greatly expand that scope. Some everyday things can be predicted by the unaided mind, from catching a ball to carrying on a conversation. Some things, try as we might, are just unpredictable. For the vast middle ground between the two, there’s machine learning." (Pedro Domingos, "The Master Algorithm", 2015)

"The no free lunch theorem for machine learning states that, averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points. In other words, in some sense, no machine learning algorithm is universally any better than any other. The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class. [...] the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the 'real world' that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data generating distributions we care about." (Ian Goodfellow et al, "Deep Learning", 2015)

"The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task. We do so by building a set of preferences into the learning algorithm. When these preferences are aligned with the learning problems we ask the algorithm to solve, it performs better." (Ian Goodfellow et al, "Deep Learning", 2015)

"To make progress, every field of science needs to have data commensurate with the complexity of the phenomena it studies. [...] With big data and machine learning, you can understand much more complex phenomena than before. In most fields, scientists have traditionally used only very limited kinds of models, like linear regression, where the curve you fit to the data is always a straight line. Unfortunately, most phenomena in the world are nonlinear. [...] Machine learning opens up a vast new world of nonlinear models." (Pedro Domingos, "The Master Algorithm", 2015)

"Traditionally, the only way to get a computer to do something - from adding two numbers to flying an airplane - was to write down an algorithm explaining how, in painstaking detail. But machine-learning algorithms, also known as learners, are different: they figure it out on their own, by making inferences from data. And the more data they have, the better they get. Now we don’t have to program computers; they program themselves." (Pedro Domingos, "The Master Algorithm", 2015)

"In machine learning, a model is defined as a function, and we describe the learning function from the training data as inductive learning. Generalization refers to how well the concepts are learned by the model by applying them to data not seen before. The goal of a good machine-learning model is to reduce generalization errors and thus make good predictions on data that the model has never seen." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Machine learning is about making computers learn and perform tasks better based on past historical data. Learning is always based on observations from the data available. The emphasis is on making computers build mathematical models based on that learning and perform tasks automatically without the intervention of humans." (Umesh R Hodeghatta & Umesha Nayak, "Business Analytics Using R: A Practical Approach", 2017)

"Graphs can embed complex semantic representations in a compact form. As such, modeling data as networks of related entities is a powerful mechanism for analytics, both for visual analyses and machine learning. Part of this power comes from performance advantages of using a graph data structure, and the other part comes from an inherent human ability to intuitively interact with small networks." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"However, because ML algorithms are biased to look for different types of patterns, and because there is no one learning bias across all situations, there is no one best ML algorithm. In fact, a theorem known as the 'no free lunch theorem' states that there is no one best ML algorithm that on average outperforms all other algorithms across all possible data sets." (John D Kelleher & Brendan Tierney, "Data Science", 2018)

"Just as they did thirty years ago, machine learning programs (including those with deep neural networks) operate almost entirely in an associational mode. They are driven by a stream of observations to which they attempt to fit a function, in much the same way that a statistician tries to fit a line to a collection of points. Deep neural networks have added many more layers to the complexity of the fitted function, but raw data still drives the fitting process. They continue to improve in accuracy as more data are fitted, but they do not benefit from the 'super-evolutionary speedup'."  (Judea Pearl & Dana Mackenzie, "The Book of Why: The new science of cause and effect", 2018)

"Machine learning is often associated with the automation of decision making, but in practice, the process of constructing a predictive model generally requires a human in the loop. While computers are good at fast, accurate numerical computation, humans are instinctively and instantly able to identify patterns. The bridge between these two necessary skill sets lies in visualization - the precise and accurate rendering of data by a computer in visual terms and the immediate assignation of meaning to that data by humans." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"Quantum Machine Learning is defined as the branch of science and technology that is concerned with the application of quantum mechanical phenomena such as superposition, entanglement and tunneling for designing software and hardware to provide machines the ability to learn insights and patterns from data and the environment, and the ability to adapt automatically to changing situations with high precision, accuracy and speed." (Amit Ray, "Quantum Computing Algorithms for Artificial Intelligence", 2018)

"Quantum machine learning promises to discover the optimal network topologies and hyperparameters automatically without human intervention." (Amit Ray, "Quantum Computing Algorithms for Artificial Intelligence", 2018)

"The beauty of quantum machine learning is that we do not need to depend on an algorithm like gradient descent or convex objective function. The objective function can be nonconvex or something else." (Amit Ray, "Quantum Computing Algorithms for Artificial Intelligence", 2018)

"The premise of classification is simple: given a categorical target variable, learn patterns that exist between instances composed of independent variables and their relationship to the target. Because the target is given ahead of time, classification is said to be supervised machine learning because a model can be trained to minimize error between predicted and actual categories in the training data. Once a classification model is fit, it assigns categorical labels to new instances based on the patterns detected during training." (Benjamin Bengfort et al, "Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning", 2018)

"A recurring theme in machine learning is combining predictions across multiple models. There are techniques called bagging and boosting which seek to tweak the data and fit many estimates to it. Averaging across these can give a better prediction than any one model on its own. But here a serious problem arises: it is then very hard to explain what the model is (often referred to as a 'black box'). It is now a mixture of many, perhaps a thousand or more, models." (Robert Grant, "Data Visualization: Charts, Maps and Interactive Graphics", 2019)

"Machines are not good at asking questions or even knowing what questions to ask. They are much better at answering them, provided the question is stated in a way that the computer can comprehend. Present-day machine learning algorithms partner with people much like a bloodhound works with its trainer: the dog's sense of smell may be many times stronger than its master's, but without being carefully directed, the hound may end up chasing its tail." (Brett Lantz, "Machine Learning with R", 2019)

"In an era of machine learning, where data is likely to be used to train AI, getting quality and governance under control is a business imperative. Failing to govern data surfaces problems late, often at the point closest to users (for example, by giving harmful guidance), and hinders explainability (garbage data in, machine-learned garbage out)." (Jesús Barrasa et al, "Knowledge Graphs: Data in Context for Responsive Businesses", 2021)

"Machine learning bias is typically understood as a source of learning error, a technical problem. […] Machine learning bias can introduce error simply because the system doesn’t 'look' for certain solutions in the first place. But bias is actually necessary in machine learning - it’s part of learning itself." (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

"People who assume that extensions of modern machine learning methods like deep learning will somehow 'train up', or learn to be intelligent like humans, do not understand the fundamental limitations that are already known. Admitting the necessity of supplying a bias to learning systems is tantamount to Turing’s observing that insights about mathematics must be supplied by human minds from outside formal methods, since machine learning bias is determined, prior to learning, by human designers." (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

"To accomplish their goals, what are now called machine learning systems must each learn something specific. Researchers call this giving the machine a 'bias'. […] A bias in machine learning means that the system is designed and tuned to learn something. But this is, of course, just the problem of producing narrow problem-solving applications." (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

"[...] the focus on Big Data AI seems to be an excuse to put forth a number of vague and hand-waving theories, where the actual details and the ultimate success of neuroscience is handed over to quasi- mythological claims about the powers of large datasets and inductive computation. Where humans fail to illuminate a complicated domain with testable theory, machine learning and big data supposedly can step in and render traditional concerns about finding robust theories. This seems to be the logic of Data Brain efforts today. (Erik J Larson, "The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do", 2021)

18 December 2018

🔭Data Science: Context (Just the Quotes)

"Some of the common ways of producing a false statistical argument are to quote figures without their context, omitting the cautions as to their incompleteness, or to apply them to a group of phenomena quite different to that to which they in reality relate; to take these estimates referring to only part of a group as complete; to enumerate the events favorable to an argument, omitting the other side; and to argue hastily from effect to cause, this last error being the one most often fathered on to statistics. For all these elementary mistakes in logic, statistics is held responsible." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"When evaluating the reliability and generality of data, it is often important to know the aims of the experimenter. When evaluating the importance of experimental results, however, science has a trick of disregarding the experimenter's rationale and finding a more appropriate context for the data than the one he proposed." (Murray Sidman, "Tactics of Scientific Research", 1960)

"Data in isolation are meaningless, a collection of numbers. Only in context of a theory do they assume significance […]" (George Greenstein, "Frozen Star" , 1983)

"Graphics must not quote data out of context." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"The problem solver needs to stand back and examine problem contexts in the light of different 'Ws' (Weltanschauungen). Perhaps he can then decide which 'W' seems to capture the essence of the particular problem context he is faced with. This whole process needs formalizing if it is to be carried out successfully. The problem solver needs to be aware of different paradigms in the social sciences, and he must be prepared to view the problem context through each of these paradigms." (Michael C Jackson, "Towards a System of Systems Methodologies", 1984)

"It is commonly said that a pattern, however it is written, has four essential parts: a statement of the context where the pattern is useful, the problem that the pattern addresses, the forces that play in forming a solution, and the solution that resolves those forces. [...] it supports the definition of a pattern as 'a solution to a problem in a context', a definition that [unfortunately] fixes the bounds of the pattern to a single problem-solution pair." (Martin Fowler, "Analysis Patterns: Reusable Object Models", 1997)

"We do not learn much from looking at a model - we learn more from building the model and manipulating it. Just as one needs to use or observe the use of a hammer in order to really understand its function, similarly, models have to be used before they will give up their secrets. In this sense, they have the quality of a technology - the power of the model only becomes apparent in the context of its use." (Margaret Morrison & Mary S Morgan, "Models as mediating instruments", 1999)

"Data are collected as a basis for action. Yet before anyone can use data as a basis for action the data have to be interpreted. The proper interpretation of data will require that the data be presented in context, and that the analysis technique used will filter out the noise."  (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"[…] you simply cannot make sense of any number without a contextual basis. Yet the traditional attempts to provide this contextual basis are often flawed in their execution. [...] Data have no meaning apart from their context. Data presented without a context are effectively rendered meaningless." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"All scientific theories, even those in the physical sciences, are developed in a particular cultural context. Although the context may help to explain the persistence of a theory in the face of apparently falsifying evidence, the fact that a theory arises from a particular context is not sufficient to condemn it. Theories and paradigms must be accepted, modified or rejected on the basis of evidence." (Richard P Bentall,  "Madness Explained: Psychosis and Human Nature", 2003)

"Mathematical modeling is as much ‘art’ as ‘science’: it requires the practitioner to (i) identify a so-called ‘real world’ problem (whatever the context may be); (ii) formulate it in mathematical terms (the ‘word problem’ so beloved of undergraduates); (iii) solve the problem thus formulated (if possible; perhaps approximate solutions will suffice, especially if the complete problem is intractable); and (iv) interpret the solution in the context of the original problem." (John A Adam, "Mathematics in Nature", 2003)

"Context is not as simple as being in a different space [...] context includes elements like our emotions, recent experiences, beliefs, and the surrounding environment - each element possesses attributes, that when considered in a certain light, informs what is possible in the discussion." (George Siemens, "Knowing Knowledge", 2006)

"Statistics can certainly pronounce a fact, but they cannot explain it without an underlying context, or theory. Numbers have an unfortunate tendency to supersede other types of knowing. […] Numbers give the illusion of presenting more truth and precision than they are capable of providing." (Ronald J Baker, "Measure what Matters to Customers: Using Key Predictive Indicators", 2006)

"A valid digit is not necessarily a significant digit. The significance of numbers is a result of its scientific context." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[… ] statistics is about understanding the role that variability plays in drawing conclusions based on data. […] Statistics is not about numbers; it is about data - numbers in context. It is the context that makes a problem meaningful and something worth considering." (Roxy Peck et al, "Introduction to Statistics and Data Analysis" 4th Ed., 2012)

"Context (information that lends to better understanding the who, what, when, where, and why of your data) can make the data clearer for readers and point them in the right direction. At the least, it can remind you what a graph is about when you come back to it a few months later. […] Context helps readers relate to and understand the data in a visualization better. It provides a sense of scale and strengthens the connection between abstract geometry and colors to the real world." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Readability in visualization helps people interpret data and make conclusions about what the data has to say. Embed charts in reports or surround them with text, and you can explain results in detail. However, take a visualization out of a report or disconnect it from text that provides context (as is common when people share graphics online), and the data might lose its meaning; or worse, others might misinterpret what you tried to show." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"The data is a simplification - an abstraction - of the real world. So when you visualize data, you visualize an abstraction of the world, or at least some tiny facet of it. Visualization is an abstraction of data, so in the end, you end up with an abstraction of an abstraction, which creates an interesting challenge. […] Just like what it represents, data can be complex with variability and uncertainty, but consider it all in the right context, and it starts to make sense." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Without context, data is useless, and any visualization you create with it will also be useless. Using data without knowing anything about it, other than the values themselves, is like hearing an abridged quote secondhand and then citing it as a main discussion point in an essay. It might be okay, but you risk finding out later that the speaker meant the opposite of what you thought." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Statistics are meaningless unless they exist in some context. One reason why the indicators have become more central and potent over time is that the longer they have been kept, the easier it is to find useful patterns and points of reference." (Zachary Karabell, "The Leading Indicators: A short history of the numbers that rule our world", 2014)

"The term data, unlike the related terms facts and evidence, does not connote truth. Data is descriptive, but data can be erroneous. We tend to distinguish data from information. Data is a primitive or atomic state (as in ‘raw data’). It becomes information only when it is presented in context, in a way that informs. This progression from data to information is not the only direction in which the relationship flows, however; information can also be broken down into pieces, stripped of context, and stored as data. This is the case with most of the data that’s stored in computer systems. Data that’s collected and stored directly by machines, such as sensors, becomes information only when it’s reconnected to its context." (Stephen Few, "Signal: Understanding What Matters in a World of Noise", 2015)

"Infographics combine art and science to produce something that is not unlike a dashboard. The main difference from a dashboard is the subjective data and the narrative or story, which enhances the data-driven visual and engages the audience quickly through highlighting the required context." (Travis Murphy, "Infographics Powered by SAS®: Data Visualization Techniques for Business Reporting", 2018)

"For numbers to be transparent, they must be placed in an appropriate context. Numbers must presented in a way that allows for fair comparisons." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Without knowing the source and context, a particular statistic is worth little. Yet numbers and statistics appear rigorous and reliable simply by virtue of being quantitative, and have a tendency to spread." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

More quotes on "Context" at the-web-of-knowledge.blogspot.com


Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.