Showing posts with label abstraction. Show all posts
Showing posts with label abstraction. Show all posts

23 February 2025

💎🏭SQL Reloaded: Microsoft Fabric's SQL Databases (Part IX: From OLTP to OLAP Data Models)

With SQL databases Microsoft brought OLTP to Microsoft Fabric which allows addressing a wider range of requirements, though this involves also some challenges that usually are addressed by the transition from the OLTP to OLAP architectures. Typically, there's an abstraction layer that is built on top of the OLTP data models that allows to address the various OLAP requirements. As soon as OLTP and OLAP models are mixed together, this opens the door to design and data quality issues that have impact on the adoption of solutions by users. Probably, those who worked with MS Access or even MS Excel directly or in combination with SQL Server can still remember the issues they run into.

Ideally, it should be a separation layer between OLTP and the OLAP data. This can be easily achieved in SQL databases by using two different schemas that mimic the interaction between the two types of architectures. So, supposing that the dbo schema from the SalesLT is the data as maintain by the OLTP layer, one can add an additional schema Test in which the OLAP logic is modelled. This scenario is not ideal, though it allows to model the two aspects of the topic considered. The following steps are to be performed in the environment in which the SalesLT database was created. 

Independently in which layer one works, it's ideal to create a set of views that abstracts the logic and ideally simplifies the processing of data. So, in a first step it's recommended to abstract the data from the source by creating a set of views like the one below:

-- drop view (cleaning)
-- DROP VIEW IF EXISTS SalesLT.vCustomerLocations 

-- create view
CREATE VIEW SalesLT.vCustomerLocations
-- Customers with main office
AS
SELECT CST.CustomerId 
, CSA.AddressID
, CST.Title
, CST.FirstName 
, IsNull(CST.MiddleName, '') MiddleName
, CST.LastName 
, CST.CompanyName 
, CST.SalesPerson 
, IsNull(CSA.AddressType, '') AddressType
, IsNull(ADR.City, '') City
, IsNull(ADR.StateProvince, '') StateProvince
, IsNull(ADR.CountryRegion, '') CountryRegion
, IsNull(ADR.PostalCode, '') PostalCode
FROM SalesLT.Customer CST
	 LEFT JOIN SalesLT.CustomerAddress CSA
	   ON CST.CustomerID = CSA.CustomerID
	  AND CSA.AddressType = 'Main Office'
	 	LEFT JOIN SalesLT.Address ADR
		  ON CSA.AddressID = ADR.AddressID

The view uses LEFT instead of FULL joins because this allows more flexibility, respectively identifying the gaps existing between entities (e.g. customers without addresses). In these abstractions, the number of transformations is kept to a minimum to reflect the data as reflected by the source. It may be chosen to minimize the occurrence of NULL values as this simplifies the logic for comparisons (see the use of IsNull).

Once the abstraction from the OLTP layer was built, one can make the data available in the OLAP layer:

-- create schema
CREATE SCHEMA Test

-- dropping the target table (for cleaning)
-- DROP TABLE IF EXISTS Test.CustomerLocations

-- Option 1
-- create the table on the fly
SELECT *
INTO Test.CustomerLocations
FROM SalesLT.vCustomerLocations

-- Option 2
-- create the table manually (alternative to precedent step
CREATE TABLE [Test].[CustomerLocations](
	[CustomerId] [int] NOT NULL,
	[AddressID] [int] NULL,
	[Title] [nvarchar](8) NULL,
	[FirstName] [nvarchar](50) NULL,
	[MiddleName] [nvarchar](50) NULL,
	[LastName] [nvarchar](50) NULL,
	[CompanyName] [nvarchar](128) NULL,
	[SalesPerson] [nvarchar](256) NULL,
	[AddressType] [nvarchar](50) NULL,
	[City] [nvarchar](30) NULL,
	[StateProvince] [nvarchar](50) NULL,
	[CountryRegion] [nvarchar](50) NULL,
	[PostalCode] [nvarchar](15) NULL
) ON [PRIMARY]
GO

-- insert records
INSERT INTO Test.CustomerLocations
SELECT *
FROM SalesLT.vCustomerLocations


-- checking the output (both scenarios)
SELECT top 100 *
FROM Test.CustomerLocations


-- drop the view (for cleaning)
-- DROP VIEW IF EXISTS Test.vCustomerLocations

-- create view
CREATE VIEW Test.vCustomerLocations
-- Customer locations
AS
SELECT CSL.CustomerId 
, CSL.AddressID
, CSL.Title
, CSL.FirstName 
, CSL.MiddleName 
, CSL.LastName 
, Concat(CSL.FirstName, ' ' + CSL.MiddleName, ' ', CSL.LastName) FullName
, CSL.CompanyName 
, CSL.SalesPerson 
, CSL.AddressType
, CSL.City
, CSL.StateProvince
, CSL.CountryRegion 
, CSL.PostalCode
FROM Test.CustomerLocations CSL

-- test the view
SELECT top 100 *
FROM Test.vCustomerLocations

Further on, one can create additional objects as required. Usually, a set of well-designed views is enough, offering the needed flexibility with a minimum of code duplication. In addition, one can build stored procedures and table-valued functions as needed:

-- drop the function (for cleaning)
-- DROP FUNCTION IF EXISTS Test.tvfGetCustomerAddresses

-- generated template - function
CREATE FUNCTION Test.tvfGetCustomerAddresses (
    @CountryRegion nvarchar(50) NULL,
    @StateProvince nvarchar(50) NULL
)
RETURNS TABLE
-- Customers by Country & State province
AS
RETURN (
SELECT CSL.CustomerId 
, CSL.AddressID
, CSL.Title
, CSL.FirstName 
, CSL.MiddleName 
, CSL.LastName 
, CSL.FullName
, CSL.CompanyName 
, CSL.SalesPerson 
, CSL.AddressType 
, CSL.City
, CSL.StateProvince 
, CSL.CountryRegion 
, CSL.PostalCode
FROM Test.vCustomerLocations CSL
WHERE CSL.CountryRegion = IsNull(@CountryRegion, CSL.CountryRegion)
  AND CSL.StateProvince = IsNull(@StateProvince, CSL.StateProvince)
);

-- retrieving all records
SELECT *
FROM Test.tvfGetCustomerAddresses(NULL, NULL)

-- providing parameters
SELECT *
FROM Test.tvfGetCustomerAddresses('United States', 'Utah')

-- filtering on non-parametrized volumns
SELECT *
FROM Test.tvfGetCustomerAddresses('United States', 'Utah')
WHERE City = 'Salt Lake City'



-- drop the procedure (for cleaning)
-- DROP PROCEDURE IF EXISTS Test.spGetCustomerAddresses 

-- generated template - stored procedure
CREATE PROCEDURE Test.spGetCustomerAddresses (
    @CountryRegion nvarchar(50) NULL,
    @StateProvince nvarchar(50) NULL
)
-- Customers by Country & State province
AS
BEGIN
	SELECT CSL.CustomerId 
	, CSL.AddressID
	, CSL.Title
	, CSL.FirstName 
	, CSL.MiddleName 
	, CSL.LastName 
	, CSL.FullName
	, CSL.CompanyName 
	, CSL.SalesPerson 
	, CSL.AddressType 
	, CSL.City
	, CSL.StateProvince 
	, CSL.CountryRegion 
	, CSL.PostalCode
	FROM Test.vCustomerLocations CSL
	WHERE CSL.CountryRegion = IsNull(@CountryRegion, CSL.CountryRegion)
	AND CSL.StateProvince = IsNull(@StateProvince, CSL.StateProvince)
END 

-- retrieving all records
EXEC Test.spGetCustomerAddresses NULL, NULL

-- providing parameters
 EXEC Test.spGetCustomerAddresses 'United States', 'Utah'

These steps can repeated for each entity in scope.

This separation between OLTP and OLAP is usually necessary given that business processes need a certain amount of time until they are correctly reflected as per reporting needs. Otherwise, the gaps can negatively impact the quality of data used for reporting. For some reports these deviation might be acceptable, though there will be probably also (many) exceptions. Independently of the solution used, it's still needed to make sure that the data are appropriate for the processes and reporting. 

If no physical separation is needed between the two types of layers, one can remove the persisted tables from the logic and keep the objects as they are.

Independently of which architecture is chosen, one shouldn't forget to validate one's presumptions in what concerns the data model (e.g. customers without addresses, address types, etc.).

Previous Post <<||>> Next Post

07 August 2024

🧭Business Intelligence: Perspectives (Part XII: From Data to Data Models)

Business Intelligence Series
Business Intelligence Series

A data model can be defined as an abstract, self-contained, logical definition of the data structures available in a database or similar repositories. It’s typically an abstraction of the data structures underpinning a set of processes, procedures and business logic used for a predefined purpose. A data model can be formed also of unrelated micromodels, depicting thus various aspects of a business. 

The association between data and data models is bidirectional. Given a set of data, a data model can be built to underpin the respective data. Conversely, one can create or generate data based on a data model. However, in business setups a bidirectional relationship between data and the data model(s) underpinning them is more realistic as the business evolves. In extremis, the data model can be used to reflect a business’ needs, at least when the respective needs are addressed accordingly by extending the data model(s).

Given a set of data (e.g. the data stored in one or more spreadsheets or other type of files) there can be defined in theory multiple data models to reflect the respective data. Within a data model, the fields (aka attributes) are partitioned into a set of data entities, where a data entity is thus a nonunique grouping of attributes that attempt to define together one unitary aspect of the world. Customers, Vendors, Products, Invoices or Sales Orders are examples of such data entities, though entities can have a broader granularity (e.g. Customers can be modeled over several tables like Entity, Addresses, Contact information, etc.). 

From an operational database’s perspective, a data entity is based on one or more tables, though several entities can share some of the tables. From a BI artifact’s perspective, an entity should be easy to create from the underlying tables, with a minimal set of transformations. Ideally, the BI data model should be as close as possible to the needed entity for reporting, however an optimal solution lies usually somewhere in between. In this resides the complexity of modeling BI solutions – providing an optimal data model which can be easily built on the source tables, and which allows addressing all or at least most of the BI requirements.

In other words, we deal with two optimization problems of two distinct data models. On one side the business data model must be flexible enough to provide fast read/write operations while keeping the referential data’s granularity efficient. Conversely, a BI data model needs to abstract these entities and provide a fast way of processing the data, while making data reads extremely efficient. These perspectives must apply when we move to Microsoft Fabric too. 

The operational data layer must provide this abstraction, and in this resides the complexity of building optimal BI solutions. This is the layer at which the modeling problems need to be tackled. The challenge of BI and Analytics resides in finding an optimal data model that allows us to address most or ideally all the BI requirements. Several overlapping layers of abstraction may be built in the process.

Looking at the data modeling techniques used in notebooks and other similar solutions, data modeling has the chance of becoming a redundant practice prone to errors. Moreover, data models have a tendency of being multilayered and of being based on certain perspectives into the processes they model. Providing reliable flexible models involves finding the right view into the data for modeling aspects of the business. Database views allow us to easily model such perspectives, often in a unique way. Moving away from them just shifts the burden on the multiple solutions built around the base data, which can create other important challenges. 

Previous Post <<||>>  Next Post

07 May 2024

🏭🗒️Microsoft Fabric: The Metrics Layer [Notes] [new feature]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 07-May-2024

The Metrics Layer in Microsoft Fabric (adapted diagram)
The Metrics Layer in Microsoft Fabric (adapted diagram)

[new feature] Metrics Layer (Metrics Store)

  • {definition}an abstraction layer available between the data store(s) and end users which allows organizations to create standardized business metrics, that are rooted in measures and are discoverable and intended for reuse
    • ⇐ {important} feature still in private preview 
  • {goal} extend existing infrastructure 
    • {benefit} leverages and extends existing features
  • {goal} provide consistent definitions and descriptions [1]
    • consistent definitions that include besides business logic additional dimensions and filters [1]
    • ⇒ {benefit} allows to standardize the metrics across the organization
    • ⇒ {benefit} enforce to enforce a SSoT
  • {goal} easy management 
    • via management views 
    • [feature] lineage 
    • [feature] source control
    • [feature] duplicate identification
    • [feature] push updates to downstream uses of the metrics 
  • {goal}searchable and discoverable metrics 
    • {feature} integration
      • based on Sempy fabric package
        • ⇐ a dataframe for storage and propagation of Power BI metadata which is part of the python-based semantic Link in Fabric
  • {goal}trust
    • [feature] trust indicators
    • {benefit} facilitates report's adoption
  • {feature} metric set 
    • {definition} a Fabric item that groups together a set of metrics into a mini-model
    • {benefit} allows to reduce the overall complexity of semantic models, while being easy to evolve and consume
    • associated with a single domain
      • ⇒ supports the data mesh architecture
    • shareable 
      • can be shared with other users
    • {action} create metric set
      • creates the actual artifact, to which metrics can be added 
  • {feature} metric
    •  {definition} a way to elevate the measures from the various semantic models existing in the organization
    • tied to the original semantic model
      • ⇒ {benefit} allows to see how a metric is used across the solutions 
    • reusable
      • can be reused in other fabric artifacts
        • new reports on the Power BI service
        • notebooks 
          • by copying the code
      • can be reused in Power BI
        • via OneLake data hub menu element
      • can be chained 
        • changes are propagated downstream 
    • materializable 
      • its output can be persisted to OneLake by saving it a delta table into a lakehouse
      • {misuse} data is persisted unnecessarily
    • {action} elevate metric
      • copies measure's definition and description
      • ⇒  implies restructuring, refactoring, moving, and testing a lot of code in the process
      • {misuse} data professionals build everything as metrics
    • {action} update metric
    • {action} add filters to metric 
    • {action} add dimensions to metric
    • {action} materialize metric 

Acronyms:
SSoT - single source of truth ()

References:
[1] Power BI Tips (2024) Explicit Measures Ep. 236: Metrics Hub, Hot New Feature with Carly Newsome (link)
[2] Power BI Tips (2024) Introducing Fabric Metrics Layer / Power Metrics Hub [with Carly Newsome] (link)

10 April 2024

🧭Business Intelligence: Perspectives (Part XI: Ways of Thinking about Data)

Business Intelligence Series

One can observe sometimes the tendency of data professionals to move from a business problem directly to data and data modeling without trying to understand the processes behind the data. One could say that the behavior is driven by the eagerness of exploring the data, though even later there are seldom questions considered about the processes themselves. One can argue that maybe the processes are self-explanatory, though that’s seldom the case. 

Conversely, looking at the datasets available on the web, usually there’s a fact table and the associated dimensions, the data describing only one process. It’s natural to presume that there are data professionals who don’t think much about, or better said in terms of processes. A similar big jump can be observed in blog posts on dashboards and/or reports, bloggers moving from the data directly to the data model. 

In the world of complex systems like Enterprise Resource Planning (ERP) systems thinking in terms of processes is mandatory because a fact table can hold the data for different processes, while processes can span over multiple fact-like tables, and have thus multiple levels of detail. Moreover, processes are broken down into sub-processes and procedures that have a counterpart in the data as well. 

Moreover, within a process there can be multiple perspectives that are usually module or role dependent. A perspective is a role’s orientation to the word for which the data belongs to, and it’s slightly different from what the data professional considers as view, the perspective being a projection over a set of processes within the data, while a view is a projection of the perspectives into the data structure. 

For example, considering the order-to-cash process there are several sub-processes like order fulfillment, invoicing, and payment collection, though there can be several other processes involved like credit management or production and manufacturing. Creating, respectively updating, or canceling an order can be examples of procedures. 

The sales representative, the shop worker and the accountant will have different perspectives projected into the data, focusing on the projection of the data on the modules they work with. Thinking in terms of modules is probably the easiest way to identify the boundaries of the perspectives, though the rules are occasionally more complex than this.

When defining and/or attempting to understand a problem it’s important to understand which perspective needs to be considered. For example, the sales volume can be projected based on Sales orders or on invoiced Sales orders, respectively on the General ledger postings, and the three views can result in different numbers. Moreover, there are partitions within these perspectives based on business rules that determine what to include or exclude from the logic. 

One can define a business rule as a set of conditional logic that constraints some part of the data in the data structures by specifying what is allowed or not, though usually we refer to a special type called selection business rule that determines what data are selected (e.g. open Purchase orders, Products with Inventory, etc.). However, when building the data model we need to consider business rules as well, though we might need to check whether they are enforced as well. 

Moreover, it’s useful to think also in terms of (data) entities and sub-entities, in which the data entity is an abstraction from the physical implementation of database tables. A data entity encapsulates (hides internal details) a business concept and/or perspective into an abstraction (simplified representation) that makes development, integration, and data processing easier. In certain systems like Dynamics 365 is important to think at this level because data entities can simplify data modelling considerably.

Previous Post <<||>>  Next Post

21 May 2020

💼Project Management: Project Planning (Part III: Planning Correctly Misunderstood III)

Mismanagement

One of the most misunderstood topics in Project Management seems to be the one of planning, and this probably because everyone has a good idea of what it means to plan an activity – we do it daily and most of the times (hopefully) we hit a bull’s-eye (or we have the impression we did that). You must do this and that, you have that dependency, you must coordinate with a few people, you must first reach that milestone before going further, you do one step at a time, and so on. It’s pretty easy, isn’t it?

From a bird’s eyes view project planning is like planning every other activity though there are several important differences. The most important one is of scale – the number of activities and resources involved, the level of coordination and communication, as well the quality with which occur, the level of uncertainty and control, respectively manageability. All these create a complexity that is hardly manageable by just one person. 

Another difference is the detail needed for the planning and targets’ reachability. Some believe that the plan needs to be done down to the lowest level of detail, which even if possible can prove to be an impediment to planning. Projects’ environment share some important characteristics with a battle field in terms of complexity of interactions, their dynamics and logistical requirements. Within an army’s structure there are levels of organization that require different mindsets and levels of planning. A general thinks primarily at strategic level in which troops and actions are seen as aggregations at the needed level of abstraction that makes their organization and planning manageable. The strategy is done however in collaboration with other generals and upper structures, while having defined the strategic goals the general must devise together with the immediate subalterns the tactics. In theory the project manager must regard the project from the same perspective. Results thus three levels of planning – strategic, done with the upper management, tactical done with the team members, respectively logistical, done within the team. That’s a way of breaking the complexity and dividing the responsibilities within the project. 

Projects’ final destination seem to have the character of a wish list more or less anchored in reality. From a technical point the target can be achievable though in big projects the most important challenges are of organizational nature – of being able to allocate and coordinate effectively the resources as needed by the project. The wish-like character is reflected also by the cost, scope, time triangle in respect to the expected quality – to some point in time one is forced to choose between two of them. On the other side, there’s the tendency to see the targets and milestones as fixed, with little room for deviation. One can easily forget that a strategic plan’s purpose is to set the objectives, identify the challenges and the possible lines of action, while a tactical plan’s objective is to devise the means to reach the objectives. Bringing everything together can easily obscure the view and, in extremis, the plan loses its actuality as soon was created (and approved). 

The most confusing aspect is probably the adherence of a plan to a given methodology, one dicing a project and thus a plan to fit a methodology by following blindly the rules and principles imposed by it instead of fitting the methodology to a project. Besides the fact that the methodologies are best practices but not necessarily good practices, what fits for an organization, they tend to be either too general, by specifying the what and not the how, or too restrictive (interpreted). 

12 May 2019

#️⃣Software Engineering: Programming (Part XII: Misconceptions about Programming - Part I)

Software Engineering
Software Engineering Series

Besides equating the programming process with a programmer’s capabilities, minimizing the importance of programming and programmers’ skills in the whole process (see previous post), there are several other misconceptions about programming that influence process' outcomes.


Having a deep knowledge of a programming language allows programmers to easily approach other programming languages, however each language has its own learning curve ranging from a few weeks to half of year or more. The learning curve is dependent on the complexity of the languages known and the language to be learned, same applying to frameworks and architectures, the scenarios in which the languages are used, etc. One unrealistic expectation is that the programmers are capablle of learning a new programming language or framework overnight, this expectation pushing more pressure on programmers’ shoulders as they need to compensate in a short time for the knowledge gap. No, the programming languages are not the same even if there’s high resemblance between them!

There’s lot of code available online, many of the programming tasks involve writing similar code. This makes people assume that programming can resume to copy-paste activities and, in extremis, that there’s no creativity into the act of programming. Beside the fact that using others’ code comes with certain copyright limitations, copy-pasting code is in general a way of introducing bugs in software. One can learn a lot from others’ code, though programmers' challenge resides in writing better code, in reusing code while finding the right the level of abstraction.  
 
There’s the tendency on the market to build whole applications using wizard-like functionality and of generating source-code based on data or ontological models. Such approaches work in a range of (limited) scenarios, and even if the trend is to automate as much in the process, is not what programming is about. Each such tool comes with its own limitations that sooner or later will push back. Changing the code in order to build new functionality or to optimize the code is often not a feasible solution as it imposes further limitations.

Programming is not only about writing code. It involves also problem-solving abilities, having a certain understanding about the business processes, in which the conceptual creativity and ingenuity of design can prove to be a good asset. Modelling and implementing processes help programmers gain a unique perspective within a business.

For a programmer the learning process never stops. The release cycle for the known tools becomes smaller, each release bringing a new set of functionalities. Moreover, there are always new frameworks, environments, architectures and methodologies to learn. There’s a considerable amount of effort in expanding one's (necessary) knowledge, effort usually not planned in projects or outside of them. Trainings help in the process, though they hardly scratch the surface. Often the programmer is forced to fill the knowledge gap in his free time. This adds up to the volume of overtime one must do on projects. On the long run it becomes challenging to find the needed time for learning.

In resource planning there’s the tendency to add or replace resources on projects, while neglecting the influence this might have on a project and its timeline. Each new resource needs some time to accommodate himself on the role, to understand project requirements, to take over the work of another. Moreover, resources are replaced on project with a minimal or even without the knowledge transfer necessary for the job ahead. Unfortunately, same behavior occurs in consultancy as well, consultants being moved from one known functional area into another unknown area, changing the resources like the engines of different types of car, expecting that everything will work as magic.



21 December 2018

🔭Data Science: Estimation (Just the Quotes)

"The scientific value of a theory of this kind, in which we make so many assumptions, and introduce so many adjustable constants, cannot be estimated merely by its numerical agreement with certain sets of experiments. If it has any value it is because it enables us to form a mental image of what takes place in a piece of iron during magnetization." (James C Maxwell, "Treatise on Electricity and Magnetism" Vol. II, 1873)

"It [probability] is the very guide of life, and hardly can we take a step or make a decision of any kind without correctly or incorrectly making an estimation of probabilities." (William S Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1874)

"A statistical estimate may be good or bad, accurate or the reverse; but in almost all cases it is likely to be more accurate than a casual observer’s impression, and the nature of things can only be disproved by statistical methods." (Arthur L Bowley, "Elements of Statistics", 1901)

"Great numbers are not counted correctly to a unit, they are estimated; and we might perhaps point to this as a division between arithmetic and statistics, that whereas arithmetic attains exactness, statistics deals with estimates, sometimes very accurate, and very often sufficiently so for their purpose, but never mathematically exact." (Arthur L Bowley, "Elements of Statistics", 1901)

“Some of the common ways of producing a false statistical argument are to quote figures without their context, omitting the cautions as to their incompleteness, or to apply them to a group of phenomena quite different to that to which they in reality relate; to take these estimates referring to only part of a group as complete; to enumerate the events favorable to an argument, omitting the other side; and to argue hastily from effect to cause, this last error being the one most often fathered on to statistics. For all these elementary mistakes in logic, statistics is held responsible.” (Sir Arthur L Bowley, “Elements of Statistics”, 1901)

"A good estimator will be unbiased and will converge more and more closely (in the long run) on the true value as the sample size increases. Such estimators are known as consistent. But consistency is not all we can ask of an estimator. In estimating the central tendency of a distribution, we are not confined to using the arithmetic mean; we might just as well use the median. Given a choice of possible estimators, all consistent in the sense just defined, we can see whether there is anything which recommends the choice of one rather than another. The thing which at once suggests itself is the sampling variance of the different estimators, since an estimator with a small sampling variance will be less likely to differ from the true value by a large amount than an estimator whose sampling variance is large." (Michael J Moroney, "Facts from Figures", 1951)

"The enthusiastic use of statistics to prove one side of a case is not open to criticism providing the work is honestly and accurately done, and providing the conclusions are not broader than indicated by the data. This type of work must not be confused with the unfair and dishonest use of both accurate and inaccurate data, which too commonly occurs in business. Dishonest statistical work usually takes the form of: (1) deliberate misinterpretation of data; (2) intentional making of overestimates or underestimates; and (3) biasing results by using partial data, making biased surveys, or using wrong statistical methods." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1951)

"Statistics is the fundamental and most important part of inductive logic. It is both an art and a science, and it deals with the collection, the tabulation, the analysis and interpretation of quantitative and qualitative measurements. It is concerned with the classifying and determining of actual attributes as well as the making of estimates and the testing of various hypotheses by which probable, or expected, values are obtained. It is one of the means of carrying on scientific research in order to ascertain the laws of behavior of things - be they animate or inanimate. Statistics is the technique of the Scientific Method." (Bruce D Greenschields & Frank M Weida, "Statistics with Applications to Highway Traffic Analyses", 1952)

"We realize that if someone just 'grabs a handful', the individuals in the handful almost always resemble one another (on the average) more than do the members of a simple random sample. Even if the 'grabs' [sampling] are randomly spread around so that every individual has an equal chance of entering the sample, there are difficulties. Since the individuals of grab samples resemble one another more than do individuals of random samples, it follows (by a simple mathematical argument) that the means of grab samples resemble one another less than the means of random samples of the same size. From a grab sample, therefore, we tend to underestimate the variability in the population, although we should have to overestimate it in order to obtain valid estimates of variability of grab sample means by substituting such an estimate into the formula for the variability of means of simple random samples. Thus using simple random sample formulas for grab sample means introduces a double bias, both parts of which lead to an unwarranted appearance of higher stability." (Frederick Mosteller et al, "Principles of Sampling", Journal of the American Statistical Association Vol. 49 (265), 1954)

"The usefulness of the models in constructing a testable theory of the process is severely limited by the quickly increasing number of parameters which must be estimated in order to compare the predictions of the models with empirical results" (Anatol Rapoport, "Prisoner's Dilemma: A study in conflict and cooperation", 1965)

"Pencil and paper for construction of distributions, scatter diagrams, and run-charts to compare small groups and to detect trends are more efficient methods of estimation than statistical inference that depends on variances and standard errors, as the simple techniques preserve the information in the original data." (William E Deming, "On Probability as Basis for Action" American Statistician Vol. 29 (4), 1975)

"In physics it is usual to give alternative theoretical treatments of the same phenomenon. We construct different models for different purposes, with different equations to describe them. Which is the right model, which the 'true' set of equations? The question is a mistake. One model brings out some aspects of the phenomenon; a different model brings out others. Some equations give a rougher estimate for a quantity of interest, but are easier to solve. No single model serves all purposes best." (Nancy Cartwright, "How the Laws of Physics Lie", 1983)

"Probability is the mathematics of uncertainty. Not only do we constantly face situations in which there is neither adequate data nor an adequate theory, but many modem theories have uncertainty built into their foundations. Thus learning to think in terms of probability is essential. Statistics is the reverse of probability (glibly speaking). In probability you go from the model of the situation to what you expect to see; in statistics you have the observations and you wish to estimate features of the underlying model." (Richard W Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985)

"A mechanistic model has the following advantages: 1. It contributes to our scientific understanding of the phenomenon under study. 2. It usually provides a better basis for extrapolation (at least to conditions worthy of further experimental investigation if not through the entire range of all input variables). 3. It tends to be parsimonious (i. e, frugal) in the use of parameters and to provide better estimates of the response." (George E P Box, "Empirical Model-Building and Response Surfaces", 1987)

"A tendency to drastically underestimate the frequency of coincidences is a prime characteristic of innumerates, who generally accord great significance to correspondences of all sorts while attributing too little significance to quite conclusive but less flashy statistical evidence." (John A Paulos, "Innumeracy: Mathematical Illiteracy and its Consequences", 1988)

"A model for simulating dynamic system behavior requires formal policy descriptions to specify how individual decisions are to be made. Flows of information are continuously converted into decisions and actions. No plea about the inadequacy of our understanding of the decision-making processes can excuse us from estimating decision-making criteria. To omit a decision point is to deny its presence - a mistake of far greater magnitude than any errors in our best estimate of the process." (Jay W Forrester, "Policies, decisions and information sources for modeling", 1994)

"In constructing a model, we always attempt to maximize its usefulness. This aim is closely connected with the relationship among three key characteristics of every systems model: complexity, credibility, and uncertainty. This relationship is not as yet fully understood. We only know that uncertainty (predictive, prescriptive, etc.) has a pivotal role in any efforts to maximize the usefulness of systems models. Although usually (but not always) undesirable when considered alone, uncertainty becomes very valuable when considered in connection to the other characteristics of systems models: in general, allowing more uncertainty tends to reduce complexity and increase credibility of the resulting model. Our challenge in systems modelling is to develop methods by which an optimal level of allowable uncertainty can be estimated for each modelling problem." (George J Klir & Bo Yuan, "Fuzzy Sets and Fuzzy Logic: Theory and Applications", 1995)

"Delay time, the time between causes and their impacts, can highly influence systems. Yet the concept of delayed effect is often missed in our impatient society, and when it is recognized, it’s almost always underestimated. Such oversight and devaluation can lead to poor decision making as well as poor problem solving, for decisions often have consequences that don’t show up until years later. Fortunately, mind mapping, fishbone diagrams, and creativity/brainstorming tools can be quite useful here." (Stephen G Haines, "The Manager's Pocket Guide to Strategic and Business Planning", 1998)

“Accurate estimates depend at least as much upon the mental model used in forming the picture as upon the number of pieces of the puzzle that have been collected.” (Richards J. Heuer Jr, “Psychology of Intelligence Analysis”, 1999)

“[…] we underestimate the share of randomness in about everything […]  The degree of resistance to randomness in one’s life is an abstract idea, part of its logic counterintuitive, and, to confuse matters, its realizations nonobservable.” (Nassim N Taleb, “Fooled by Randomness”, 2001)

"Most long-range forecasts of what is technically feasible in future time periods dramatically underestimate the power of future developments because they are based on what I call the 'intuitive linear' view of history rather than the 'historical exponential' view." (Ray Kurzweil, "The Singularity is Near", 2005)

"[myth:] Accuracy is more important than precision. For single best estimates, be it a mean value or a single data value, this question does not arise because in that case there is no difference between accuracy and precision. (Think of a single shot aimed at a target.) Generally, it is good practice to balance precision and accuracy. The actual requirements will differ from case to case." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"As uncertainties of scientific data values are nearly as important as the data values themselves, it is usually not acceptable that a best estimate is only accompanied by an estimated uncertainty. Therefore, only the size of nondominant uncertainties should be estimated. For estimating the size of a nondominant uncertainty we need to find its upper limit, i.e., we want to be as sure as possible that the uncertainty does not exceed a certain value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Before best estimates are extracted from data sets by way of a regression analysis, the uncertainties of the individual data values must be determined.In this case care must be taken to recognize which uncertainty components are common to all the values, i.e., those that are correlated (systematic)." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[myth:] Counting can be done without error. Usually, the counted number is an integer and therefore without (rounding) error. However, the best estimate of a scientifically relevant value obtained by counting will always have an error. These errors can be very small in cases of consecutive counting, in particular of regular events, e.g., when measuring frequencies." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"Due to the theory that underlies uncertainties an infinite number of data values would be necessary to determine the true value of any quantity. In reality the number of available data values will be relatively small and thus this requirement can never be fully met; all one can get is the best estimate of the true value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is the aim of all data analysis that a result is given in form of the best estimate of the true value. Only in simple cases is it possible to use the data value itself as result and thus as best estimate." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"It is the nature of an uncertainty that it is not known and can never be known, whether the best estimate is greater or less than the true value." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"The methodology of feedback design is borrowed from cybernetics (control theory). It is based upon methods of controlled system model’s building, methods of system states and parameters estimation (identification), and methods of feedback synthesis. The models of controlled system used in cybernetics differ from conventional models of physics and mechanics in that they have explicitly specified inputs and outputs. Unlike conventional physics results, often formulated as conservation laws, the results of cybernetical physics are formulated in the form of transformation laws, establishing the possibilities and limits of changing properties of a physical system by means of control." (Alexander L Fradkov, "Cybernetical Physics: From Control of Chaos to Quantum Control", 2007)

"A good estimator has to be more than just consistent. It also should be one whose variance is less than that of any other estimator. This property is called minimum variance. This means that if we run the experiment several times, the 'answers' we get will be closer to one another than 'answers' based on some other estimator." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"An estimate (the mathematical definition) is a number derived from observed values that is as close as we can get to the true parameter value. Useful estimators are those that are 'better' in some sense than any others." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"Estimators are functions of the observed values that can be used to estimate specific parameters. Good estimators are those that are consistent and have minimum variance. These properties are guaranteed if the estimator maximizes the likelihood of the observations." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"GIGO is a famous saying coined by early computer scientists: garbage in, garbage out. At the time, people would blindly put their trust into anything a computer output indicated because the output had the illusion of precision and certainty. If a statistic is composed of a series of poorly defined measures, guesses, misunderstandings, oversimplifications, mismeasurements, or flawed estimates, the resulting conclusion will be flawed." (Daniel J Levitin, "Weaponized Lies", 2017)

"One final warning about the use of statistical models (whether linear or otherwise): The estimated model describes the structure of the data that have been observed. It is unwise to extend this model very far beyond the observed data." (David S Salsburg, "Errors, Blunders, and Lies: How to Tell the Difference", 2017)

"One kind of probability - classic probability - is based on the idea of symmetry and equal likelihood […] In the classic case, we know the parameters of the system and thus can calculate the probabilities for the events each system will generate. […] A second kind of probability arises because in daily life we often want to know something about the likelihood of other events occurring […]. In this second case, we need to estimate the parameters of the system because we don’t know what those parameters are. […] A third kind of probability differs from these first two because it’s not obtained from an experiment or a replicable event - rather, it expresses an opinion or degree of belief about how likely a particular event is to occur. This is called subjective probability […]." (Daniel J Levitin, "Weaponized Lies", 2017)

"Samples give us estimates of something, and they will almost always deviate from the true number by some amount, large or small, and that is the margin of error. […] The margin of error does not address underlying flaws in the research, only the degree of error in the sampling procedure. But ignoring those deeper possible flaws for the moment, there is another measurement or statistic that accompanies any rigorously defined sample: the confidence interval." (Daniel J Levitin, "Weaponized Lies", 2017)

"The margin of error is how accurate the results are, and the confidence interval is how confident you are that your estimate falls within the margin of error." (Daniel J Levitin, "Weaponized Lies", 2017)

18 December 2018

🔭Data Science: Context (Just the Quotes)

"Some of the common ways of producing a false statistical argument are to quote figures without their context, omitting the cautions as to their incompleteness, or to apply them to a group of phenomena quite different to that to which they in reality relate; to take these estimates referring to only part of a group as complete; to enumerate the events favorable to an argument, omitting the other side; and to argue hastily from effect to cause, this last error being the one most often fathered on to statistics. For all these elementary mistakes in logic, statistics is held responsible." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"When evaluating the reliability and generality of data, it is often important to know the aims of the experimenter. When evaluating the importance of experimental results, however, science has a trick of disregarding the experimenter's rationale and finding a more appropriate context for the data than the one he proposed." (Murray Sidman, "Tactics of Scientific Research", 1960)

"Data in isolation are meaningless, a collection of numbers. Only in context of a theory do they assume significance […]" (George Greenstein, "Frozen Star" , 1983)

"Graphics must not quote data out of context." (Edward R Tufte, "The Visual Display of Quantitative Information", 1983)

"The problem solver needs to stand back and examine problem contexts in the light of different 'Ws' (Weltanschauungen). Perhaps he can then decide which 'W' seems to capture the essence of the particular problem context he is faced with. This whole process needs formalizing if it is to be carried out successfully. The problem solver needs to be aware of different paradigms in the social sciences, and he must be prepared to view the problem context through each of these paradigms." (Michael C Jackson, "Towards a System of Systems Methodologies", 1984)

"It is commonly said that a pattern, however it is written, has four essential parts: a statement of the context where the pattern is useful, the problem that the pattern addresses, the forces that play in forming a solution, and the solution that resolves those forces. [...] it supports the definition of a pattern as 'a solution to a problem in a context', a definition that [unfortunately] fixes the bounds of the pattern to a single problem-solution pair." (Martin Fowler, "Analysis Patterns: Reusable Object Models", 1997)

"We do not learn much from looking at a model - we learn more from building the model and manipulating it. Just as one needs to use or observe the use of a hammer in order to really understand its function, similarly, models have to be used before they will give up their secrets. In this sense, they have the quality of a technology - the power of the model only becomes apparent in the context of its use." (Margaret Morrison & Mary S Morgan, "Models as mediating instruments", 1999)

"Data are collected as a basis for action. Yet before anyone can use data as a basis for action the data have to be interpreted. The proper interpretation of data will require that the data be presented in context, and that the analysis technique used will filter out the noise."  (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"[…] you simply cannot make sense of any number without a contextual basis. Yet the traditional attempts to provide this contextual basis are often flawed in their execution. [...] Data have no meaning apart from their context. Data presented without a context are effectively rendered meaningless." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"All scientific theories, even those in the physical sciences, are developed in a particular cultural context. Although the context may help to explain the persistence of a theory in the face of apparently falsifying evidence, the fact that a theory arises from a particular context is not sufficient to condemn it. Theories and paradigms must be accepted, modified or rejected on the basis of evidence." (Richard P Bentall,  "Madness Explained: Psychosis and Human Nature", 2003)

"Mathematical modeling is as much ‘art’ as ‘science’: it requires the practitioner to (i) identify a so-called ‘real world’ problem (whatever the context may be); (ii) formulate it in mathematical terms (the ‘word problem’ so beloved of undergraduates); (iii) solve the problem thus formulated (if possible; perhaps approximate solutions will suffice, especially if the complete problem is intractable); and (iv) interpret the solution in the context of the original problem." (John A Adam, "Mathematics in Nature", 2003)

"Context is not as simple as being in a different space [...] context includes elements like our emotions, recent experiences, beliefs, and the surrounding environment - each element possesses attributes, that when considered in a certain light, informs what is possible in the discussion." (George Siemens, "Knowing Knowledge", 2006)

"Statistics can certainly pronounce a fact, but they cannot explain it without an underlying context, or theory. Numbers have an unfortunate tendency to supersede other types of knowing. […] Numbers give the illusion of presenting more truth and precision than they are capable of providing." (Ronald J Baker, "Measure what Matters to Customers: Using Key Predictive Indicators", 2006)

"A valid digit is not necessarily a significant digit. The significance of numbers is a result of its scientific context." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"[… ] statistics is about understanding the role that variability plays in drawing conclusions based on data. […] Statistics is not about numbers; it is about data - numbers in context. It is the context that makes a problem meaningful and something worth considering." (Roxy Peck et al, "Introduction to Statistics and Data Analysis" 4th Ed., 2012)

"Context (information that lends to better understanding the who, what, when, where, and why of your data) can make the data clearer for readers and point them in the right direction. At the least, it can remind you what a graph is about when you come back to it a few months later. […] Context helps readers relate to and understand the data in a visualization better. It provides a sense of scale and strengthens the connection between abstract geometry and colors to the real world." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Readability in visualization helps people interpret data and make conclusions about what the data has to say. Embed charts in reports or surround them with text, and you can explain results in detail. However, take a visualization out of a report or disconnect it from text that provides context (as is common when people share graphics online), and the data might lose its meaning; or worse, others might misinterpret what you tried to show." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"The data is a simplification - an abstraction - of the real world. So when you visualize data, you visualize an abstraction of the world, or at least some tiny facet of it. Visualization is an abstraction of data, so in the end, you end up with an abstraction of an abstraction, which creates an interesting challenge. […] Just like what it represents, data can be complex with variability and uncertainty, but consider it all in the right context, and it starts to make sense." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Without context, data is useless, and any visualization you create with it will also be useless. Using data without knowing anything about it, other than the values themselves, is like hearing an abridged quote secondhand and then citing it as a main discussion point in an essay. It might be okay, but you risk finding out later that the speaker meant the opposite of what you thought." (Nathan Yau, "Data Points: Visualization That Means Something", 2013)

"Statistics are meaningless unless they exist in some context. One reason why the indicators have become more central and potent over time is that the longer they have been kept, the easier it is to find useful patterns and points of reference." (Zachary Karabell, "The Leading Indicators: A short history of the numbers that rule our world", 2014)

"The term data, unlike the related terms facts and evidence, does not connote truth. Data is descriptive, but data can be erroneous. We tend to distinguish data from information. Data is a primitive or atomic state (as in ‘raw data’). It becomes information only when it is presented in context, in a way that informs. This progression from data to information is not the only direction in which the relationship flows, however; information can also be broken down into pieces, stripped of context, and stored as data. This is the case with most of the data that’s stored in computer systems. Data that’s collected and stored directly by machines, such as sensors, becomes information only when it’s reconnected to its context." (Stephen Few, "Signal: Understanding What Matters in a World of Noise", 2015)

"Infographics combine art and science to produce something that is not unlike a dashboard. The main difference from a dashboard is the subjective data and the narrative or story, which enhances the data-driven visual and engages the audience quickly through highlighting the required context." (Travis Murphy, "Infographics Powered by SAS®: Data Visualization Techniques for Business Reporting", 2018)

"For numbers to be transparent, they must be placed in an appropriate context. Numbers must presented in a way that allows for fair comparisons." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

"Without knowing the source and context, a particular statistic is worth little. Yet numbers and statistics appear rigorous and reliable simply by virtue of being quantitative, and have a tendency to spread." (Carl T Bergstrom & Jevin D West, "Calling Bullshit: The Art of Skepticism in a Data-Driven World", 2020)

More quotes on "Context" at the-web-of-knowledge.blogspot.com


15 December 2018

🔭Data Science: Probability (Just the Quotes)

"Probability is a degree of possibility." (Gottfried W Leibniz, "On estimating the uncertain", 1676)

"Probability, however, is not something absolute, [it is] drawn from certain information which, although it does not suffice to resolve the problem, nevertheless ensures that we judge correctly which of the two opposites is the easiest given the conditions known to us." (Gottfried W Leibniz, "Forethoughts for an encyclopaedia or universal science", cca. 1679)

"[…] the highest probability amounts not to certainty, without which there can be no true knowledge." (John Locke, "An Essay Concerning Human Understanding", 1689)

"As mathematical and absolute certainty is seldom to be attained in human affairs, reason and public utility require that judges and all mankind in forming their opinions of the truth of facts should be regulated by the superior number of the probabilities on the one side or the other whether the amount of these probabilities be expressed in words and arguments or by figures and numbers." (William Murray, 1773)

"All certainty which does not consist in mathematical demonstration is nothing more than the highest probability; there is no other historical certainty." (Voltaire, "A Philosophical Dictionary", 1881)

"Nature prefers the more probable states to the less probable because in nature processes take place in the direction of greater probability. Heat goes from a body at higher temperature to a body at lower temperature because the state of equal temperature distribution is more probable than a state of unequal temperature distribution." (Max Planck, "The Atomic Theory of Matter", 1909)

"Sometimes the probability in favor of a generalization is enormous, but the infinite probability of certainty is never reached." (William Dampier-Whetham, "Science and the Human Mind", 1912)

"There can be no unique probability attached to any event or behaviour: we can only speak of ‘probability in the light of certain given information’, and the probability alters according to the extent of the information." (Sir Arthur S Eddington, "The Nature of the Physical World", 1928)

"[…] the statistical prediction of the future from the past cannot be generally valid, because whatever is future to any given past, is in tum past for some future. That is, whoever continually revises his judgment of the probability of a statistical generalization by its successively observed verifications and failures, cannot fail to make more successful predictions than if he should disregard the past in his anticipation of the future. This might be called the ‘Principle of statistical accumulation’." (Clarence I Lewis, "Mind and the World-Order: Outline of a Theory of Knowledge", 1929)

"Science does not aim, primarily, at high probabilities. It aims at a high informative content, well backed by experience. But a hypothesis may be very probable simply because it tells us nothing, or very little." (Karl Popper, "The Logic of Scientific Discovery", 1934)

"The most important application of the theory of probability is to what we may call 'chance-like' or 'random' events, or occurrences. These seem to be characterized by a peculiar kind of incalculability which makes one disposed to believe - after many unsuccessful attempts - that all known rational methods of prediction must fail in their case. We have, as it were, the feeling that not a scientist but only a prophet could predict them. And yet, it is just this incalculability that makes us conclude that the calculus of probability can be applied to these events." (Karl R Popper, "The Logic of Scientific Discovery", 1934)

"Equiprobability in the physical world is purely a hypothesis. We may exercise the greatest care and the most accurate of scientific instruments to determine whether or not a penny is symmetrical. Even if we are satisfied that it is, and that our evidence on that point is conclusive, our knowledge, or rather our ignorance, about the vast number of other causes which affect the fall of the penny is so abysmal that the fact of the penny’s symmetry is a mere detail. Thus, the statement 'head and tail are equiprobable' is at best an assumption." (Edward Kasner & James R Newman, "Mathematics and the Imagination", 1940)

"Probabilities must be regarded as analogous to the measurement of physical magnitudes; that is to say, they can never be known exactly, but only within certain approximation." (Emile Borel, "Probabilities and Life", 1943)

"Just as entropy is a measure of disorganization, the information carried by a set of messages is a measure of organization. In fact, it is possible to interpret the information carried by a message as essentially the negative of its entropy, and the negative logarithm of its probability. That is, the more probable the message, the less information it gives. Clichés, for example, are less illuminating than great poems." (Norbert Wiener, "The Human Use of Human Beings", 1950)

"To say that observations of the past are certain, whereas predictions are merely probable, is not the ultimate answer to the question of induction; it is only a sort of intermediate answer, which is incomplete unless a theory of probability is developed that explains what we should mean by ‘probable’ and on what ground we can assert probabilities." (Hans Reichenbach, "The Rise of Scientific Philosophy", 1951)

"Uncertainty is introduced, however, by the impossibility of making generalizations, most of the time, which happens to all members of a class. Even scientific truth is a matter of probability and the degree of probability stops somewhere short of certainty." (Wayne C Minnick, "The Art of Persuasion", 1957)

"Everybody has some idea of the meaning of the term 'probability' but there is no agreement among scientists on a precise definition of the term for the purpose of scientific methodology. It is sufficient for our purpose, however, if the concept is interpreted in terms of relative frequency, or more simply, how many times a particular event is likely to occur in a large population." (Alfred R Ilersic, "Statistics", 1959)

"Incomplete knowledge must be considered as perfectly normal in probability theory; we might even say that, if we knew all the circumstances of a phenomenon, there would be no place for probability, and we would know the outcome with certainty." (Félix E Borel, Probability and Certainty", 1963)

"Probability is the mathematics of uncertainty. Not only do we constantly face situations in which there is neither adequate data nor an adequate theory, but many modem theories have uncertainty built into their foundations. Thus learning to think in terms of probability is essential. Statistics is the reverse of probability (glibly speaking). In probability you go from the model of the situation to what you expect to see; in statistics you have the observations and you wish to estimate features of the underlying model." (Richard W Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985) 

"Probability plays a central role in many fields, from quantum mechanics to information theory, and even older fields use probability now that the presence of 'noise' is officially admitted. The newer aspects of many fields start with the admission of uncertainty." (Richard W Hamming, "Methods of Mathematics Applied to Calculus, Probability, and Statistics", 1985)

"Probabilities are summaries of knowledge that is left behind when information is transferred to a higher level of abstraction." (Judea Pearl, "Probabilistic Reasoning in Intelligent Systems: Network of Plausible, Inference", 1988)

"[In statistics] you have the fact that the concepts are not very clean. The idea of probability, of randomness, is not a clean mathematical idea. You cannot produce random numbers mathematically. They can only be produced by things like tossing dice or spinning a roulette wheel. With a formula, any formula, the number you get would be predictable and therefore not random. So as a statistician you have to rely on some conception of a world where things happen in some way at random, a conception which mathematicians don’t have." (Lucien LeCam, [interview] 1988)

"So we pour in data from the past to fuel the decision-making mechanisms created by our models, be they linear or nonlinear. But therein lies the logician's trap: past data from real life constitute a sequence of events rather than a set of independent observations, which is what the laws of probability demand. [...] It is in those outliers and imperfections that the wildness lurks." (Peter L Bernstein, "Against the Gods: The Remarkable Story of Risk", 1996) 

"Often, we use the word random loosely to describe something that is disordered, irregular, patternless, or unpredictable. We link it with chance, probability, luck, and coincidence. However, when we examine what we mean by random in various contexts, ambiguities and uncertainties inevitably arise. Tackling the subtleties of randomness allows us to go to the root of what we can understand of the universe we inhabit and helps us to define the limits of what we can know with certainty." (Ivars Peterson, "The Jungles of Randomness: A Mathematical Safari", 1998)

"In the laws of probability theory, likelihood distributions are fixed properties of a hypothesis. In the art of rationality, to explain is to anticipate. To anticipate is to explain." (Eliezer S. Yudkowsky, "A Technical Explanation of Technical Explanation", 2005)

"For some scientific data the true value cannot be given by a constant or some straightforward mathematical function but by a probability distribution or an expectation value. Such data are called probabilistic. Even so, their true value does not change with time or place, making them distinctly different from  most statistical data of everyday life." (Manfred Drosg, "Dealing with Uncertainties: A Guide to Error Analysis", 2007)

"In fact, H [entropy] measures the amount of uncertainty that exists in the phenomenon. If there were only one event, its probability would be equal to 1, and H would be equal to 0 - that is, there is no uncertainty about what will happen in a phenomenon with a single event because we always know what is going to occur. The more events that a phenomenon possesses, the more uncertainty there is about the state of the phenomenon. In other words, the more entropy, the more information." (Diego Rasskin-Gutman, "Chess Metaphors: Artificial Intelligence and the Human Mind", 2009)

"The four questions of data analysis are the questions of description, probability, inference, and homogeneity. [...] Descriptive statistics are built on the assumption that we can use a single value to characterize a single property for a single universe. […] Probability theory is focused on what happens to samples drawn from a known universe. If the data happen to come from different sources, then there are multiple universes with different probability models.  [...] Statistical inference assumes that you have a sample that is known to have come from one universe." (Donald J Wheeler," Myths About Data Analysis", International Lean & Six Sigma Conference, 2012)

"Entropy is a measure of amount of uncertainty or disorder present in the system within the possible probability distribution. The entropy and amount of unpredictability are directly proportional to each other." (G Suseela & Y Asnath V Phamila, "Security Framework for Smart Visual Sensor Networks", 2019)

02 December 2018

🔭Data Science: Hypothesis (Just the Quotes)

"[…] it is not necessary that these hypotheses should be true, or even probably; but it is enough if they provide a calculus which fits the observations […]" (Andrew Osiander, "On the Revolutions of the Heavenly Spheres", 1543)

"The art of discovering the causes of phenomena, or true hypothesis, is like the art of decyphering, in which an ingenious conjecture greatly shortens the road." (Gottfried W Leibniz, "New Essays Concerning Human Understanding", 1704) [published 1765]

"In order to shake a hypothesis, it is sometimes not necessary to do anything more than push it as far as it will go." (Denis Diderot, "On the Interpretation of Nature", 1753)

"No hypothesis can lay claim to any value unless it assembles many phenomena under one concept." (Johann Wolfgang von Goethe, [letter to Sommering] 1795)

"Induction, analogy, hypotheses founded upon facts and rectified continually by new observations, a happy tact given by nature and strengthened by numerous comparisons of its indications with experience, such are the principal means for arriving at truth." (Pierre-Simon Laplace, "A Philosophical Essay on Probabilities", 1814)

"The hypothesis is like the captain, and the observations like the soldiers of an army: while he appears to command them, and in this way to work his own will, he does in fact derive all his power of conquest from their obedience, and becomes helpless and useless if they mutiny." (William Whewell, "Philosophy of the Inductive Sciences", 1840)

"The process of scientific discovery is cautious and rigorous, not by abstaining from hypothesis, but by rigorously comparing hypotheses with facts, and by resolutely rejecting all which the comparison does not confirm." (William Whewell, "The Philosophy of the Inductive Sciences Founded Upon Their History" Vol. 2, 1840)

"When the hypothesis, of itself and without adjustment for the purpose, gives us the rule and reason of a class of facts not contemplated in its construction, we have a criterion of its reality, which has never yet been produced in favour of falsehood." (William Whewell, "The Philosophy of the Inductive Sciences", 1840) 

"An hypothesis being a mere supposition, there are no other limits to hypotheses than those of the human imagination; we may, if we please, imagine, by way of accounting for an effect, some cause of a kind utterly unknown, and acting according to a law altogether fictitious." (John S Mill, "A System of Logic, Ratiocinative and Inductive", 1843)

"It appears, then, to be a condition of a genuinely scientific hypothesis, that it be not destined always to remain an hypothesis, but be certain to be either proved or disproved by [...] comparison with observed facts." (John S Mill, "A System of Logic, Ratiocinative and Inductive", 1843)

"The hypothesis, by suggesting observations and experiments, puts us upon the road to that independent evidence if it be really attainable; and till it be attained, the hypothesis ought not to count for more than a suspicion." (John S Mill, "A System of Logic, Ratiocinative and Inductive", 1843)

"The rules of scientific investigation always require us, when we enter the domains of conjecture, to adopt that hypothesis by which the greatest number of known facts and phenomena may be reconciled." (Matthew F Maury, "The Physical Geography of the Sea", 1855) 

"An anticipative idea or an hypothesis is, then, the necessary starting point for all experimental reasoning. Without it, we could not make any investigation at all nor learn anything; we could only pile up sterile observations. If we experiment without a preconceived idea, we should move at random […]" (Claude Bernard, "An Introduction to the Study of Experimental Medicine", 1865)

"In scientific investigations, it is permitted to invent any hypothesis and, if it explains various large and independent classes of facts, it rises to the ranks of a well-grounded theory." (Charles Darwin, "The Variations of Animals and Plants Under Domestication" Vol. 1, 1868)

"The great tragedy of Science - the slaying of a beautiful hypothesis by an ugly fact." (Thomas H Huxley, "Biogenesis and abiogenesis", [address] 1870)

"[…] wrong hypotheses, rightly worked from, have produced more useful results than unguided observation." (Augustus de Morgan, "A Budget of Paradoxes", 1872)

"An hypothesis is only a habit - a habit of looking through a glass of one peculiar colour, which imparts its hue to all around it." (Frederick Marryat, "The King's Own", 1873) 

"A discoverer is a tester of scientific ideas; he must not only be able to imagine likely hypotheses, and to select suitable ones for investigation, but, as hypotheses may be true or untrue, he must also be competent to invent appropriate experiments for testing them, and to devise the requisite apparatus and arrangements." (George Gore, "The Art of Scientific Discovery", 1878)

"The scientific discovery appears first as the hypothesis of an analogy; and science tends to become independent of the hypothesis." (William K Clifford, "Lectures and Essays", 1879)

"Every hypothesis must derive indubitable results from mechanically well-defined assumptions by mathematically correct methods." (Ludwig Boltzmann, "Certain Questions of the Theory of Gasses", Nature Vol. 51 (1322), 1895) 

"For the truly scientific man, the hypothesis is destined solely to enable him to get the facts of nature in some definite order, an order which shall make apparent their connection with the great order and harmony which is believed to be present in the universe." (James M Baldwin, "The Processes of Life Revealed by the Microscope: A Plea for Physiological Histology", Science N.S. Vol. 2 (34), 1895)

"If the working hypothesis fails in any essential particular he [the scientist] is ready to modify or discard it. For the truly inspired investigator, one undoubted fact weighs more in the balance than a thousand theories." (James M Baldwin, "The Processes of Life Revealed by the Microscope: A Plea for Physiological Histology", Science N.S. Vol. 2 (34), 1895)

"In scientific investigations, it is permitted to invent any hypothesis and, if it explains various large and independent classes of facts, it rises to the ranks of a well-grounded theory." (Charles Darwin, "The Variations of Animals and Plants Under Domestication" Vol. 1, 1896)

"Entia non sunt multiplicanda praeter necessitatem. That is to say; before you try a complicated hypothesis, you should make quite sure that no simplification of it will explain the facts equally well." (Charles S Peirce," Pragmatism and Pragmaticism", [lecture] 1903)

"A false hypothesis, if it serve as a guide for further enquiry, may, at the right stage of science, be as useful as, or more useful than, a truer one for which acceptable evidence is not yet at hand." (William C Dampier, "Science and the Human Mind, Science in the Ancient World", 1912) 

"Without hypothesis there can be no progress in knowledge." (Max Verworn, "Irritability", 1913) 

"The great difference between induction and hypothesis is that the former infers the existence of phenomena such as we have observed in cases which are similar, while hypothesis supposes something of a different kind from what we have directly observed, and frequently something which it would be impossible for us to observe directly." (Charles S Peirce, "Chance, Love and Logic: Philosophical Essays, Deduction, Induction, Hypothesis", 1914)

"Theory is the best guide for experiment - that were it not for theory and the problems and hypotheses that come out of it, we would not know the points we wanted to verify, and hence would experiment aimlessly" (Henry Hazlitt,  "Thinking as a Science", 1916)

"A good hypothesis in science must have other properties than those of the phenomenon it is immediately invoked to explain, otherwise it is not prolific enough." (William James, "Selected Papers on Philosophy", 1918) 

"An indispensable hypothesis, even though still far from being a guarantee of success, is however the pursuit of a specific aim, whose lighted beacon, even by initial failures, is not betrayed." (Max Planck, [Nobel lecture] 1918) 

"A hypothesis or theory is clear, decisive, and positive, but it is believed by no one but the man who created it. Experimental findings, on the other hand, are messy, inexact things, which are believed by everyone except the man who did the work." (Harlow Shapley, "Review of Scientific Instruments" Vol. 6, 1922) 

"However successful a theory or law may have been in the past, directly it fails to interpret new discoveries its work is finished, and it must be discarded or modified. However plausible the hypothesis, it must be ever ready for sacrifice on the altar of observation." (Joseph W Mellor, "A Comprehensive Treatise on Inorganic and Theoretical Chemistry", 1922) 

"Hypothesis, however, is an inference based on knowledge which is insufficient to prove its high probability." (Frederick L Barry, "The Scientific Habit of Thought", 1927) 

"Abstraction is the detection of a common quality in the characteristics of a number of diverse observations […] A hypothesis serves the same purpose, but in a different way. It relates apparently diverse experiences, not by directly detecting a common quality in the experiences themselves, but by inventing a fictitious substance or process or idea, in terms of which the experience can be expressed. A hypothesis, in brief, correlates observations by adding something to them, while abstraction achieves the same end by subtracting something." (Herbert Dingle, Science and Human Experience, 1931)

"Science does not aim, primarily, at high probabilities. It aims at a high informative content, well backed by experience. But a hypothesis may be very probable simply because it tells us nothing, or very little." (Karl Popper, "The Logic of Scientific Discovery", 1934) 

"All the theories and hypotheses of empirical science share this provisional character of being established and accepted ‘until further notice’, whereas a mathematical theorem, once proved, is established once and for all; it holds with that particular certainty which no subsequent empirical discoveries, however unexpected and extraordinary, can ever affect to the slightest extent." (Carl G Hempel, "Geometry and Empirical Science", 1935)

"In relation to any experiment we may speak of this hypothesis as the null hypothesis, and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis." (Ronald Fisher, "The Design of Experiments", 1935)

"The laws of science are the permanent contributions to knowledge - the individual pieces that are fitted together in an attempt to form a picture of the physical universe in action. As the pieces fall into place, we often catch glimpses of emerging patterns, called theories; they set us searching for the missing pieces that will fill in the gaps and complete the patterns. These theories, these provisional interpretations of the data in hand, are mere working hypotheses, and they are treated with scant respect until they can be tested by new pieces of the puzzle." (Edwin P Whipple, "Experiment and Experience", [Commencement Address, California Institute of Technology] 1938)

"When two hypotheses are possible, we provisionally choose that which our minds adjudge to the simpler on the supposition that this Is the more likely to lead in the direction of the truth." (James H Jeans, "Physics and Philosophy" 3rd Ed., 1943)

"We see what we want to see, and observation conforms to hypothesis." (Bergen Evans, "The Natural History of Nonsense", 1946)

"A successful hypothesis is not necessarily a permanent hypothesis, but it is one which stimulates additional research, opens up new fields, or explains and coordinates previously unrelated facts." (Farrington Daniels, "Outlines of Physical Chemistry", 1948)

"There would be cases where we would not want to accept an hypothesis even though the evidence gives a high d. c. [degree of confirmation] score, because we are fearful of the consequences of a wrong decision." (C West Churchman, "Theory of Experimental Inference", 1948) 

"Hypothesis is a tool which can cause trouble if not used properly. We must be ready to abandon out hypothesis as soon as it is shown to be inconsistent with the facts." (William I B Beveridge, "The Art of Scientific Investigation", 1950) 

"A collection of observable concepts in a purely formal hypothesis suggesting no analogy with anything would consequently not suggest either any directions for its own development." (Mary B Hesse, "Operational Definition and Analogy in Physical Theories", British Journal for the Philosophy of Science 2 (8), 1952)

"Whenever we attempt to test a hypothesis we naturally try to avoid errors in judging it. This seems to indicate the right way of proceeding: when choosing a test we should try to minimize the frequency of errors that may be committed in applying it." (Jerzy Neyman, "Lectures and Conferences on Mathematical Statistics", 1952) 

"The only relevant test of the validity of a hypothesis is comparison of prediction with experience." (Milton Friedman, "Essays in Positive Economics", 1953)

"[…] the grand aim of all science […] is to cover the greatest possible number of empirical facts by logical deductions from the smallest possible number of hypotheses or axioms." (Albert Einstein, 1954)

"One must credit an hypothesis with all that has had to be discovered in order to demolish it." (Jean Rostand, "The substance of man", 1962)

"The formulation of a hypothesis carries with it an obligation to test it as rigorously as we can command skills to do so." (Peter Medawar, "Hypothesis and Imagination", 1963)

"Truth in science can be defined as the working hypothesis best suited to open the way to the next better one." (Konrad Lorenz, "On Aggression", 1963) 

"The validation of a model is not that it is 'true' but that it generates good testable hypotheses relevant to important problems." (Richard Levins, "The Strategy of Model Building in Population Biology", 1966)

"All testing, all confirmation and disconfirmation of a hypothesis takes place already within a system. And this system is not a more or less arbitrary and doubtful point of departure for all our arguments; no it belongs to the essence of what we call an argument. The system is not so much the point of departure, as the element in which our arguments have their life." (Ludwig Wittgenstein, "On Certainty", 1969) 

"Science consists simply of the formulation and testing of hypotheses based on observational evidence; experiments are important where applicable, but their function is merely to simplify observation by imposing controlled conditions." (Henry L Batten, "Evolution of the Earth", 1971)

"An experiment is a failure only when it also fails adequately to test the hypothesis in question, when the data it produces don't prove anything one way or the other." (Robert M Pirsig, "Zen and the Art of Motorcycle Maintenance", 1974)

"A hypothesis is empirical or scientific only if it can be tested by experience. […] A hypothesis or theory which cannot be, at least in principle, falsified by empirical observations and experiments does not belong to the realm of science." (Francisco J Ayala, "Biological Evolution: Natural Selection or Random Walk", American Scientist, 1974)

"A hypothesis will in the end become a truth when all phenomena let themselves be derived from it in a natural and in an obvious manner, when all these consequences are connected with one another and with the general reasons, in short, when that hypothesis is consistent in all its parts with itself." (Johann H Lambert, 1976)

"The essential function of a hypothesis consists in the guidance it affords to new observations and experiments, by which our conjecture is either confirmed or refuted." (Ernst Mach, "Knowledge and Error: Sketches on the Psychology of Enquiry", 1976)

"Be suspicious of a theory if more and more hypotheses are needed to support it as new facts become available, or as new considerations are brought to bear." (Sir Fred Hoyle & Nalin C Wickramasinghe, "Evolution from Space", 1981)

"All interpretations made by a scientist are hypotheses, and all hypotheses are tentative. They must forever be tested and they must be revised if found to be unsatisfactory. Hence, a change of mind in a scientist, and particularly in a great scientist, is not only not a sign of weakness but rather evidence for continuing attention to the respective problem and an ability to test the hypothesis again and again." (Ernst Mayr, "The Growth of Biological Thought: Diversity, Evolution and Inheritance", 1982)

"Don't just read it; fight it! Ask your own question, look for your own examples, dicover your own proofs. Is the hypothesis necessary? Is the converse true? What happens in the classical special case? What about the degenerate cases? Where does the proof use the hypothesis?" (Paul R Halmos, "I Want to be a Mathematician", 1985)

"Beware of the problem of testing too many hypotheses; the more you torture the data, the more likely they are to confess, but confessions obtained under duress may not be admissible in the court of scientific opinion." (Stephen M Stigler, "Testing Hypotheses or fitting Models? Another Look at Mass Extinctions" [in "Neutral Models in Biology"], 1987)

"All science is based on models, and every scientific model comprises three distinct stages: statement of well-defined hypotheses; deduction of all the consequences of these hypotheses, and nothing but these consequences; confrontation of these consequences with observed data." (Maurice Allais, "An Outline of My Main Contributions to Economic Science", [Noble lecture] 1988)

"Any physical theory is always provisional, in the sense that it is only a hypothesis: you can never prove it. No matter how many times the results of experiments agree with some theory, you can never be sure that the next time the result will not contradict the theory." (Stephen Hawking,  "A Brief History of Time", 1988)

"The heart of the scientific method is the problem-hypothesis-test process. And, necessarily, the scientific method involves predictions. And predictions, to be useful in scientific methodology, must be subject to test empirically." (Paul Davies, "The Cosmic Blueprint: New Discoveries in Nature's Creative Ability to, Order the Universe", 1988)

"The model and the theory it represents must be accepted, at least temporarily, or rejected, depending on the agreement or disagreement between observed data and the hypotheses and implications of the model. When neither the hypotheses nor the implications of a theory can be confronted with the real world, that theory is devoid of any scientific interest. Mere logical, even mathematical, deduction remains worthless in terms of the understanding of reality if it is not closely linked to that reality." (Maurice Allais, "An Outline of My Main Contributions to Economic Science", [Noble lecture] 1988)

"A fact is a simple statement that everyone believes. It is innocent, unless found guilty. A hypothesis is a novel suggestion that no one wants to believe. It is guilty, until found effective." (Edward Teller, "Conversations on the Dark Secrets of Physics", 1991)

"Visualizations can be used to explore data, to confirm a hypothesis, or to manipulate a viewer. [...] In exploratory visualization the user does not necessarily know what he is looking for. This creates a dynamic scenario in which interaction is critical. [...] In a confirmatory visualization, the user has a hypothesis that needs to be tested. This scenario is more stable and predictable. System parameters are often predetermined." (Usama Fayyad et al, "Information Visualization in Data Mining and Knowledge Discovery", 2002) 

"[…] a conceptual model is a diagram connecting variables and constructs based on theory and logic that displays the hypotheses to be tested." (Mary W Celsi et al, "Essentials of Business Research Methods", 2011)

"Data science is an iterative process. It starts with a hypothesis (or several hypotheses) about the system we’re studying, and then we analyze the information. The results allow us to reject our initial hypotheses and refine our understanding of the data. When working with thousands of fields and millions of rows, it’s important to develop intuitive ways to reject bad hypotheses quickly." (Phil Simon, "The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions", 2014)

"Observation and experiment, without a rational hypothesis, is like a man groping at objects at random with his eyes shut." (Henry P Tappan, "Elements of Logic", 2015)

"A hypothesis is a starting point for an investigation. When you hypothesize, you make a claim about why something might be the case, based on limited data, to offer an explanation or a path forward. You wouldn’t make a proposition about something you are certain of. You may not have enough evidence yet to even convince you that it’s true. But making such a claim puts a stake in the ground that suggests a path for focused analysis." (Eben Hewitt, "Technology Strategy Patterns: Architecture as strategy" 2nd Ed., 2019)

"Data science is, in reality, something that has been around for a very long time. The desire to utilize data to test, understand, experiment, and prove out hypotheses has been around for ages. To put it simply: the use of data to figure things out has been around since a human tried to utilize the information about herds moving about and finding ways to satisfy hunger. The topic of data science came into popular culture more and more as the advent of ‘big data’ came to the forefront of the business world." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

"Pure data science is the use of data to test, hypothesize, utilize statistics and more, to predict, model, build algorithms, and so forth. This is the technical part of the puzzle. We need this within each organization. By having it, we can utilize the power that these technical aspects bring to data and analytics. Then, with the power to communicate effectively, the analysis can flow throughout the needed parts of an organization." (Jordan Morrow, "Be Data Literate: The data literacy skills everyone needs to succeed", 2021)

Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.