Showing posts with label approximation. Show all posts

03 March 2025

💎🤖SQL Reloaded: Copilot Stories (Part VIII: Recursive Functions)

In theory, recursive functions prove useful in generating a list of numbers based on a formula in which each result builds on the result of the previous step. In other words, a problem is solved by calculating the previous iteration of the same problem. In practice, they are limited by the number of iterations allowed (SQL Server enforces a hard limit of 32 nesting levels), by the available resources (usually memory), and probably a few other factors. Of course, some algorithms can be implemented to overcome these limitations, while at other times there are quite simple linear implementations based on a loop, like in the examples below. Therefore, recursive functions remain more of a theoretical concept.

Disclaimer: The below code was generated by Copilot via prompts and is used for demonstration purposes!

This post evaluates whether Microsoft 365 and Edge Copilot can be used to generate the code for some recursive functions. 

Case 1: Factorials

Let's start with the factorial, which calculates the product of the first n positive integers.

Prompt:
Generate a recursive function in SQL Server to calculate the factorial of a number.

Output (M365 Copilot):

-- factorial function in Microsoft 365 Copilot
CREATE FUNCTION dbo.Factorial (@n INT)
RETURNS BIGINT
AS
BEGIN
    IF @n <= 1
        RETURN 1
    ELSE
        RETURN @n * dbo.Factorial(@n - 1)
END

Output (Edge Copilot):

-- factorial function in Edge Copilot
CREATE FUNCTION dbo.Factorial (@n INT)
RETURNS BIGINT
AS
BEGIN
    IF @n = 0 OR @n = 1
        RETURN 1
    ELSE
        RETURN @n * dbo.Factorial(@n - 1)
END;

The structure seems correct, and the return type was correctly identified as bigint, the integer type with the largest range. Unfortunately, both functions lead to the same error in a SQL database:

"Msg 455, Level 16, State 2, Line 9, The last statement included within a function must be a return statement."

The formulas can be consolidated into a single RETURN statement as follows:

-- factorial function adapted (minimal)
CREATE FUNCTION dbo.Factorial (@n INT)
RETURNS BIGINT
AS
BEGIN
    RETURN 
        CASE 
            WHEN @n = 0 OR @n = 1 THEN 1
        ELSE
            @n * dbo.Factorial(@n - 1) END
END;

Especially when the formulas become complex, it might be a good idea to store the result in a variable and use it in the RETURN statement.

-- factorial function adapted
CREATE FUNCTION dbo.Factorial (@n INT)
RETURNS BIGINT
AS
BEGIN
    DECLARE @retval as bigint;

    IF @n = 0 OR @n = 1
        SET @retval = 1
    ELSE
        SET @retval = @n * dbo.Factorial(@n - 1)

    RETURN @retval
END;

And, here are some tests for the function:

-- testing the functions for multiple values
SELECT dbo.Factorial (0) -- returns 1
SELECT dbo.Factorial (1) -- returns 1
SELECT dbo.Factorial (2) -- returns 2
SELECT dbo.Factorial (3) -- returns 6
SELECT dbo.Factorial (4) -- returns 24
SELECT dbo.Factorial (5) -- returns 120
SELECT dbo.Factorial (20) -- returns 2432902008176640000
SELECT dbo.Factorial (21) -- returns an error for all bigger values

As can be seen, the function returns an error for values bigger than 20, as the output is limited by the bigint data type. One can add further constraints to test for this maximum value.

Use the following code to drop the function as needed:

-- dropping the function
DROP FUNCTION IF EXISTS dbo.Factorial 

The function's implementation has, however, another important caveat: if a negative number is provided as parameter, the following error message is thrown:

"Msg 217, Level 16, State 1, Line 1, Maximum stored procedure, function, trigger, or view nesting level exceeded (limit 32)."

The final function can be defined as:

-- factorial function adapted (final)
CREATE FUNCTION dbo.Factorial (@n INT)
RETURNS BIGINT
AS
BEGIN
    DECLARE @retval as bigint;

    IF @n BETWEEN 0 AND 20
        IF @n = 0 OR @n = 1
            SET @retval = 1
        ELSE
            SET @retval = @n * dbo.Factorial(@n - 1)

    RETURN @retval
END;
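
With the guard in place, out-of-range parameters no longer trigger the recursion. A minimal test sketch (the expected NULL outputs are assumptions based on the logic above, since @retval remains unset when the guard is not met):

-- testing the guarded function (sketch)
SELECT dbo.Factorial (-1) -- expected: NULL instead of the nesting error
SELECT dbo.Factorial (21) -- expected: NULL, as 21! exceeds the bigint range
SELECT dbo.Factorial (20) -- expected: 2432902008176640000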

One can also linearize the function as a loop, though the limitation on the maximum value still exists:

Prompt:
Rewrite the factorial function using a loop.

Output (M365 Copilot):

-- non-recursive function to calculate the factorial of a number using a loop:
CREATE FUNCTION dbo.Factorial (@n INT)
RETURNS BIGINT
AS
BEGIN
    DECLARE @result BIGINT = 1;
    DECLARE @i INT = 1;

    WHILE @i <= @n
    BEGIN
        SET @result = @result * @i;
        SET @i = @i + 1;
    END

    RETURN @result;
END

-- test the code
SELECT dbo.Factorial(5) AS FactorialResult; -- result 120

One can obtain some improvement by changing the return type and @result from bigint to float, though the results are then only approximations.
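
A minimal sketch of the modified function (same name, created after dropping the bigint version; only the return type and the variable's data type change):

-- factorial function using float (sketch)
CREATE FUNCTION dbo.Factorial (@n INT)
RETURNS FLOAT
AS
BEGIN
    DECLARE @result FLOAT = 1;
    DECLARE @i INT = 1;

    WHILE @i <= @n
    BEGIN
        SET @result = @result * @i;
        SET @i = @i + 1;
    END

    RETURN @result;
END

Here are the approximated results returned by the float version: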

-- test code
SELECT dbo.Factorial(20) AS FactorialResult; -- result 2432902008176640000
SELECT dbo.Factorial(21) AS FactorialResult; -- result 5.109094217170944E+19
SELECT dbo.Factorial(22) AS FactorialResult; -- result 1.1240007277776077E+21
SELECT dbo.Factorial(23) AS FactorialResult; -- result 2.585201673888498E+22
SELECT dbo.Factorial(24) AS FactorialResult; -- result 6.204484017332394E+23
SELECT dbo.Factorial(25) AS FactorialResult; -- result 1.5511210043330986E+25
SELECT dbo.Factorial(26) AS FactorialResult; -- result 4.0329146112660565E+26
SELECT dbo.Factorial(27) AS FactorialResult; -- result 1.0888869450418352E+28
SELECT dbo.Factorial(28) AS FactorialResult; -- result 3.0488834461171384E+29
SELECT dbo.Factorial(29) AS FactorialResult; -- result 8.841761993739701E+30
SELECT dbo.Factorial(30) AS FactorialResult; -- result 2.6525285981219103E+32
SELECT dbo.Factorial(31) AS FactorialResult; -- result 8.222838654177922E+33
SELECT dbo.Factorial(32) AS FactorialResult; -- result 2.631308369336935E+35
SELECT dbo.Factorial(170) AS FactorialResult; -- result 7.257415615307994E+306
SELECT dbo.Factorial(171) AS FactorialResult; -- results in arithmetic overflow

Another construct that allows recursive calculations is the common table expression (CTE).

Prompt:
Create a recursive CTE in SQL Server for calculating n factorial.

Output (M365 Copilot):

-- n factorial as CTE
WITH FactorialCTE AS (
    -- Anchor member: factorial of 0 is 1
    SELECT 0 AS n, 1 AS factorial
    UNION ALL
    -- Recursive member: factorial of n is n * factorial of (n-1)
    SELECT n + 1, (n + 1) * factorial
    FROM FactorialCTE
    WHERE n < 10 -- Change this value to calculate factorial for larger numbers
)
SELECT n, factorial
FROM FactorialCTE;

One can cast the value in the anchor member to bigint or float, as above, with all the constraints resulting from this:

-- n factorial as CTE
WITH FactorialCTE AS (
    -- Anchor member: factorial of 0 is 1
    SELECT 0 AS n, Cast(1 as bigint) AS factorial
    UNION ALL
    -- Recursive member: factorial of n is n * factorial of (n-1)
    SELECT n + 1, (n + 1) * factorial
    FROM FactorialCTE
    WHERE n < 20 -- Change this value to calculate factorial for larger numbers
)
SELECT n, factorial
FROM FactorialCTE;
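
For values beyond 20, a float variant can be sketched as follows (an assumption based on the cast mentioned above; the MAXRECURSION hint is needed once the number of iterations exceeds the default limit of 100):

-- n factorial as CTE with float (sketch)
WITH FactorialCTE AS (
    -- Anchor member: factorial of 0 is 1
    SELECT 0 AS n, Cast(1 as float) AS factorial
    UNION ALL
    -- Recursive member: factorial of n is n * factorial of (n-1)
    SELECT n + 1, (n + 1) * factorial
    FROM FactorialCTE
    WHERE n < 170 -- float overflows for values bigger than 170!
)
SELECT n, factorial
FROM FactorialCTE
OPTION (MAXRECURSION 200); -- the default of 100 recursion levels is not enough here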

Case 2: Approximation of π

Power series can be used to approximate the value of π and other irrational constants. Writing such a function via recursive functions has the same constraint as above, though there are alternatives (including storing the actual value as a constant).
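
For reference, the series implemented by the generated code below appears to be the Leibniz series (an assumption based on the formula in the code):

π ≈ 4 · (1 - 1/3 + 1/5 - 1/7 + ...) = Σ 4·(-1)^k / (2k + 1), summed for k = 0 to n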

Prompt:
Generate a recursive function in SQL Server to approximate the value of π using power laws.

Output (M365 Copilot):

-- approximate the value of π using this series:
CREATE FUNCTION dbo.ApproximatePi (@n INT)
RETURNS FLOAT
AS
BEGIN
    IF @n = 0
        RETURN 4.0
    ELSE
        RETURN 4.0 * POWER(-1, @n) / (2 * @n + 1) + dbo.ApproximatePi(@n - 1)
END

-- test the function
SELECT dbo.ApproximatePi(1000) AS PiApproximation;

The function's definition has the same issue as the previous example, and it needs to be rewritten as a single RETURN statement:

-- approximate the value of π using this series (rewritten)
CREATE FUNCTION dbo.ApproximatePi (@n INT)
RETURNS FLOAT
AS
BEGIN
    RETURN
        CASE 
            WHEN @n = 0 THEN 4.0
            ELSE
                4.0 * POWER(-1, @n) / (2 * @n + 1) + dbo.ApproximatePi(@n - 1)
        END
END

-- testing the function
SELECT dbo.ApproximatePi (0) -- output: 4
SELECT dbo.ApproximatePi (1) -- output: 2.666666666667
SELECT dbo.ApproximatePi (2) -- output: 3.466666666667
SELECT dbo.ApproximatePi (3) -- output: 2.8952380952389998
SELECT dbo.ApproximatePi (10) -- output: 3.232315809407
SELECT dbo.ApproximatePi (20) -- output: 3.189184782279
SELECT dbo.ApproximatePi (30) -- output: 3.1738423371919993
SELECT dbo.ApproximatePi (31) -- output: 3.110350273699999
SELECT dbo.ApproximatePi (32) -- output: 3.171888735237999
SELECT dbo.ApproximatePi (33) -- throws error

Beyond 32 nesting levels, the call throws the same error as above. Fortunately, such functions can be (re)written using non-recursive implementations, which is the proper way to implement them:

Prompt:
Implement the function without using recursive functions.

Output (M365 Copilot):

-- approximation within loop
CREATE FUNCTION dbo.ApproximatePi (@terms INT)
RETURNS FLOAT
AS
BEGIN
    DECLARE @pi FLOAT = 0.0;
    DECLARE @i INT = 0;

    WHILE @i < @terms
    BEGIN
        SET @pi = @pi + 4.0 * POWER(-1, @i) / (2 * @i + 1);
        SET @i = @i + 1;
    END

    RETURN @pi;
END 

This implementation is more flexible, and a similar approach can be used for other series. Here are the results for various parameters:

-- testing the function
SELECT dbo.ApproximatePi(1000) AS PiApproximation; -- result: 3.140592653840002
SELECT dbo.ApproximatePi(10000) AS PiApproximation; -- result: 3.141492653571029
SELECT dbo.ApproximatePi(100000) AS PiApproximation; -- result: 3.141582653466945
SELECT dbo.ApproximatePi(1000000) AS PiApproximation; -- result: 3.1415916535590713

The last call is close, but probably not close enough, especially when one needs to use the value in multiple places. There are more complex power series that converge faster. In practice, the direct use of the pi() function is usually recommended:

-- using the pi function
SELECT pi(); -- 3.141592653589793
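
As an illustration of a faster-converging series, here is a minimal sketch based on the Nilakantha series; the function name and the choice of series are illustrative and not part of the generated output:

-- approximation via the Nilakantha series (sketch)
CREATE FUNCTION dbo.ApproximatePiNilakantha (@terms INT)
RETURNS FLOAT
AS
BEGIN
    DECLARE @pi FLOAT = 3.0;
    DECLARE @sign FLOAT = 1.0;
    DECLARE @i INT = 1;

    WHILE @i <= @terms
    BEGIN
        -- each term: 4 / ((2i) * (2i+1) * (2i+2)), with alternating sign
        SET @pi = @pi + @sign * 4.0 / ((2.0 * @i) * (2.0 * @i + 1) * (2.0 * @i + 2));
        SET @sign = -@sign;
        SET @i = @i + 1;
    END

    RETURN @pi;
END

-- testing the sketch
SELECT dbo.ApproximatePiNilakantha(100) AS PiApproximation;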

Moreover, some series can be reduced to formulas based on constants like π, e or Φ.

And, a final piece of code:

-- drop the function
DROP FUNCTION IF EXISTS dbo.ApproximatePi

One can use a CTE also in this case:

Prompt:
Generate a recursive function in SQL Server to approximate the value of π using power laws.

Output (M365 Copilot):

-- series via recursive CTE
WITH PiApproximationCTE AS (
    -- Anchor member: start with k = 0
    SELECT 0 AS k, CAST(4.0 AS FLOAT) AS term, CAST(4.0 AS FLOAT) AS pi_approx
    UNION ALL
    -- Recursive member: calculate the next term and add it to the approximation
    SELECT 
        k + 1,
        CAST(4.0 * POWER(-1, k + 1) / (2 * (k + 1) + 1) AS FLOAT),
        pi_approx + CAST(4.0 * POWER(-1, k + 1) / (2 * (k + 1) + 1) AS FLOAT)
    FROM PiApproximationCTE
    WHERE k < 1000 -- Change this value to increase the number of terms
)
SELECT k, pi_approx
FROM PiApproximationCTE
OPTION (MAXRECURSION 0); -- Allow for more recursion levels if needed

This time Copilot generated a different approach for calculating the same value. (The output is too long to be included here, as one record is generated for each iteration.)
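
To return only the final approximation instead of one record per iteration, the final SELECT can be restricted, for example as in the following sketch based on the CTE above:

-- returning only the last iteration (sketch)
WITH PiApproximationCTE AS (
    -- Anchor member: start with k = 0
    SELECT 0 AS k, CAST(4.0 AS FLOAT) AS term, CAST(4.0 AS FLOAT) AS pi_approx
    UNION ALL
    -- Recursive member: calculate the next term and add it to the approximation
    SELECT 
        k + 1,
        CAST(4.0 * POWER(-1, k + 1) / (2 * (k + 1) + 1) AS FLOAT),
        pi_approx + CAST(4.0 * POWER(-1, k + 1) / (2 * (k + 1) + 1) AS FLOAT)
    FROM PiApproximationCTE
    WHERE k < 1000
)
SELECT TOP 1 k, pi_approx
FROM PiApproximationCTE
ORDER BY k DESC
OPTION (MAXRECURSION 0); -- needed, as the default limit is 100 recursion levels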

Happy coding!

Previous Post <<||>> Next Post

15 February 2025

🧭Business Intelligence: Perspectives (Part XXVII: A Tale of Two Cities II)

Business Intelligence Series
There's a saying that applies to many contexts ranging from software engineering to data analysis and visualization solutions: "fools rush in where angels fear to tread" [1]. Much earlier, an adage attributed to Confucius provided a similar perspective: "do not try to rush things; ignore matters of minor advantage". Ignoring this advice, rapid prototyping often drives people to jump in with both feet before checking how solid the ground is, sometimes without having adequate experience in the field. That's understandable to some degree: people want to see progress and value fast, without building a foundation or getting an understanding of what's happening, respectively of what's possible, often ignoring the full extent of the problems.

A prototype helps to bring the requirements closer to what's intended to be achieved, though, as practice often shows, bridging the gap between the initial steps and the final solution requires many iterations, sometimes even too many to keep the solution cost-effective. There's almost always a tradeoff between costs and quality, respectively time and scope. Sooner or later, one must compromise somewhere in between, even if the solution is not optimal. The fuzzier the requirements and what's achievable with a set of data, the harder it gets to find the sweet spot.

Even if people understand the steps, constraints and further aspects of a process relatively easily, making sense of the data generated by it, respectively using that data to optimize the process, can take a considerable effort. In each context there's a chain of tradeoffs and constraints that applies to the situation at hand, which makes it challenging to always find optimal solutions. Moreover, optimal local solutions don't necessarily provide the optimum effect when one looks at the broader context of the problems. Further on, even if one brought a process under control, it doesn't necessarily mean that the process works efficiently.

This is the broader context in which data analysis and visualization topics need to be placed to build useful solutions, to make a sensible difference in one’s job. Especially when the data and processes look numb, one needs to find the perspectives that lead to useful information, respectively knowledge. It’s not realistic to expect to find new insight in any set of data. As experience often proves, insight is rarer than finding gold nuggets. Probably, the most important aspect in gold mining is to know where to look, though it also requires luck, research, the proper use of tools, effort, and probably much more.

One of the problems in working with data is that data is usually analyzed and visualized in aggregates at different levels, often without identifying and depicting the factors that determine why the data takes certain shapes. Even if a well-suited set of dimensions is defined for data analysis, data is usually still considered in aggregate. Having the possibility to switch between aggregates and details is quintessential for understanding the data, or at least for getting an understanding of what's happening in the various processes.

There is one aspect of data modeling, respectively analysis and visualization that’s typically ignored in BI initiatives – process-wise there is usually data which is not available and approximating the respective values to some degree is often far from the optimal solution. Of course, there’s often a tradeoff between effort and value, though the actual value can be quantified only when gathering enough data for a thorough first analysis. It may also happen that the only benefit is getting a deeper understanding of certain aspects of the processes, respectively business. Occasionally, this price may look high, though searching for cost-effective solutions is part of the job!

Previous Post  <<||>>  Next Post

References:
[1] Alexander Pope (cca. 1711) An Essay on Criticism

06 February 2025

🌌🏭KQL Reloaded: First Steps (Part VI: Actual vs. Estimated Count)

More examples are available for developers nowadays, at least compared with 10-20 years ago, when, besides the scarce documentation, blogs and the source code from books were the only ways to find out how a function or a piece of standard functionality works. Copying code without understanding it may lead to unexpected results, with all the consequences resulting from this.

A recent example in this direction in KQL is the pair of dcount and dcountif functions, which, according to the documentation, calculate an estimate of the number of distinct values taken by a scalar expression in the summary group. An estimate is not the actual number of records; accuracy is traded for performance. The best illustration is the following piece of code:

// counting records 
NewSales
| summarize record_count = count() // number of records available
    , aprox_distinct_count = dcount(CustomerKey) // estimated distinct values
    , distinct_count = count_distinct(CustomerKey) // actual number of distinct values
    , aprox_distict_count_by_value  = dcountif(CustomerKey, SalesAmount <> 0) // estimated distinct values for records with non-zero amounts
    , distict_count_by_value  = count_distinctif(CustomerKey, SalesAmount <> 0) // actual distinct values for records with non-zero amounts
    , aprox_distict_count_by_value2  = dcountif(CustomerKey, SalesAmount == 0) // estimated distinct values for records with zero amounts
    , distict_count_by_value2  = count_distinctif(CustomerKey, SalesAmount == 0) // actual distinct values for records with zero amounts
| extend error_aprox_distinct_count = distinct_count - aprox_distinct_count
    , error_aprox_distict_count_by_value = distict_count_by_value - aprox_distict_count_by_value
Output:
record_count: 2832193
aprox_distinct_count: 18497
distinct_count: 18484
aprox_distict_count_by_value: 18497
distict_count_by_value: 18484
aprox_distict_count_by_value2: 10251
distict_count_by_value2: 10219
error_aprox_distinct_count: -13
error_aprox_distict_count_by_value: -13
It's interesting that the same difference is observable also when a narrower time interval is chosen (e.g. 1 month). When using an estimate, it's important to understand how big the error between the actual value and the estimate is, and that's the purpose of the last two lines added to the query. In many scenarios the difference might be negligible, until it is not.

One can wonder whether the two functions are deterministic, in other words whether they return the same results when given the same input values. It would also be useful to understand the performance of the two estimating functions, especially when further constraints are applied.

Moreover, the functions accept an additional accuracy parameter that allows control over the trade-off between speed and accuracy (see the accuracy levels used below).

// counting records 
NewSales
| summarize record_count = count() // number of records available
    , distinct_count = count_distinct(CustomerKey) // actual number of distinct values
    , aprox_distinct_count = dcount(CustomerKey) // estimated distinct values (default accuracy)
    , aprox_distinct_count0 = dcount(CustomerKey, 0) // accuracy level 0 (least accurate)
    , aprox_distinct_count1 = dcount(CustomerKey, 1) // accuracy level 1 (default)
    , aprox_distinct_count2 = dcount(CustomerKey, 2) // accuracy level 2
    , aprox_distinct_count3 = dcount(CustomerKey, 3) // accuracy level 3
    , aprox_distinct_count4 = dcount(CustomerKey, 4) // accuracy level 4 (most accurate)
Output:
record_count: 2832193
distinct_count: 18484
aprox_distinct_count: 18497
aprox_distinct_count0: 18793
aprox_distinct_count1: 18497
aprox_distinct_count2: 18500
aprox_distinct_count3: 18470
aprox_distinct_count4: 18487

It will be interesting to see which of these accuracy levels is used in practice. The problems usually start when different accuracy levels are used interchangeably with no prior agreement. How could one argue in favor of one level over the others?

A natural question: how big is the error introduced by each accuracy level? Usually, when approximating values, one also needs to specify the expected error somehow. The documentation provides some guiding values, though are these values enough? Would similar estimating functions make sense for the other aggregate functions as well?

By contrast, count_distinct and count_distinctif seem to still be in preview, with all the consequences derived from this. They are supposed to be more resource-intensive than their estimating counterparts. On the other hand, the values returned can still be rounded in dashboards to a meaningful unit (e.g. thousands), which usually depends on the context. The same question about rounding can be raised for the estimating counterparts as well. It would be interesting to check how far apart the rounded values of the two sets of functions are from each other.

In practice, counting is useful for calculating percentages (e.g. how many customers come from a certain zone compared to the total), which are easier to grasp than big numbers: 

// calculating percentages from totals
NewSales
| where SalesAmount <> 0 and ProductCategoryName == 'TV and Video'
| where DateKey >=date(2023-02-01) and DateKey < datetime(2023-03-01)
| summarize count_customers = count_distinct(CustomerKey)
    , count_customers_US = count_distinctif(CustomerKey, RegionCountryName == 'United States')
    , count_customers_CA = count_distinctif(CustomerKey, RegionCountryName == 'Canada')
    , count_customers_other = count_distinctif(CustomerKey, not(RegionCountryName in ('United States', 'Canada')))
| extend percent_customers_US = iif(count_customers<>0, round(100.00 * count_customers_US/count_customers, 2), 0.00)
    , percent_customers_CA = iif(count_customers<>0, round(100.00 * count_customers_CA/count_customers, 2), 0.00)
    , percent_customers_other = iif(count_customers<>0, round(100.00 * count_customers_other/count_customers,2), 0.00)
Output:
count_customers: 10317
count_customers_US: 3912
count_customers_CA: 789
count_customers_other: 5616
percent_customers_US: 37.92
percent_customers_CA: 7.65
percent_customers_other: 54.43

Note:
When showing percentages, it's important to also provide the "context", i.e. the actual count or amount. This helps the reader understand the scale associated with the percentages.

Happy coding!

Previous Post <<||>> Next Post

Resources:
[R1] Arcane Code (2022) Fun With KQL – DCount, by R C Cain [link]
[R2] M Morowczynski et al (2024) "The Definitive Guide to KQL" [sample]
[R3] M Zorich (2022) Too much noise in your data? Summarize it! [link]

11 March 2024

🧭🚥Business Intelligence: Key Performance Indicators [KPI] (Between Certainty and Uncertainty)

Business Intelligence Series

Despite the huge collection of documented Key Performance Indicators (KPIs) and best practices on which KPIs to choose, selecting a reliable set of KPIs that reflects how the organization performs in achieving its objectives continues to be a challenge for many organizations. Ideally, for each objective there should be only one KPI that reflects the target and the progress made, though is that realistic?

Let's try to use the driver's metaphor to exemplify several aspects related to the choice of KPIs. A driver's goal is to travel from point A to point B over a distance d in x hours. The goal is SMART (Specific, Measurable, Achievable, Relevant, and Time-bound) if the speed and time are realistic and don't contradict physical or legal laws. The driver can define the objective as "arriving on time at the destination". 

One can define a set of metrics based on the numbers that can be measured. We have the overall distance and the number of hours planned, from which one can derive an expected average speed v. To track a driver's progress over time, several metrics can thus be used: e.g., (1) the current average speed, (2) the number of kilometers to the destination, (3) the number of hours estimated to the destination. However, none of these metrics can be used alone to denote the performance. One can compare the expected with the current average speed to get a grasp of the performance, and probably many organizations will use only (1) as KPI, though either (2) or (3) is needed to get the complete picture. So, in theory two KPIs should be enough. Is it so?

When estimating (3), one assumes that there are no impediments and that the average speed can be attained, which might be correct for a road without traffic. There can be several impediments - planned/unplanned breaks, traffic jams, speed limits, accidents or other unexpected events, weather conditions (that depend on the season), etc. Besides the above formula, one needs to quantify such events in one form or another, e.g., through the perspective of the time added to the initial estimation from (3). However, this calculation is based on historical values or the navigator's estimation, a value which can be higher or lower than the final one. 

Therefore, (3) is an approximation that also needs a confidence interval (± t hours). The value can still include a lot of uncertainty that maybe needs to be broken down and quantified separately, case by case, to identify the deviation from expectations, e.g. on average there are 3 traffic jams (4), or if the road crosses states or countries there may be at least 1 control on average (5), etc. These numbers can be included in (3) and the confidence interval, and usually don't need to be reported separately, though there are probably exceptions. 

When planning, one also needs to consider the number of stops for refueling or recharging the car, and the average duration of such stops, which can be included in (3) as well. However, (3) slowly becomes too complex a formula, and even if there's an estimation, the more facts we're pulling into it, the bigger the confidence interval's variation will be. Sometimes, it's preferable to have two or three other metrics with a low confidence interval than one with high variation. Moreover, the longer the distance planned, the higher the uncertainty. It's one thing to plan a trip between two neighboring cities, and another to plan a trip around the world. 

Another assumption is that the capability of the driver/car to drive is the same over time, which is not always the case. This can be neglected occasionally (e.g. one trip), though it involves a risk (6) that might be useful to quantify, especially when the process is repeatable (e.g. regular commuting). The risk value can increase considering new information, e.g. knowing that every few thousand kilometers something breaks, or that there's a traffic fine or an accident. With new information, the objective might also change, e.g. arrive on time, safe and without fines at the destination. As the objective changes or further objectives are added, more metrics can be defined. It would make sense to measure how many kilometers the driver covered in a lifetime with the car (7), how many accidents (8) or how many fines (9) the driver had. (7) is not related to a driver's performance, but (8) and (9) are. 

As can be seen, simple processes can also become very complex if one attempts to consider all the facts and/or quantify the uncertainty. The driver's metaphor applies to a single individual, though once the same process is considered across the whole organization (a group of drivers), more complexity is added and the perspective changes completely. E.g., some drivers might not even reach the destination or not even have a car to start with, and so on. Of course, with this the objectives also change and need to be redefined accordingly. 

The driver's metaphor is good for considering planning activities in which a volume of work needs to be completed in a given time and where a set of constraints apply. Therefore, for some organizations, just using two numbers might be enough for getting a feeling for what's happening. However, as soon as one needs to consider other aspects like safety or compliance (considered in aggregation across many drivers), there might be other metrics that qualify as KPIs.

It's tempting to add two numbers and consider for example (8) and (9) together as the two are events that can be cumulated, even if they refer to different things that can overlap (an accident can result in a fine and should be counted maybe only once). One needs to make sure that one doesn't add apples with juice - the quantified values must have the same unit of measure, otherwise they might need to be considered separately. There's the tendency of mixing multiple metrics in a KPI that doesn't say much if the units of measure of its components are not the same. Some conversions can still be made (e.g. how much juice can be obtained from apples), though that's seldom the case.

Previous Post <<||>> Next Post

17 December 2018

🔭Data Science: Method (Just the Quotes)

"There are two aspects of statistics that are continually mixed, the method and the science. Statistics are used as a method, whenever we measure something, for example, the size of a district, the number of inhabitants of a country, the quantity or price of certain commodities, etc. […] There is, moreover, a science of statistics. It consists of knowing how to gather numbers, combine them and calculate them, in the best way to lead to certain results. But this is, strictly speaking, a branch of mathematics." (Alphonse P de Candolle, "Considerations on Crime Statistics", 1833)

"The process of discovery is very simple. An unwearied and systematic application of known laws to nature, causes the unknown to reveal themselves. Almost any mode of observation will be successful at last, for what is most wanted is method." (Henry D Thoreau, "A Week on the Concord and Merrimack Rivers", 1862)

"As systematic unity is what first raises ordinary knowledge to the rank of science, that is, makes a system out of a mere aggregate of knowledge, architectonic is the doctrine of the scientific in our knowledge, and therefore necessarily forms part of the doctrine of method." (Immanuel Kant, "Critique of Pure Reason", 1871)

"Nothing is more certain in scientific method than that approximate coincidence alone can be expected. In the measurement of continuous quantity perfect correspondence must be accidental, and should give rise to suspicion rather than to satisfaction." (William S Jevons, "The Principles of Science: A Treatise on Logic and Scientific Method", 1874)

"The object of statistical science is to discover methods of condensing information concerning large groups of allied facts into brief and compendious expressions suitable for discussion. The possibility of doing this is based on the constancy and continuity with which objects of the same species are found to vary." (Sir Francis Galton, "Inquiries into Human Faculty and Its Development, Statistical Methods", 1883)

"Physical research by experimental methods is both a broadening and a narrowing field. There are many gaps yet to be filled, data to be accumulated, measurements to be made with great precision, but the limits within which we must work are becoming, at the same time, more and more defined." (Elihu Thomson, "Annual Report of the Board of Regents of the Smithsonian Institution", 1899)

"A statistical estimate may be good or bad, accurate or the reverse; but in almost all cases it is likely to be more accurate than a casual observer’s impression, and the nature of things can only be disproved by statistical methods." (Sir Arthur L Bowley, "Elements of Statistics", 1901)

"A method is a dangerous thing unless its underlying philosophy is understood, and none more dangerous than the statistical. […] Over-attention to technique may actually blind one to the dangers that lurk about on every side- like the gambler who ruins himself with his system carefully elaborated to beat the game. In the long run it is only clear thinking, experienced methods, that win the strongholds of science." (Edwin B Wilson, "The Statistical Significance of Experimental Data", Science, Volume 58 (1493), 1923)

"[…] the methods of statistics are so variable and uncertain, so apt to be influenced by circumstances, that it is never possible to be sure that one is operating with figures of equal weight." (Havelock Ellis, "The Dance of Life", 1923)

"Statistics may be regarded as (i) the study of populations, (ii) as the study of variation, and (iii) as the study of methods of the reduction of data." (Sir Ronald A Fisher, "Statistical Methods for Research Worker", 1925)

"Science is but a method. Whatever its material, an observation accurately made and free of compromise to bias and desire, and undeterred by consequence, is science." (Hans Zinsser, "Untheological Reflections", The Atlantic Monthly, 1929)

"The most important application of the theory of probability is to what we may call 'chance-like' or 'random' events, or occurrences. These seem to be characterized by a peculiar kind of incalculability which makes one disposed to believe - after many unsuccessful attempts - that all known rational methods of prediction must fail in their case. We have, as it were, the feeling that not a scientist but only a prophet could predict them. And yet, it is just this incalculability that makes us conclude that the calculus of probability can be applied to these events." (Karl R Popper, "The Logic of Scientific Discovery", 1934)

"The fundamental difference between engineering with and without statistics boils down to the difference between the use of a scientific method based upon the concept of laws of nature that do not allow for chance or uncertainty and a scientific method based upon the concepts of laws of probability as an attribute of nature." (Walter A Shewhart, 1940)

"[Statistics] is both a science and an art. It is a science in that its methods are basically systematic and have general application; and an art in that their successful application depends to a considerable degree on the skill and special experience of the statistician, and on his knowledge of the field of application, e.g. economics." (Leonard H C Tippett, "Statistics", 1943)

"Statistics is the branch of scientific method which deals with the data obtained by counting or measuring the properties of populations of natural phenomena. In this definition 'natural phenomena' includes all the happenings of the external world, whether human or not " (Sir Maurice G Kendall, "Advanced Theory of Statistics", Vol. 1, 1943)

"We can scarcely imagine a problem absolutely new, unlike and unrelated to any formerly solved problem; but if such a problem could exist, it would be insoluble. In fact, when solving a problem, we should always profit from previously solved problems, using their result or their method, or the experience acquired in solving them." (George Polya, 1945)

"The enthusiastic use of statistics to prove one side of a case is not open to criticism providing the work is honestly and accurately done, and providing the conclusions are not broader than indicated by the data. This type of work must not be confused with the unfair and dishonest use of both accurate and inaccurate data, which too commonly occurs in business. Dishonest statistical work usually takes the form of: (1) deliberate misinterpretation of data; (2) intentional making of overestimates or underestimates; and (3) biasing results by using partial data, making biased surveys, or using wrong statistical methods." (John R Riggleman & Ira N Frisbee, "Business Statistics", 1951)

"Statistics is the fundamental and most important part of inductive logic. It is both an art and a science, and it deals with the collection, the tabulation, the analysis and interpretation of quantitative and qualitative measurements. It is concerned with the classifying and determining of actual attributes as well as the making of estimates and the testing of various hypotheses by which probable, or expected, values are obtained. It is one of the means of carrying on scientific research in order to ascertain the laws of behavior of things - be they animate or inanimate. Statistics is the technique of the Scientific Method." (Bruce D Greenschields & Frank M Weida, "Statistics with Applications to Highway Traffic Analyses", 1952)

"The methods of science may be described as the discovery of laws, the explanation of laws by theories, and the testing of theories by new observations. A good analogy is that of the jigsaw puzzle, for which the laws are the individual pieces, the theories local patterns suggested by a few pieces, and the tests the completion of these patterns with pieces previously unconsidered." (Edwin P Hubble, "The Nature of Science and Other Lectures", 1954)

"We have to remember that what we observe is not nature herself, but nature exposed to our method of questioning." (Werner K Heisenberg, "Physics and Philosophy: The revolution in modern science", 1958)

"We are committed to the scientific method, and measurement is the foundation of that method; hence we are prone to assume that whatever is measurable must be significant and that whatever cannot be measured may as well be disregarded." (Joseph W Krutch, "Human Nature and the Human Condition", 1959)

"Scientific method is the way to truth, but it affords, even in principle, no unique definition of truth. Any so-called pragmatic definition of truth is doomed to failure equally." (Willard v O Quine, "Word and Object", 1960)

"Observation, reason, and experiment make up what we call the scientific method." (Richard Feynman, "Mainly mechanics, radiation, and heat", 1963)

"Engineering is the art of skillful approximation; the practice of gamesmanship in the highest form. In the end it is a method broad enough to tame the unknown, a means of combing disciplined judgment with intuition, courage with responsibility, and scientific competence within the practical aspects of time, of cost, and of talent." (Ronald B Smith, "Professional Responsibility of Engineering", Mechanical Engineering Vol. 86 (1), 1964)

"Statistics is a body of methods and theory applied to numerical evidence in making decisions in the face of uncertainty." (Lawrence Lapin, "Statistics for Modern Business Decisions", 1973)

"Statistical methods of analysis are intended to aid the interpretation of data that are subject to appreciable haphazard variability." (David V. Hinkley & David Cox, "Theoretical Statistics", 1974)

"Scientists use mathematics to build mental universes. They write down mathematical descriptions - models - that capture essential fragments of how they think the world behaves. Then they analyse their consequences. This is called 'theory'. They test their theories against observations: this is called 'experiment'. Depending on the result, they may modify the mathematical model and repeat the cycle until theory and experiment agree. Not that it's really that simple; but that's the general gist of it, the essence of the scientific method." (Ian Stewart & Martin Golubitsky, "Fearful Symmetry: Is God a Geometer?", 1992)

"But our ways of learning about the world are strongly influenced by the social preconceptions and biased modes of thinking that each scientist must apply to any problem. The stereotype of a fully rational and objective ‘scientific method’, with individual scientists as logical (and interchangeable) robots, is self-serving mythology." (Stephen J Gould, "This View of Life: In the Mind of the Beholder", Natural History Vol. 103, No. 2, 1994)

"The methods of science include controlled experiments, classification, pattern recognition, analysis, and deduction. In the humanities we apply analogy, metaphor, criticism, and (e)valuation. In design we devise alternatives, form patterns, synthesize, use conjecture, and model solutions." (Béla H Bánáthy, "Designing Social Systems in a Changing World", 1996) 

"Data are generally collected as a basis for action. However, unless potential signals are separated from probable noise, the actions taken may be totally inconsistent with the data. Thus, the proper use of data requires that you have simple and effective methods of analysis which will properly separate potential signals from probable noise." (Donald J Wheeler, "Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"No matter what the data, and no matter how the values are arranged and presented, you must always use some method of analysis to come up with an interpretation of the data.
While every data set contains noise, some data sets may contain signals. Therefore, before you can detect a signal within any given data set, you must first filter out the noise." (Donald J Wheeler," Understanding Variation: The Key to Managing Chaos" 2nd Ed., 2000)

"Scientists pursue ideas in an ill-defined but effective way that is often called the scientific method. There is no strict rule of procedure that will lead you from a good idea to a Nobel prize or even to a publishable discovery. Some scientists are meticulously careful; others are highly creative. The best scientists are probably both careful and creative. Although there are various scientific methods in use, a typical approach consists of a series of steps." (Peter Atkins et al, "Chemical Principles: The Quest for Insight" 6th ed., 2013)

"Science, at its core, is simply a method of practical logic that tests hypotheses against experience. Scientism, by contrast, is the worldview and value system that insists that the questions the scientific method can answer are the most important questions human beings can ask, and that the picture of the world yielded by science is a better approximation to reality than any other." (John M Greer, "After Progress: Reason and Religion at the End of the Industrial Age", 2015)

"The general principles of starting with a well-defined question, engaging in careful observation, and then formulating hypotheses and assessing the strength of evidence for and against them became known as the scientific method." (Michael Friendly & Howard Wainer, "A History of Data Visualization and Graphic Communication", 2021)
