Showing posts with label data types. Show all posts
Showing posts with label data types. Show all posts

03 March 2025

💎🤖SQL Reloaded: Copilot Stories (Part VIII: Recursive Functions)

In theory, recursive functions prove useful in generating a list of numbers based on a formula in which the final result is based on the result of the previous step. In other words, a problem is solved by calculating the previous iteration of the same problem. In practice, they are limited by the number of iterations allowed, and there is a hard limit in SQL Server, the available resources (usually memory), and probably a few other factors. Of course, some algorithms can be implemented to overcome the limitations, while other times there are quite simple linear implementations based on a loop, like in the below examples. Therefore, recursive functions are more of a theoretical concept.   

Disclaimer: The below code was generated by Copilot via prompts and is used for demonstration purposes!

This post evaluates whether Microsoft 365 and Edge Copilot can be used to generate the code for some recursive functions. 

Case 1: Factorials

Let's start with the factorial that calculates the product of the n consecutive numbers.

Prompt:
Generate a recursive function in SQL Server to calculate the factorial of a number.

Output (M365 Copilot):

-- factorial function in Microsoft 365 Copilot
CREATE FUNCTION dbo.Factorial (@n INT)
RETURNS BIGINT
AS
BEGIN
    IF @n <= 1
        RETURN 1
    ELSE
        RETURN @n * dbo.Factorial(@n - 1)
END

Output (Edge Copilot):

-- factorial function in Edge Copilot
CREATE FUNCTION dbo.Factorial (@n INT)
RETURNS BIGINT
AS
BEGIN
    IF @n = 0 OR @n = 1
        RETURN 1
    ELSE
        RETURN @n * dbo.Factorial(@n - 1)
END;

The structure seems correct, and the returning value was identified correctly as bigint, which allows the maximum value in working with integer values. Unfortunately, both functions lead to the same error in a SQL database:

"Msg 455, Level 16, State 2, Line 9, The last statement included within a function must be a return statement."

The formulas ca be consolidated in a single statement as follows:

-- factorial function adapted (minimal)
CREATE FUNCTION dbo.Factorial (@n INT)
RETURNS BIGINT
AS
BEGIN
    RETURN 
        CASE 
            WHEN @n = 0 OR @n = 1 THEN 1
        ELSE
            @n * dbo.Factorial(@n - 1) END
END;

Especially when formulas become complex, it might be a good idea to store the result into a variable and use it in the RETURN statement.

-- factorial function adapted
CREATE FUNCTION dbo.Factorial (@n INT)
RETURNS BIGINT
AS
BEGIN
    DECLARE @retval as bigint;

    IF @n = 0 OR @n = 1
        SET @retval = 1
    ELSE
        SET @retval = @n * dbo.Factorial(@n - 1)

    RETURN @retval
END;

And, here are some tests for the function:

-- testing the functions for multiple values
SELECT dbo.Factorial (0) -- returns 1
SELECT dbo.Factorial (1) -- returns 1
SELECT dbo.Factorial (2) -- returns 2
SELECT dbo.Factorial (3) -- returns 6
SELECT dbo.Factorial (4) -- returns 24
SELECT dbo.Factorial (5) -- returns 120
SELECT dbo.Factorial (20) -- returns 2432902008176640000
SELECT dbo.Factorial (21) -- returns an error for all bigger values

As can be seen, the function will return an error for values bigger than 20, output limited by the data type used. One can build further constraints to test for the maximum value.

Use the following code to drop the function as needed:

-- dropping the function
DROP FUNCTION IF EXISTS dbo.Factorial 

Function's implementation has however another important caveat. If negative numbers are provided as parameters, the following error message is thrown:

"Msg 217, Level 16, State 1, Line 1, Maximum stored procedure, function, trigger, or view nesting level exceeded (limit 32)."

The final function can be defined as:

-- factorial function adapted (final)
CREATE FUNCTION dbo.Factorial (@n INT)
RETURNS BIGINT
AS
BEGIN
    DECLARE @retval as bigint;

    IF @n BETWEEN 0 AND 20
        IF @n = 0 OR @n = 1
            SET @retval = 1
        ELSE
            SET @retval = @n * dbo.Factorial(@n - 1)

    RETURN @retval
END;

One can linearize the function as a loop, though the limitation with the maximum value still exists:

Prompt:
Rewrite the factorial function using a loop.

Output (M365 Copilot):

-- non-recursive function to calculate the factorial of a number using a loop:
CREATE FUNCTION dbo.Factorial (@n INT)
RETURNS BIGINT
AS
BEGIN
    DECLARE @result BIGINT = 1;
    DECLARE @i INT = 1;

    WHILE @i <= @n
    BEGIN
        SET @result = @result * @i;
        SET @i = @i + 1;
    END

    RETURN @result;
END

-- test the code
SELECT dbo.Factorial(5) AS FactorialResult; -- result 120

One can obtain some improvement by changing the return type and @retsult from bigint to float, though the results are always approximations. Here are the approximated results:

-- test code
SELECT dbo.Factorial(20) AS FactorialResult; -- result 2432902008176640000
SELECT dbo.Factorial(21) AS FactorialResult; -- result 5.109094217170944E+19
SELECT dbo.Factorial(22) AS FactorialResult; -- result 1.1240007277776077E+21
SELECT dbo.Factorial(23) AS FactorialResult; -- result 2.585201673888498E+22
SELECT dbo.Factorial(24) AS FactorialResult; -- result 6.204484017332394E+23
SELECT dbo.Factorial(25) AS FactorialResult; -- result 1.5511210043330986E+25
SELECT dbo.Factorial(26) AS FactorialResult; -- result 4.0329146112660565E+26
SELECT dbo.Factorial(27) AS FactorialResult; -- result 1.0888869450418352E+28
SELECT dbo.Factorial(28) AS FactorialResult; -- result 3.0488834461171384E+29
SELECT dbo.Factorial(29) AS FactorialResult; -- result 8.841761993739701E+30
SELECT dbo.Factorial(30) AS FactorialResult; -- result 2.6525285981219103E+32
SELECT dbo.Factorial(31) AS FactorialResult; -- result 8.222838654177922E+33
SELECT dbo.Factorial(32) AS FactorialResult; -- result 2.631308369336935E+35
SELECT dbo.Factorial(170) AS FactorialResult; -- result 7.257415615307994E+306
SELECT dbo.Factorial(171) AS FactorialResult; -- results in arithmetic overflow

Another structure that allows creating recursive functions are the common table expressions (CTEs).

Prompt:
Create a recursive CTE in SQL Server for calculating n factorial.

Output (M365 Copilot):

-- n factorial as CTE
WITH FactorialCTE AS (
    -- Anchor member: factorial of 0 is 1
    SELECT 0 AS n, 1 AS factorial
    UNION ALL
    -- Recursive member: factorial of n is n * factorial of (n-1)
    SELECT n + 1, (n + 1) * factorial
    FROM FactorialCTE
    WHERE n < 10 -- Change this value to calculate factorial for larger numbers
)
SELECT n, factorial
FROM FactorialCTE;

One can cast the value in the anchor member as bigint or float as above with all the constraints resulting from this:

-- n factorial as CTE
WITH FactorialCTE AS (
    -- Anchor member: factorial of 0 is 1
    SELECT 0 AS n, Cast(1 as bigint) AS factorial
    UNION ALL
    -- Recursive member: factorial of n is n * factorial of (n-1)
    SELECT n + 1, (n + 1) * factorial
    FROM FactorialCTE
    WHERE n < 20 -- Change this value to calculate factorial for larger numbers
)
SELECT n, factorial
FROM FactorialCTE;

Case 2: Approximation of π

Power laws can be used to approximate the value of π and other irrational constants. Writing such a function via recursive functions has the same constraint as above, though there are alternatives (including storing the actual value as a constant). 

Prompt:
Generate a recursive function in SQL Server to approximate the value of π using power laws.

Output (M365 Copilot):

-- approximate the value of π using this series:
CREATE FUNCTION dbo.ApproximatePi (@n INT)
RETURNS FLOAT
AS
BEGIN
    IF @n = 0
        RETURN 4.0
    ELSE
        RETURN 4.0 * POWER(-1, @n) / (2 * @n + 1) + dbo.ApproximatePi(@n - 1)
END

-- test the function
SELECT dbo.ApproximatePi(1000) AS PiApproximation;

Function's definition has the same issue as the previous example. It's needed to rewrite the function as a single return:

-- approximate the value of π using this series (rewritten)
CREATE FUNCTION dbo.ApproximatePi (@n INT)
RETURNS FLOAT
AS
BEGIN
    RETURN
        CASE 
            WHEN @n = 0 THEN 4.0
            ELSE
                4.0 * POWER(-1, @n) / (2 * @n + 1) + dbo.ApproximatePi(@n - 1)
        END
END

-- testing the function
SELECT dbo.ApproximatePi (0) -- output: 4
SELECT dbo.ApproximatePi (1) -- output: 2.666666666667
SELECT dbo.ApproximatePi (2) -- output: 3.466666666667
SELECT dbo.ApproximatePi (3) -- output: 2.8952380952389998
SELECT dbo.ApproximatePi (10) -- output: 3.232315809407
SELECT dbo.ApproximatePi (20) -- output: 3.189184782279
SELECT dbo.ApproximatePi (30) -- output: 3.1738423371919993
SELECT dbo.ApproximatePi (31) -- output: 3.110350273699999
SELECT dbo.ApproximatePi (32) -- output: 3.171888735237999
SELECT dbo.ApproximatePi (33) -- throws error

After 32 iterations the call throws the same error as above. Fortunately, such functions can be (re)written using non-recursive implementations, which is the proper way to implementation such functions:

Prompt:
Implement the function without using recursive functions.

Output (M365 Copilot):

-- approximation within loop
CREATE FUNCTION dbo.ApproximatePi (@terms INT)
RETURNS FLOAT
AS
BEGIN
    DECLARE @pi FLOAT = 0.0;
    DECLARE @i INT = 0;

    WHILE @i < @terms
    BEGIN
        SET @pi = @pi + 4.0 * POWER(-1, @i) / (2 * @i + 1);
        SET @i = @i + 1;
    END

    RETURN @pi;
END 

This implementation is more flexible and a similar approach can be used for similar series. And, here are the results for various parameters:

-- testing the function
SELECT dbo.ApproximatePi(1000) AS PiApproximation; -- result: 3.140592653840002
SELECT dbo.ApproximatePi(10000) AS PiApproximation; -- result: 3.141492653571029
SELECT dbo.ApproximatePi(100000) AS PiApproximation; -- result: 3.141582653466945
SELECT dbo.ApproximatePi(1000000) AS PiApproximation; -- result: 3.1415916535590713

The last call is close, but probably not close enough, especially when one needs to use the function in multiple places. There are more complex power series which provide better performance faster. The direct use of the pi() function is usually recommended:

-- using the pi function
SELECT pi(); -- 3.141592653589793

Moreover, some series can be reduced to formulas based on constants like π, e or Φ.

And, a final piece of code:

-- drop the function
DROP FUNCTION IF EXISTS dbo.ApproximatePi

One can use a CTE also in this case:

Prompt:
Generate a recursive function in SQL Server to approximate the value of π using power laws.

Output (M365 Copilot):

-- series via recursive CTE
WITH PiApproximationCTE AS (
    -- Anchor member: start with k = 0
    SELECT 0 AS k, CAST(4.0 AS FLOAT) AS term, CAST(4.0 AS FLOAT) AS pi_approx
    UNION ALL
    -- Recursive member: calculate the next term and add it to the approximation
    SELECT 
        k + 1,
        CAST(4.0 * POWER(-1, k + 1) / (2 * (k + 1) + 1) AS FLOAT),
        pi_approx + CAST(4.0 * POWER(-1, k + 1) / (2 * (k + 1) + 1) AS FLOAT)
    FROM PiApproximationCTE
    WHERE k < 1000 -- Change this value to increase the number of terms
)
SELECT k, pi_approx
FROM PiApproximationCTE
OPTION (MAXRECURSION 0); -- Allow for more recursion levels if needed

This time Copilot generated a different faction for calculating the same. (The output is too long to be added in here as one record is generated  for each iteration.)

Happy coding!

Previous Post <<||>> Next Post

09 February 2025

🏭🗒️Microsoft Fabric: Kusto Query Language (KQL) [Notes]

Disclaimer: This is work in progress intended to consolidate information from various sources for learning purposes. For the latest information please consult the documentation (see the links below)! 

Last updated: 9-Feb-2025

[Microsoft Fabric] Kusto Query Language (KQL)

  • {def} a read-only request to process query language [1]
    • designed for data exploration and summarization [1]
      • very similar to SQL
        • the explain command can be used to transform SQL into KQL code
          • ⇐ not all the SQL syntax can be translated
    • statements are sequenced being executed in the order of their arrangement
      • funnel like processing where data is piped from one operator to the next
        • data is filtered, rearranged or summarized at each step and then fed into the following step
        • statements are sequenced by a pipe (|)
    • returns data in a tabular or graph format
    • designed and developed to take advantage of cloud computing through clustering and scaling compute [2]
      • ideal engine to power fast, real-time dashboards
    • case-sensitive in general 
    • named after the undersea pioneer Jacques Cousteau [2]
    • operation sequence
      • filter data
      • aggregate data
      • order data
      • modify column output
  • supports standard data types 
    • string
      • a sequence of zero or more Unicode characters
      • characters are encoded in UTF-8.
    • int 
      • 32-bit whole-number integer
    • long
      • signed 64-bit whole-number integer
    • real (aka double)
      • 64-bit decimal-based number 
      • and provides high precision with decimal points.
    • decimal
      • a 128-bit decimal number
      • provides the highest precision of decimal points
      • {recommendation} if precision is not needed, use the real type instead [2]
    • bool 
      • a boolean value that can be a true (1), false (0), or null
    • datetime 
      • represents a date in the UTC zone
    • timespan 
      • represents a time interval
        •  days, hours, minutes, seconds, milliseconds, microseconds, tick
      • if no time frame is specified, it will default to day
    • dynamic
      • a special data type that can take 
        • any value from the other data types
        • arrays
        • a {name = value} property bag
    • guid
      • a 128-bit globally unique value
  • statement types
    • tabular expression statement
    • let statement
      • used to 
        • set variable names equal to an expression
        • create views
      • ⇐ used mostly to 
        • help break complex expressions into multiple parts, each represented by a variable
        • sett constants outside the query to aid in readability
    • set statement
      • used to set the query duration
  • {tool}Microsoft Santinel 
    • {def} a cloud native SIEM and SOAR that provides cyberthreat detection, investigation, response, and proactive hunting, with a bird's-eye view across your enterprise [3]
  • {tool} Kusto Explorer
    • {def} user-friendly interface to query and analyze data with KQL [4]
  • {tool} Azure Data Studio 
    • {def} lightweight, cross-platform data management and development tool for data professionals [5]

References:
[1] Microsoft (2024) Real-time Analytics: End-to-End Workshop
[2] Mark Morowczynski et al (2024) The Definitive Guide to KQL: Using Kusto Query Language for Operations, Defending, and Threat Hunting
[3] Microsoft Learn (2024) Azure: What is Microsoft Sentinel [link]
[4] Microsoft Learn (2024) Kusto: Kusto.Explorer installation and user interface [link]
[5] Microsoft Learn (2024) SQL: What is Azure Data Studio? [link]

Resources:
[R1] Microsoft Learn (2025) Fabric: What's new in Microsoft Fabric? [link]

Acronyms:
KQL - Kusto Query Language (
SIEM - security information and event management
SOAR - security orchestration, automation, and response
SQL - Structured Query Language
UTC - Universal Time Coordinated

30 December 2020

🧊Data Warehousing: ETL (Part V: The Transform Subprocess)

Data Warehousing

As part of the ETL process, the Transform subprocess is responsible for bridging the gap between source and destination by leveraging SQL or the rich set of (data) transformations available in ETL tools, either to enable the implicit or explicit conversion between source and destination data types, or to transform the data as needed. 

Transformations act on data as operators, the challenge being to transform the data in the smallest number of steps in the most efficient way. Some of the transformations available in the ETL tools (e.g. conversions, sorting, sampling, joins, lookups, aggregation, pivoting, unpivoting) can be replaced by SQL-based logic. One can easily prepare the data directly in the extraction query, taking thus advantage of the power provided by the database engines. Moreover, the logic can be encapsulated in views or other objects and called as required by the extraction logic when the source database allows it. This approach allows maintaining the logic independently of the ETL packages.

Unfortunately, SQL can replace the transformations that address sequential logic and not workflow-related logic (e.g. conditional splitd, merges, multicasts, slowly changing dimensions) or logic that includes certain computational complexity (e.g. fuzzy groupings or lookups). Such gaps need to be filled by the ETL tools via the built-in transformations, by allowing developers to build custom logic or simple use COTS solutions, when they prove capable of filling the gap. 

Copying the data 1:1 at table or entity-level from the source system(s) involves in theory the simplest transformations, transformations revolving mainly around conversions between data types. The casual troublemakers are the numeric and date values, which can be found in different formats or precisions in the various environments. As this can apply to the ETL environment itself, it’s important to consider environment-agnostic data types when possible (e.g. strings). 

Other sources for concerns are the user-defined data types which don’t have equivalents between the systems, needing thus additional transformations for further handing, respectively the invalid values which need to be handled accordingly. Besides the data from the source system(s) and the derived values, upon case one needs to consider the parameter-based or hardcoded metadata created in the process. 

Independently of the purpose of the ETL packages it is usually required to document the data flow associated with them and the rules applied in transformations in what is known as a mapping document. Such a document needs to be understandable by the business, as it can serve for Data Management, projects, or other purposes.  Even if it’s almost impossible to document everything, at minimum needs to be provided the source and destination tables, the attributes considered in the mappings, respectively the most important rules the business should be aware of. Otherwise, the technical people can always turn back to the SQL queries, when needed. 

Some sources consider each non-trivial transformation as a business rule. Even if the rules used in transformations constrain the (business) data, not each rule is relevant for the business to the degree that it constrains some part of the business.

Data Migrations involve transformations between (database) schemas. Therefore, the logic requested to move the data could be handled in theory with a few well-designed packages, though there are considerations like logic complexity, transparency, flexibility, performance or auditability which could be better handled by using other techniques (e.g. saving the data in intermediary tables, breaking down the logic in several steps). Such considerations can apply also to simple ETL packages. Therefore, it’s important to recognize such scenarios, weight the choices and choose what fits best. However, unless one knows what one’s doing, it’s recommended to use the methods one knows best. 

Previous Post <<||>> Next Post

13 August 2017

#️⃣Software Engineering: SQL Reloaded (Patt II: Who Messed with My Data?)


Introduction

"Errors, like straws, upon the surface flow;
He who would search for pearls must dive below."

(John Dryden) 

Life of a programmer is full of things that stopped working overnight. What’s beautiful about such experiences is that always there is a logical explanation for such “happenings”. There are two aspects - one is how to troubleshoot such problems, and the second – how to avoid such situations, and this is typically done through what we refer as defensive programming. On one side avoiding issues makes one’s life simpler, while issues make it fuller.

I can say that I had plenty such types of challenges in my life, most of them self-created, mainly in the learning process, but also a good share of challenges created by others. Independently of the time spent on troubleshooting such issues, it’s the experience that counts, the little wins against the “dark” side of programming. In the following series of posts I will describe some of the issues I was confronted directly or indirectly over time. In an ad-hoc characterization they can be split in syntax, logical, data, design and systemic errors.

Syntax Errors

"Watch your language young man!"

(anonymous mother) 

Syntax in natural languages like English is the sequence in which words are put together, word’s order indicating the relationship existing between words. Based on the meaning the words carry and the relationships formed between words we are capable to interpret sentences. SQL, initially called SEQUEL (Structured English Query Language) is an English-like language designed to manipulate and retrieve data. Same as natural languages, artificial languages like SQL have their own set of (grammar) rules that when violated lead to runtime errors, leading to interruption in code execution or there can be cases when the code runs further leading to inconsistencies in data. Unlike natural languages, artificial languages interpreters are quite sensitive to syntax errors.

Syntax errors are common to beginners, though a moment of inattention or misspelling can happen to anyone, no matter how versatile one’s coding is. Some are more frequent or have a bigger negative impact than others. Here are some of the typical types of syntax errors:
- missing brackets and quotes, especially in complex formulas;
- misspelled commands, table or column names;
- omitting table aliases or database names;
- missing objects or incorrectly referenced objects or other resources;
- incorrect statement order;
- relying on implicit conversion;
- incompatible data types;
- incorrect parameters’ order;
- missing or misplaced semicolons;
- usage of deprecated syntax.

Typically, syntax errors are easy to track at runtime with minimal testing as long the query is static. Dynamic queries on the other side require sometimes a larger number of combinations to be tested. The higher the number of attributes to be combined and the more complex the logic behind them, the more difficult is to test all combinations. The more combinations not tested, the higher the probability that an error might lurk in the code. Dynamics queries can thus easily become (syntax) error generators.

Logical Errors

"Students are often able to use algorithms to solve numerical problems
without completely understanding the underlying scientific concept."

(Eric Mazur) 

One beautiful aspect of the human mind is that it needs only a rough understanding about how a tool works in order to make use of it up to an acceptable level. Therefore often it settles for the minimum of understanding that allows it to use a tool. Aspects like the limits of a tool, contexts of applicability, how it can be used efficiently to get the job done, or available alternatives, all these can be ignored in the process. As the devil lies in details, misunderstanding how a piece of technology works can prove to be our Achilles’ heel. For example, misunderstanding how sets and the different types of joins work, that lexical order differ from logical order and further to order of execution, when is appropriate or inappropriate to use a certain technique or functionality can make us make poor choices.

One of these poor choices is the method used to solve a problem. A mature programming language can offer sometimes two or more alternatives for solving a problem. Choosing the inadequate solution can lead to performance issues in time. This type of errors can be rooted in the lack of understanding of the data, of how an application is used, or how a piece of technology works.

"I suppose it is tempting, if the only tool you have is a hammer,
to treat everything as if it were a nail."

(Abraham Maslow) 

Some of the errors derive from the difference between how different programming languages work with data. There can be considerable differences between procedural, relational and vector languages. When jumping from one language to another, one can be tempted to apply the same old techniques to the new language. The solution might work, though (by far) not optimal.

The capital mistake is to be the man of one tool, and use it in all the cases, even when not appropriate. For example. when one learned working with views, attempts to apply them all over the code in order to reuse logic, creating thus chains of views which even prove to be flexible, their complexity sooner or later will kick back. Same can happen with stored procedures and other object types as well. A sign of mastery is when the developer adapts his tools to the purpose.

"For every complex problem there is an answer
that is clear, simple, and wrong.
"
(Henry L Mencken) 

One can build elegant solutions but solve the wrong problem. Misunderstanding the problem at hand is one type of error sometimes quite difficult to identify. Typically, they can be found through thorough testing. Sometimes the unavailability of (quality) data can impede the process of testing, such errors being found late in the process.

At the opposite side, one can attempt to solve the right problem but with logic flaws – wrong steps order, wrong algorithm, wrong set of tools, or even missing facts/assumptions. A special type of logical errors are the programmatic errors, which occur when SQL code encounters a logic or behavioral error during processing (e.g. infinite loop, out of range input). [1]

Data Errors

"Data quality requires certain level of sophistication within a company
to even understand that it’s a problem."

(Colleen Graham) 

Poor data quality is the source for all evil, or at least for some of the evil. Typically, a good designed database makes use of a mix of techniques to reduce the chances for inconsistencies: appropriate data types and data granularity, explicit transactions, check constraints, default values, triggers or integrity constraints. Some of these techniques can be too restrictive, therefore in design one has to provide a certain flexibility in the detriment of one of the above techniques, fact that makes the design vulnerable to same range of issues: missing values, missing or duplicate records.

No matter how good a database was designed, sometimes is difficult to cope with users’ ingenuity – misusage of functionality, typically resulting in deviations from standard processes, that can invalidate an existing query. Similar effects have the changes to processes or usage of new processed not addressed in existing queries or reports.

Another topic that have a considerable impact on queries’ correctness is the existence, or better said the inexistence of master data policies and a board to regulate the maintenance of master data. Without proper governance of master data one might end up with a big mess with no way to bring some order in it without addressing the quality of data adequately.

Designed to Fail

"The weakest spot in a good defense is designed to fail."
(Mark Lawrence) 

In IT one can often meet systems designed to fail, the occurrences of errors being just a question of time, kind of a ticking bomb. In such situations, a system is only as good as its weakest link(s). Issues can be traced back to following aspects:
- systems used for what they were not designed to do – typically misusing a tool for a purpose for which another tool would be more appropriate (e.g. using Excel as database, using SSIS for real-time, using a reporting tool for data entry);
- poor performing systems - systems not adequately designed for the tasks supposed to handle (e.g. handling large volume of data/transactions);
- systems not coping with user’s inventiveness or mistakes (e.g. not validating adequately user input or not confirming critical actions like deletion of records);
- systems not configurable (e.g. usage of hardcoded values instead of parameters or configurable values);
- systems for which one of the design presumptions were invalidated by reality (e.g. input data don’t have the expected format, a certain resource always exists);
- systems not being able to handle changes in environment (e.g. changing user settings for language, numeric or data values);
- systems succumbing in their own complexity (e.g. overgeneralization, wrong mix of technologies);
- fault intolerant systems – system not handling adequately more or less unexpected errors or exceptions (e.g. division by zero, handling of nulls, network interruptions, out of memory).

Systemic Errors

Systemic errors can be found at the borders of the "impossible", situations in which the errors defy the common sense. Such errors are not determined by chance but are introduced by an inaccuracy inherent to the system/environment.

A systemic error occurs when a SQL program encounters a deficiency or unexpected condition with a system resource (e.g. a program encountered insufficient space in tempdb to process a large query, database/transaction log running out of space). [1]

Such errors are often difficult but not impossible to reproduce. The difficulty resides primarily in figuring out what happened, what caused the error. Once one found the cause, with a little resourcefulness one can come with an example to reproduce the error.

Conclusion

"To err is human; to try to prevent recurrence of error is science."
(Anon)

When one thinks about it, there are so many ways to fail. In the end to err is human and nobody is exempted from making mistakes, no matter how good or wise. The quest of a (good) programmer is to limit errors’ occurrences, and to correct them early in process, before they start becoming a nightmare.

References:
[1] Transact-SQL Programming: Covers Microsoft SQL Server 6.5 /7.0 and Sybase,  by Kevin Kline, Lee Gould & Andrew Zanevsky, O’Reilly, ISBN 10: 1565924010, 1999

30 November 2010

🧭Business Intelligence: Enterprise Reporting (Part IX: How Many Reporting Systems Do You Need?)

Business Intelligence
Business Intelligence Series

I hope nobody’s questioning the fact that an organization needs at least a reporting system in order to take advantage of the huge amount of data it has collected over time, in order to have an overview on what’s happening in the organization, take better decision supported by data, etc. In one way or another many organizations arrive to have in place two or more reporting systems and all the consequences deriving from it. From experience, I would expect that many professionals have put themselves at least once the above question, often they having to live with their decisions and they are not always happy about them. 

On one side we have the demand, the needs of the various departments and groups existing in an organization, while on the other side of the balance we have the various types of data and the technologies existing on the market (could be regarded as the supply), and the various constraints an organization might have: resources (financial, human, processing power, time), politics and philosophies, geography, complexity, etc. If they don’t seem so many you have to consider that the respective constraints could be further split into a long chain of causality, the books on Project Management, Architecture, Methodologies and other related topics treating them in detail, so no need to go there.

From practical reasons, I’d like to reformulate the above question as follows:
1. Is it enough only one reporting system or do we need more than one?
2. What’s the optimum number of reporting systems existing in an organization?
 
A blunt answer for the first question would be that an organization might need as many reporting systems as it’s needed in order to satisfy the existing reporting needs, in other words the demand. Straightforward answer but not really acceptable for a Manager, in particular, and an IT Professional, in general. Another blunt answer, this time for the second question, would be that an organization needs only “one and good” reporting system, though that’s contradicting the various constraints existing in IT world. So, what should it be? 

Typically there are two views, the bottom up view, in this case starting with the data and building up to the reporting demands, and top down view, starting from the reporting requirements to data. In this play-set the technologies, in general, the reporting systems, in particular, are playing the role of middle point, the two views applying in respect to data vs. reporting technologies, reporting technologies vs. reporting requirements, increasing the complexity of the view. Sometimes is enough to focus only on the reporting requirements, why we consider then also the data in this scenario?! Remember, reporting it’s all about harnessing the value of your data, not only in finding the right reporting solution.

Often, in order to understand the requirements of an organization and understanding the value of data it’s easier to look at the various types of data it stores – the bottom up approach. Typically we could make distinction between master vs. transactional data, raw vs. aggregated data, structured vs. semi-structured vs. non-structured data, cleaned vs. not cleaned data, historical vs. live data, mined, distributed or heterogenous data, to mention just a few of the important ways of categorizing the data.

The respective types of data are important because they require special types of techniques or tools to process, report and harness them. The master and transactional data are at the heart of business applications, the “fluid” that flows and nourishes them, with an incredible potential when harnessed at its real value. Such unaltered data are typically structured at macro level, though they could contain also semi-structured or non-structured chunks of data, could have different levels of quality, could be distributed or heterogenous, etc. The range of data repositories range from text or document systems to specialized databases, any mixture being in theory possible.

So we have a huge volume of data collected over time, how could we harness them in order to reveal their real value? Excepting the operational use, the usefulness of raw data, data found in their unaltered form, stops usually at the point where their complexity falls beyond people’s limits of understanding them. Thus we arrived to aggregate, recombine or extract chunks from the data, bringing the data to a more comprehensible form. We even arrived to extract more information out of the data using specific techniques known as data mining methods, algorithms that allow us to identify associations, cluster or interpolate the data, the complexity of such algorithms evolving in considerable in the past two decades.
 
There are few software applications which provide the whole range of types of data processing, most of them resuming in recombining and presenting the data. Also data presentation offers a considerable range of formats (e.g. tabular, text, XML, Excel, charts and other diagrams) with complex navigational functionality (slice-and-dice, drill-down, drill-through, click-through), different ways of making the data available for consumption (direct-access, cached, web services), different security levels, etc.

When it comes to managers and users they want the data at the level of detail and form that allows the easiest/lowest proximate understanding level, and, “yesterday”, to use a word that reflects the urgency of requirements. This could be done in theory by coming with a whole set of reports covering all possible requirements, though that’s not so efficient as investment and not always possible with a button click, as probably all we’d like to. 

For the ones familiarized to Manufacturing, it’s actually a disguised push vs. pull scenario, pushing the reports to the user without expecting their demands vs. waiting for the user to forward the reporting requirements, from which to derive eventually new reports or make changes to existing ones. In Manufacturing it’s all about finding the right balance between push and demand, and even if the respective field has been found some (best) practices that leads to the middle way, in IT we still have to dig for it. The road seems to lead nowhere… but it helps getting a rough understanding of what a report involves, at least in what concerns the important non-programming stuff.

In the top down approach, as the data are remaining relatively constant, it’s easier to focus on the set of requirements and the reporting tool(s) used. It makes sense to choose the reporting tool that covers at least the most important requirements while for the other you have the choice to use workarounds or address them using two or more reporting solutions, attempting again to cover the most important requirements, and so on. The problem with having more than one reporting solution is that often data, logic and reports arrive to be replicated, the whole reporting infrastructure becoming more difficult to manage. On the other side as long you are having a clear (partitioned) delimitation between the reporting frameworks, logic and data are replicated at minimum, having two or more reporting solutions seems to be an acceptable trade. Examples of such delimitations are for example the OLTP vs. OLAP solutions, to which adds the data mining reporting solution(s).

There are also attempts to extend an existing reporting framework above the initial functionality, though often people arrived to reinvent the wheel or create little monsters they arrive to live with. Also this is an acceptable approach, as long the reporting framework allows that. Some attempts might succeed to provide what they were designed for, others not. There are also some good news from this perspective, the appearance of mash-ups and semantic technologies could in the future allow the integration of reporting systems, making possible things that nowadays are quite challenging. The future is open of unlimited possibilities, but better stick to the present! For that stay tuned for a second post!

Previous Post <<||>> Next Post

27 January 2010

🧭Business Intelligence: Cognos & SQL Server 2008 - The First Report (Test Drive)

Yesterday evening I installed on my computer the IBM Cognos Express (V9) software, the newest Cognos version, available with a trial period of 30 days. The first attempt to install it was not errorless though as I had to observe several attempts later when I got the same error, the error was not fatal. So, once the instalation complete, the Express Manager started, finding out that I have to install manually the Manager, Advisor, Reporter and Xcelerator - quite easy to do.

Now I was ready to test the tool and create several reports, for this needing to create first a data source - a connection to an existing database with several tables on which the reports will be based. SQL Server 2008's AdventureWorks database seemed to be adequate for this test, in plus I know there are many SQL Server professionals already using it for testing or tutorials.

Step 1: Creating the SQL Server account

Reporting tools need in general only read access to data sources and as a good practice to enforce security and address the special requirements the reports are coming with, so it makes sense to create a new User for this. Open Microsoft SQL Server Management Studio and from the root navigate to Security/New/Login…, this action brining Login – New window in which must be filled

a. the Login name (e.g. cognosuser)
b. the authentification – because Cognos doesn’t supports Windows-based authentication, choose SQL Server authentication, enter the Password and confirm it. By default the ‘Enforce password expiration’ is checked, though this option is not needed for the current requirement so unchek it.
c. Select the Default database, the database you’re intending to base your reports on (e.g. AdventureWorks)
d. Select the ‘Default Language’ (e.g. English)
Navigate to ‘Use Mapping’ tab and in ‘User mapped to this login’ choose the databases you would like to give access (e.g. AdventureWorks) and for each database selected you’ll have to select the schemas in scope (e.g. HumanResources, Person, Production, Purchasing and Sales), while in’Database role membership for’ select the db_datareader database level role name, then you can click ‘Ok’.
 
It’s always a good idea to test the new account before attempting to use it from an external application, this allows catching the eventual issues early in the process, saving your and others’ time!

Step 2: Creating a Data Source

Start the IBM Cognos Express Manager and from Data click on ‘Add…’ button, action that will bring the ‘Enter Data Source Information’ in which you’ll have to select the ‘Data Source Type’, in this case Microsoft SQL Server (SQL 2005 Native Client), fill the ‘Name’ (e.g. AdventureWorks), ‘Server Name’ (here comes your SQL Server name), ‘Database Name’ (e.g. AdventureWorks ), ‘User ID’ (e.g. cognosuser) and password (e.g. something you must guess). As can be seen from the below image, there is also a Databae Object control, which will be populated once you entered the before mentioned information and clicked on ‘Retrieve Database Objects’ button. Select one of the database objects (e.g. AdventureWorks ) and click ‘Ok’, Cognos creating the Data (Source) in the background.
 
Step 3: Creating a Report

Instead of using Report Studio, the standard tool for designing reports, for simple reports or queries could be used instead the Query Studio, an easy-to-use authoring tool, the creation of a report resuming to a simple drag and drop. For this you’ll have to open the Query Studio from the Launch item found on the top menu, this action opening a new browser window, in a next step following to select the package (actually the data source) you want to use. In ‘Insert Data’ mode select a table from the list (e.g. Location) and drag and drop it into the right section, all table’s columns being shown together with a sample set of data.

You could double click on each column and give it a meaningful name – is a good practice to separate the words with spaces though not really necessarily. By clicking on report’s Title you could change its actual value and add a Subtitle if necessary. In ‘Change Layout’ you could change report’s layout by applying different Font, Border and Conditional Styles, much like in Excel, show row numbers by clicking the ‘Show row numbers’ from Change Layout/Set Web Page Size, etc. Save the report once you’ve done the changes, the report will look something like the one from the below image:
Step 4: Running the Report

From ‘Run Report’ Menu item you could choose to run the report mainly in three modes – ‘Run with All Data’, ‘Preview with Limited Data’ or ‘Preview with No Data’. Additionally you can export the report to PDF (Portable Document Format), several versions of Excel (2000, 2002 and 2007), CSV (Comma Separated Values) and XML (eXtensible Markup Language).

Happy coding!

13 January 2010

🗄️Data Management: Data Quality Dimensions (Part II: Conformity)

Data Management
Data Management Series

Conformity or format compliance, as named by [1], refers to the extent data are in the expected format, each attribute being associated with a set of metadata like type (e.g. text, numeric, alphanumeric, positive), length, precision, scale, or any other formatting patterns (e.g. phone number, decimal and digit grouping symbols).

Because distinct decimal, digit grouping, negative sign and currency symbols can be used to represent numeric values, same as different date formats could be used alternatively (e.g. dd-mm-yyyy vs. mm-dd-yyyy), the numeric and date data types are highly sensitive to local computer and general applications settings because the same attribute could be stored, processed and represented in different formats. Therefore, it’s preferable to minimize the variations in formatting by applying the same format to all attributes having the same data type and, whenever is possible, the format should not be confusing. 

For example all the dates in a data set or in a set of data sets being object of the same global context (e.g. data migration, reporting) should have the same format, being preferred a format of type dd-mon-yyyy which, ignoring the different values the month could have for different language settings, it lets no space for interpretations (e.g. 01-10-2009 vs. 10-01-2009). There are also situations in which the constraints imposed by the various applications used restraints the flexibility of working adequately with the local computer formats.

If for decimal and dates there are a limited number of possibilities that can be dealt with, for alphanumeric values things change drastically because excepting the format masks that could be used during data entry, the adherence to a format depends entirely on the Users and whether they applied the formatting standards defined. In the absence of standards, Users might come with their own encoding, and even then, they might change it over time. 

The use of different encodings could be also required by the standards within a specific country, organization, or other type of such entity. All these together makes from alphanumeric attributes the most often candidate for data cleaning, and the business rules used can be quite complex, needing to handle each specific case. For example, the VAT code could have different length from country to country, and more than one encoding could be used reflecting the changes in formatting policy.

In what concerns the format, the alphanumeric attributes offer greater flexibility than the decimal and date attributes, and their formatting could be in theory ignored unless they are further parsed by other applications. However, considering that such needs change over time, it’s advisable to standardize the various formats used within an organization and use 'standard' delimiters for formatting the various chunks of data with a particular meaning within an alphanumeric attribute, fact that could reduce considerably the volume of overwork needed in order to cleanse the data for further processing. An encoding could be done without the use of delimiters, e.g. when the length of each chunk of data is the same, though chunk length-based formatting could prove to be limited when the length of a chunk changes.

Note:
Delimiters should be chosen from the characters that will never be used in the actual chunks of data or in the various applications dealing with the respective data. For example pipe (“|”) or semicolon (“;”) could be good candidates for such a delimiter though they are often used as delimiters when exporting the data to text files, therefore it’s better to use a dash (“-”) or even a combinations of characters (e.g. “.-.”) when a dash is not enough, while in some cases even a space or a dot could be used as delimiter.


Written: Jan-2010, Last Reviewed: Mar-2024 

References:
[1] David Loshin (2009) "Master Data Management", Morgan Kaufmann OMG Press. ISBN 978-0-12-374225-4.
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
Koeln, NRW, Germany
IT Professional with more than 25 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.